Closed: peastman closed this 3 months ago
And they just did it again, the second time today.
All jobs I try to start are now immediately failing with that error. I can't run any calculations.
Well that's not good. I'm not seeing anything particularly concerning server side, but it seems like something is happening on your side.
Could you post/send the logfile for one of the failed managers?
Logs are attached.
Oh sorry, this is my fault! It had to do with two separate processes handling the manager heartbeats, where one had the heartbeat frequency set incorrectly.
I've shut down the second process. Hopefully things are working now.
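For anyone curious how a mismatch like this can take managers offline, here is a minimal sketch of the failure mode. It is not the actual server code: `Manager`, `check_heartbeats`, `heartbeat_frequency`, `max_missed`, and all the numbers are hypothetical, and it assumes the incorrect frequency was shorter than the interval the managers really use.

```python
from dataclasses import dataclass


@dataclass
class Manager:
    name: str
    last_heartbeat: float  # timestamp (seconds) of the last heartbeat received
    active: bool = True


def check_heartbeats(managers, now, heartbeat_frequency, max_missed=5):
    """Deactivate managers that have missed more than `max_missed` heartbeats.

    Hypothetical server-side check: a manager is considered dead if its last
    heartbeat is older than heartbeat_frequency * max_missed seconds.
    """
    cutoff = now - heartbeat_frequency * max_missed
    deactivated = []
    for manager in managers:
        if manager.active and manager.last_heartbeat < cutoff:
            manager.active = False
            deactivated.append(manager.name)
    return deactivated


# A manager that sends a heartbeat every 300 s and last reported 120 s ago.
now = 1120.0
healthy = [Manager("worker-1", last_heartbeat=1000.0)]

# Checker configured with the same frequency as the manager: nothing happens.
ok_result = check_heartbeats(healthy, now, heartbeat_frequency=300)

# Second checker with a (hypothetical) wrong frequency of 10 s: the same
# manager now looks long overdue and gets deactivated, even though the
# worker process itself is still running.
bad_result = check_heartbeats(healthy, now, heartbeat_frequency=10)
```

In this sketch the first check leaves the manager alone and the second one deactivates it, which would match the symptom of workers that keep running but stop producing data.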
Thanks! I submitted a new set of jobs. I'll let you know what happens.
Things seem to be running properly again. Thanks!
A couple of times recently, I've had the status for a dataset show that the number of running jobs was decreasing rapidly, but when I checked my workers I found that lots of them were still running. They just weren't generating any more data. When I checked the logs for them, they contained this error message:
The only way I can get them working again is to cancel all my running jobs and restart them.