MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License

Workers stop generating data #812

Closed · peastman closed this issue 3 months ago

peastman commented 3 months ago

A couple of times recently, I've had the status for a dataset show that the number of running jobs was decreasing rapidly, but when I checked my workers I found that lots of them were still running. They just weren't generating any more data. When I checked the logs for them, they contained this error message:

Traceback (most recent call last):
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/bin/qcfractal-compute-manager", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcfractalcompute/compute_manager_cli.py", line 62, in main
    manager.start()
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcfractalcompute/compute_manager.py", line 285, in start
    self.scheduler.run(blocking=True)
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcfractalcompute/compute_manager.py", line 274, in scheduler_heartbeat
    self.heartbeat()
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcfractalcompute/compute_manager.py", line 338, in heartbeat
    self.client.heartbeat(
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcportal/manager_client.py", line 119, in heartbeat
    return self._update_on_server(manager_update)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcportal/manager_client.py", line 53, in _update_on_server
    return self.make_request(
           ^^^^^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcportal/client_base.py", line 408, in make_request
    r = self._request(
        ^^^^^^^^^^^^^^
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcportal/client_base.py", line 373, in _request
    raise PortalRequestError(f"Request failed: {details['msg']}", r.status_code, details)
qcportal.client_base.PortalRequestError: Request failed: Cannot update resource stats for manager sherlock-sh03-10n31.int-e8631247-e875-41b3-8a82-d9e51ec9cdbf - is not active (HTTP status 400)
slurmstepd: error: Detected 30 oom_kill events in StepId=44308364.batch. Some of the step tasks have been OOM Killed.

The only way I can get them working again is to cancel all my running jobs and restart them.

peastman commented 3 months ago

And they just did it again, the second time today.

peastman commented 3 months ago

All jobs I try to start are now immediately failing with that error. I can't run any calculations.

bennybp commented 3 months ago

Well, that's not good. I'm not seeing anything particularly concerning on the server side, but it seems like something is happening on your side.

Could you post/send the logfile for one of the failed managers?

peastman commented 3 months ago

Logs are attached.

slurm-44465773.out.txt qcfractal-manager-44465773.log

bennybp commented 3 months ago

Oh, sorry, this is my fault! Two separate processes were handling the manager heartbeats, and one had its heartbeat frequency set incorrectly.

I've shut down the second process. Hopefully things are working now.
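For readers hitting the same "is not active (HTTP 400)" error: the failure mode described above can be sketched as server-side heartbeat bookkeeping. This is a hypothetical illustration, not QCFractal's actual implementation; the class, names, and timing constants are all assumptions. The idea is that the server tracks when each manager last checked in and deactivates managers whose heartbeats look stale. A second reaper process with the wrong frequency (or window) deactivates healthy managers, after which their heartbeats are rejected, just as in the traceback.

```python
import time

# Assumed timing constants (not QCFractal's real values).
HEARTBEAT_FREQUENCY = 30.0   # seconds between expected heartbeats
MAX_MISSED = 5               # missed heartbeats tolerated before deactivation


class ManagerRegistry:
    """Hypothetical server-side registry of compute managers."""

    def __init__(self):
        self._last_seen = {}   # manager name -> timestamp of last heartbeat
        self._active = {}      # manager name -> bool

    def activate(self, name: str) -> None:
        self._active[name] = True
        self._last_seen[name] = time.monotonic()

    def heartbeat(self, name: str) -> None:
        # Mirrors the failure in the traceback: a heartbeat from a manager
        # already marked inactive is rejected outright.
        if not self._active.get(name, False):
            raise ValueError(
                f"Cannot update resource stats for manager {name} - is not active"
            )
        self._last_seen[name] = time.monotonic()

    def reap_stale(self) -> None:
        # A duplicate process running this loop with too aggressive a window
        # would deactivate managers that are actually still heartbeating.
        cutoff = time.monotonic() - HEARTBEAT_FREQUENCY * MAX_MISSED
        for name, seen in self._last_seen.items():
            if seen < cutoff:
                self._active[name] = False
```

Under this sketch, the managers themselves were fine; once the over-eager reaper flipped them to inactive, every subsequent heartbeat raised an error, which matches the symptom of workers staying alive but producing no new results until restarted.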

peastman commented 3 months ago

Thanks! I submitted a new set of jobs. I'll let you know what happens.

peastman commented 3 months ago

Things seem to be running properly again. Thanks!