MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License

All jobs are failing #780

Closed: peastman closed this issue 7 months ago

peastman commented 8 months ago

A few hours ago, all my workers started failing every single task. They report a timeout trying to query the available programs.

Traceback (most recent call last):
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcfractalcompute/run_scripts/qcengine_list.py", line 12, in <module>
    progs = {x: qcengine.get_program(x).get_version() for x in qcengine.list_available_programs()}
  File "/home/users/peastman/miniconda3/envs/qcfractalcompute/lib/python3.11/site-packages/qcfractalcompute/run_scripts/qcengine_list.py", line 12, in <dictcomp>
    progs = {x: qcengine.get_program(x).get_version() for x in qcengine.list_available_programs()}
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/qcengine/programs/psi4.py", line 90, in get_version
    exc["proc"].wait(timeout=30)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/subprocess.py", line 1951, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/bin/psi4', '--version']' timed out after 30 seconds

I've shut down all my workers to keep more tasks from failing. What could be causing this?

peastman commented 8 months ago

That was the error in the log. When I queried one of the records, this is the error it reported.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 565, in error
    return self._get_output(OutputTypeEnum.error)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 498, in _get_output
    history = self.compute_history
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 528, in compute_history
    self._fetch_compute_history()
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 449, in _fetch_compute_history
    self.compute_history_ = self._client.make_request(
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 358, in make_request
    r = self._request(method, endpoint, body=serialized_body, url_params=parsed_url_params)
  File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 304, in _request
    raise ConnectionRefusedError(_connection_error_msg.format(self.address)) from None
ConnectionRefusedError: 

Could not connect to server https://ml.qcarchive.molssi.org/, please check the address and try again.

Is something wrong with the server?

bennybp commented 8 months ago

I am not seeing anything particularly wrong. It is still up and was handling requests for much of the night. Is it still an issue?

peastman commented 8 months ago

It's still failing every task with the timeout error. See record 119089075 for an example.

bennybp commented 8 months ago

Oh I see, I think.

The first error is something going wrong with psi4 starting up. I have seen this before on our clusters, where sometimes the network disk is very slow. If you activate the environment yourself and run psi4 --version, does it take a long time?
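
For reference, a quick way to time the same check qcengine runs, from a login or compute node; the environment name below is taken from the worker paths in the traceback, so substitute your own:

# Worker environment name taken from the traceback above; adjust to your setup.
conda activate qcfractal-worker-psi4-18.1

# Time the same call qcengine's get_version() makes.
time psi4 --version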

I had interpreted the error in your second message (the connection error) as the one you were consistently hitting; that seems to have been a transient issue.

peastman commented 8 months ago

That seems to be the case. Running psi4 --version took 59 seconds to complete.

peastman commented 8 months ago

Is there anything that can be done about it? The line from the stack trace above

  File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/qcengine/programs/psi4.py", line 90, in get_version
    exc["proc"].wait(timeout=30)

shows that the timeout is hardcoded at 30 seconds.

bennybp commented 8 months ago

I suppose you could try changing that timeout value. It's a little hacky, but would be fairly harmless.
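
For anyone trying that, it amounts to a one-line edit of the wait call shown in the traceback. A hacky sketch, assuming the site-packages path from the traceback and that the string timeout=30 appears only on that line:

# Hacky local patch: raise the hardcoded 30-second version-check timeout.
# Path taken from the traceback above; check the matches before editing,
# since this replaces the string "timeout=30" wherever it appears.
PSI4_HARNESS=/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/qcengine/programs/psi4.py
grep -n "timeout=30" "$PSI4_HARNESS"
sed -i 's/timeout=30/timeout=300/' "$PSI4_HARNESS"

Keep in mind the edit is overwritten whenever the environment is rebuilt or qcengine is reinstalled.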

I just tried running psi4 --version on my own cluster, and it took ~25 seconds. I know we are having issues with our network storage being slow, which is magnified when importing Python packages (where every import basically means fetching a file from the network).

It could also be that your cluster is under heavy load, which might be worth raising with your sysadmin.

Another possible workaround just came to mind: in the Slurm script, you could activate the worker environment and run psi4 --version before anything else. This should fetch all the files so they end up cached in memory; on my cluster, it took only ~1 second the second time around. Just make sure you then activate the correct environment when starting the manager (see the sketch below).
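
A rough sketch of what that could look like in the job script, assuming the environment names from the tracebacks above (qcfractal-worker-psi4-18.1 for the psi4 worker, qcfractalcompute for the manager); adapt the names and the manager start command to your setup:

#!/bin/bash
# (your usual #SBATCH directives here)

source ~/miniconda3/etc/profile.d/conda.sh

# Warm the filesystem cache: run psi4 --version once in the worker environment
# so the slow first import is paid up front, not inside qcengine's version check.
conda activate qcfractal-worker-psi4-18.1
psi4 --version

# Switch to the environment the manager runs in and start it as usual.
conda activate qcfractalcompute
# <your existing command to start the compute manager goes here>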

peastman commented 8 months ago

I modified psi4.py to increase the timeout to 300 seconds, and now tasks are succeeding again. Thanks!

How about increasing it in the standard version? Since this problem has been seen on multiple clusters, 30 seconds is clearly too low.

peastman commented 8 months ago

Well, some of them are succeeding. Some are still failing even with a 300-second timeout. :(

loriab commented 8 months ago

The psi4 1.8.2 _1 builds have a patch that shortcuts import psi4 and reports the version string directly for psi4 --version. They're up on conda-forge as of this afternoon. Hopefully this fixes the trouble.

bennybp commented 7 months ago

Have things improved with the newer psi4? I think between that and manually increasing the timeout, this should be fixed.

peastman commented 7 months ago

So far things are working.

bennybp commented 7 months ago

Great! Reopen if you still have issues :)