Closed by peastman 7 months ago
A few hours ago, all my workers started failing every single task. They report a timeout trying to query the available programs.
I've shut down all my workers to keep more tasks from failing. What could be causing this?
That was the error in the log. When I queried one of the records, this is the error it reported.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 565, in error
return self._get_output(OutputTypeEnum.error)
File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 498, in _get_output
history = self.compute_history
File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 528, in compute_history
self._fetch_compute_history()
File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/record_models.py", line 449, in _fetch_compute_history
self.compute_history_ = self._client.make_request(
File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 358, in make_request
r = self._request(method, endpoint, body=serialized_body, url_params=parsed_url_params)
File "/Users/peastman/miniconda3/envs/qcportal/lib/python3.9/site-packages/qcportal/client_base.py", line 304, in _request
raise ConnectionRefusedError(_connection_error_msg.format(self.address)) from None
ConnectionRefusedError:
Could not connect to server https://ml.qcarchive.molssi.org/, please check the address and try again.
Is something wrong with the server?
I am not seeing anything particularly wrong. It is still up and was handling requests for much of the night. Is it still an issue?
It's still failing every task with the timeout error. See record 119089075 for an example.
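For reference, this is roughly how such a record can be pulled up to inspect the error, a minimal sketch assuming the current qcportal PortalClient API:
from qcportal import PortalClient
# Connect to the server mentioned above and fetch the example record.
client = PortalClient("https://ml.qcarchive.molssi.org")
record = client.get_records(119089075)
# Accessing .error triggers the compute-history fetch seen in the traceback above.
print(record.status)
print(record.error)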
Oh I see, I think.
The first error is something going wrong with psi4 starting up. I have seen this before on our clusters, where sometimes the network disk is very slow. If you activate the environment yourself and run psi4 --version, does it take a long time?
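A quick way to check is to time that call directly, for example (a sketch, assuming psi4 is on the PATH in the activated environment):
import subprocess
import time
# Time the same command the worker runs when it queries the available programs.
start = time.perf_counter()
subprocess.run(["psi4", "--version"], check=True)
print(f"psi4 --version took {time.perf_counter() - start:.1f} s")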
I interpreted your second message to be what you were consistently hitting. That seems to have been a transient issue.
That seems to be the case. It took 59 seconds to complete.
Is there anything that can be done about it? The line from the stack trace above
File "/home/users/peastman/miniconda3/envs/qcfractal-worker-psi4-18.1/lib/python3.10/site-packages/qcengine/programs/psi4.py", line 90, in get_version
exc["proc"].wait(timeout=30)
shows that the timeout is hardcoded at 30 seconds.
I suppose you could try changing that timeout value. It's a little hacky, but would be fairly harmless.
I just tried it on my own cluster, and it took ~25 seconds. I know we are having issues with our network storage being slow, which is magnified when importing Python packages (where every import basically means fetching a file from the network).
It could also be that your cluster is under heavy load, which might be worth raising with your sysadmin.
Another possible workaround just came to mind: in the Slurm script, you could activate the environment and run psi4 --version. This should fetch all the files and then they should be cached in memory. On my cluster, it took only ~1 second the 2nd time around. Just make sure you then activate the correct environment when starting the manager.
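The effect of the warm-up can be seen by timing the command twice in a row; the second run should be much faster once the files are in the cache (a sketch, again assuming psi4 is on the PATH):
import subprocess
import time
# Run twice: the first call pulls files over the network disk, the second should
# hit the cache and return in about a second.
for attempt in (1, 2):
    start = time.perf_counter()
    subprocess.run(["psi4", "--version"], check=True, capture_output=True)
    print(f"attempt {attempt}: {time.perf_counter() - start:.1f} s")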
I modified psi4.py to increase the timeout to 300 seconds, and now tasks are succeeding again. Thanks!
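For anyone hitting the same thing, the edit amounts to changing the timeout argument on the line quoted above in the installed qcengine/programs/psi4.py, i.e. something like:
exc["proc"].wait(timeout=300)  # was timeout=30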
How about increasing it in the standard version? Since this problem has been seen on multiple clusters, 30 seconds is clearly too low.
Well, some of them are succeeding. Some are still failing even with a 300 second timeout. :(
The psi4 1.8.2_1 builds have a patch that shortcuts import psi4 and reports the version string directly for psi4 --version. They're up on conda-forge as of this afternoon. Hopefully this fixes the trouble.
Have things improved with the newer psi4? I think between that and manually increasing the timeout, this should be fixed.
So far things are working.
Great! Reopen if you still have issues :)