PeterKraus closed this issue 4 years ago
Thank you @PeterKraus!
I've noticed behavior on some of our QCArchive jobs recently that is consistent with this. The QCFractal manager keeps a job running Psi4 via QCEngine in an INCOMPLETE state indefinitely, and local tests give me the impression that Psi4 has died but no error is being reported upward by the Psi4 QCEngine harness.
It looks like the Psi4 harness call doesn't currently take in parameters meant to guard against this, such as timeout and interrupt_after. I'm in favor of adding these as things taken in via the AtomicInput schema. I'm not certain that this will solve the problem, however.
@bennybp I think this may be the same problem we have been observing for our QCArchive INCOMPLETEs. I'm putting this on my list to address with a PR.
Actually, this is with psi4-1.4a2.dev523, so the relevant psi4 harness is here:
I am not knowledgeable enough about the Python subprocess libraries to understand what happens when proc["proc"].wait() is called on a process that throws an exception, but my guess would be that that's the issue. Manually setting a timeout (i.e. proc["proc"].wait(X)) crashes the job as directed...
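For anyone else unsure about the wait() semantics, here is a minimal sketch using only the standard library. It assumes that proc["proc"] is a subprocess.Popen object, which I have not verified against the QCEngine source; run_with_timeout is a hypothetical helper and the sleeping child is only a stand-in for a hung Psi4 process. Popen.wait() with no argument blocks until the child exits, while wait(timeout=X) raises subprocess.TimeoutExpired and leaves the child running, so the caller still has to kill it.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Hypothetical helper: return the exit code, or None if the child was killed."""
    proc = subprocess.Popen(cmd)
    try:
        # Without a timeout, wait() blocks for as long as the child is alive.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child is still running at this point; terminate it explicitly.
        proc.kill()
        proc.wait()  # reap the killed child so it does not linger as a zombie
        return None


# Stand-in for a hung calculation: a child that sleeps far longer than the timeout.
hung_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
hung_result = run_with_timeout(hung_cmd, timeout=0.5)

# A child that exits normally returns its exit code well within the timeout.
ok_cmd = [sys.executable, "-c", "pass"]
ok_result = run_with_timeout(ok_cmd, timeout=30)
```

In this sketch hung_result is None because the child had to be killed, and ok_result is the normal exit code. A timeout only bounds the wait; it does not explain why the child never exited.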
I think this is related to https://github.com/psi4/psi4/pull/1933
There, psi4 was getting hung up on certain errors (due to a memory corruption). The new builds of psi4 are available on conda so see if that helps!
Yeah, I can confirm this is fixed in psi4-1.4a2.dev723.
Awesome, thank you both!
This seems to be a strange issue that happens on some installations of Psi4/qcng and not on others. When a Psi4 calculation is run via the QCSchema interface and happens to crash (e.g. due to a convergence failure), the qcng wrapper doesn't handle the exception appropriately and remains in a wait loop. I've tracked the issue to:
https://github.com/MolSSI/QCEngine/blob/8ee06c5b4ad74da5a9591baad0a8a53e73cc1974/qcengine/util.py#L479
But I am not sure why the subprocess doesn't terminate on exception.
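As a point of comparison, here is a small standard-library sketch of what normally happens when a child process dies from an uncaught exception. The child below is a plain Python one-liner used as a stand-in, not Psi4, and the error message is made up for illustration. In the ordinary case the child exits with a non-zero code and the parent's wait returns promptly, which suggests a hang like the one described comes from the child never actually exiting rather than from the wait call itself.

```python
import subprocess
import sys

# Stand-in child: raises an uncaught exception, as a failed calculation might.
child_cmd = [sys.executable, "-c", "raise RuntimeError('convergence failure')"]

proc = subprocess.Popen(child_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# communicate() drains both pipes and waits for exit; the timeout is a safety net.
stdout, stderr = proc.communicate(timeout=30)
returncode = proc.returncode

# The traceback arrives on stderr and the exit code is non-zero, so a parent
# that checks returncode can report the failure upward.
failed = returncode != 0
```

If the child instead stays alive after the error, the parent never sees an exit code, and only a timeout or an external kill ends the wait.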