MolSSI / QCEngine

Quantum chemistry program executor and IO standardizer (QCSchema).
https://molssi.github.io/QCEngine/
BSD 3-Clause "New" or "Revised" License

Psi4 doesn't crash QCEngine on PsiExceptions #250

Closed by PeterKraus 4 years ago

PeterKraus commented 4 years ago

This seems to be a strange issue that happens on some installations of Psi4/qcng and not on others. When running a Psi4 calculation via the QCSchema interface,

ret = qcng.compute(injson, "psi4", return_dict=True,
                   local_options={"memory": args.memory,
                                  "ncores": args.nthreads,
                                  "scratch_directory": "$PSI_SCRATCH"})

and the calculation crashes, e.g. due to a convergence failure,

   @DF-RHF iter  99:   -54.34079555610076    3.48166e-13   1.66606e-11 DIIS
   @DF-RHF iter 100:   -54.34079555610077   -1.42109e-14   1.43652e-11 DIIS

PsiException: Could not converge SCF iterations in 100 iterations.

  Failed to converge.

the qcng wrapper doesn't consume the exception appropriately and remains in a wait loop. I've tracked the issue to:

https://github.com/MolSSI/QCEngine/blob/8ee06c5b4ad74da5a9591baad0a8a53e73cc1974/qcengine/util.py#L479

But I am not sure why the subprocess doesn't terminate on exception.
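For reference, a minimal sketch of the behavior one would normally expect, using only the Python standard library and not QCEngine's own code. The helper name run_child is hypothetical. A child process that dies from an uncaught exception exits with a non-zero code, so a plain wait() returns promptly; a wait() that never returns suggests the child itself is not exiting.

```python
import subprocess
import sys


def run_child(code: str) -> int:
    """Run a Python snippet in a child process and return its exit code.

    Hypothetical helper for illustration; not part of QCEngine or Psi4.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # wait() blocks until the child exits, then returns its exit code.
    return proc.wait()


# The child raises an uncaught exception, so the interpreter exits
# with a non-zero status and wait() returns instead of blocking.
failed = run_child("raise RuntimeError('Could not converge')")

# A child that finishes normally exits with status 0.
succeeded = run_child("pass")

print(failed, succeeded)
```

If the real Psi4 subprocess behaved like the first case, the wrapper would see the failure immediately, which is why a hang points at the child process rather than at wait() itself.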

dotsdl commented 4 years ago

Thank you @PeterKraus!

I've noticed behavior on some of our QCArchive jobs recently that is consistent with this. The QCFractal manager keeps a job running Psi4 via QCEngine in an INCOMPLETE state indefinitely, and local tests give me the impression that Psi4 has died but no error is being reported upward by the Psi4 QCEngine harness.

It looks like the Psi4 harness call doesn't currently take in parameters meant to guard against this, such as timeout and interrupt_after. I'm in favor of adding these as things taken in via the AtomicInput schema. I'm not certain that this will solve the problem, however.

dotsdl commented 4 years ago

@bennybp I think this may be the same problem we have been observing for our QCArchive INCOMPLETEs. I'm putting this on my list to address with a PR.

PeterKraus commented 4 years ago

Actually, this is with psi4-1.4a2.dev523, so the relevant psi4 harness is here:

https://github.com/MolSSI/QCEngine/blob/8ee06c5b4ad74da5a9591baad0a8a53e73cc1974/qcengine/programs/psi4.py#L206-L216

I am not knowledgeable enough about the Python subprocess libraries to understand what happens when proc["proc"].wait() is called on a process that throws an exception, but my guess would be that that's the issue. Manually setting a timeout (i.e. proc["proc"].wait(X)) crashes the job as directed...
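As an illustration of the timeout workaround, here is a minimal standard-library sketch. It assumes only the documented behavior of subprocess.Popen.wait(timeout=...), which raises subprocess.TimeoutExpired when the child is still running; the helper name wait_or_kill and the sleeping child are hypothetical stand-ins, not the QCEngine harness.

```python
import subprocess
import sys


def wait_or_kill(proc: subprocess.Popen, timeout: float):
    """Return the child's exit code, or None if it had to be killed.

    Hypothetical helper for illustration; not part of QCEngine.
    """
    try:
        # Raises subprocess.TimeoutExpired if the child is still running
        # after `timeout` seconds.
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # forcefully stop the hung child
        proc.wait()   # reap it so no zombie process is left behind
        return None


# Stand-in for a hung calculation: a child that sleeps far longer
# than the timeout we are willing to wait.
hung = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
result = wait_or_kill(hung, timeout=0.5)

if result is None:
    print("child did not exit in time and was killed")
```

A timeout like this only bounds how long the caller waits; it does not explain why the child failed to exit in the first place.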

bennybp commented 4 years ago

I think this is related to https://github.com/psi4/psi4/pull/1933

There, psi4 was getting hung up on certain errors (due to memory corruption). New builds of psi4 are available on conda, so see if updating helps!

PeterKraus commented 4 years ago

Yeah, I can confirm this is fixed in psi4-1.4a2.dev723.

dotsdl commented 4 years ago

Awesome, thank you both!