cclib & jumbo - Githubissues

ltalirz commented 2 years ago

I recently came across the parallel "Jumbo converters" in ioChem-BD, which convert quantum chemistry code outputs to CML (Java codebase).

As mentioned here, there is a significant overlap in the codes supported (ADF, Gaussian, Molcas, MOPAC, Orca, Turbomole), so perhaps there might be an opportunity to share some of the work involved in keeping up with the atomistic software space.

At the same time, I realize of course that this is not always straightforward (programming language differences aside) and I notice some previous discussion in https://github.com/cclib/cclib/issues/163#issuecomment-70946410 .

Just intended as "food for thought" - feel free to close.

berquist commented 2 years ago

I remember reading about J-C when first implementing the CML writer. A couple of things:

- I did not realize at the time (2014) that J-C was hierarchical, since that was early in my programming journey and the original repository (https://bitbucket.org/wwmm/jumbo-converters, I still have a copy) did not have documentation. I completely agree that the hierarchical approach is conceptually the best, but setting aside the comments about what's hard with that approach, cclib as it exists today could not switch to it without a complete rewrite.
  - We are working on parsing a "program within a program", as well as considering breaking apart single outputs into more logical chunks (maybe something that is an exploration of the potential energy surface is really multiple data objects, like QCSchema is designed for). This is intermediate between the fully procedural and fully hierarchical approaches and hopefully a good compromise (#657).
  - Similarly, because of how the package has evolved, we are relatively consistent in what is parsed among many packages, but the total scope is relatively small and it is difficult to generalize. A new model is in the works that will allow for much better extensibility (#419).
- Because of the lack of documentation on J-C, I think I just tried mimicking the CML output from a quantum chemistry output converted by Open Babel or Avogadro 1.x. I wouldn't be surprised if our CML output doesn't validate against the schema.
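
To make the "multiple data objects" idea concrete, here is a minimal sketch of splitting one flat parse result into one record per geometry. It is not how cclib or #657 implements this; `SinglePoint` and `split_scan` are hypothetical names, and the inputs are plain Python lists standing in for per-geometry coordinates and energies.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


@dataclass
class SinglePoint:
    """One geometry with its energy (hypothetical record type)."""
    atomnos: List[int]
    coords: List[Coord]
    energy: float


def split_scan(
    atomnos: Sequence[int],
    atomcoords: Sequence[Sequence[Coord]],
    energies: Sequence[float],
) -> List[SinglePoint]:
    """Turn flat per-geometry lists into one record per geometry."""
    if len(atomcoords) != len(energies):
        raise ValueError("expected exactly one energy per geometry")
    return [
        SinglePoint(list(atomnos), list(coords), energy)
        for coords, energy in zip(atomcoords, energies)
    ]


# Two points of a made-up H2 bond-length scan (values are illustrative only).
points = split_scan(
    atomnos=[1, 1],
    atomcoords=[
        [(0.0, 0.0, 0.0), (0.0, 0.0, 0.7)],
        [(0.0, 0.0, 0.0), (0.0, 0.0, 0.8)],
    ],
    energies=[-1.10, -1.12],
)
```

Each record can then be serialized or analyzed on its own, which is the property the flat layout lacks.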

ghutchis commented 2 years ago

I doubt the OB implementation (or the Avogadro2 implementation, for that matter) validates against the CML spec anymore. At one point Peter Murray-Rust worked with the OB code and made sure it validated, but then he started changing the spec so frequently that it was impossible to keep up.

IIRC the Jumbo Converters haven't been touched in years - the official repo is now here: https://github.com/BlueObelisk/jumbo-converters

While CML exists "in the wild" I think it's better to push on a community standard - whether that's QCSchema or something similar outside of MolSSI is worth discussing.

From my perspective, cclib is the de facto standard for parsing comp chem files and other projects should help or leverage it.

berquist commented 2 years ago

(The remainder is my view and opinion, not necessarily that of the cclib project, so separate comment.) The field that cclib lives in suffers from a substantial amount of redundant work, at least in Python. It appears in the work of graduate students who write one-off ad-hoc solutions for parsing their files, and the data they need may or may not be parsed by cclib. In this case, it's a matter of awareness and discoverability: sure, we are Googleable, but we don't do much active outreach. This is probably the group we reach the most and that benefits the most.

Collaborating across the language boundary will be tough. The best we could do at the moment is to produce a Chemical JSON or CML file that the Java process could grab by calling CPython (I have no idea if cclib will work under Jython). My vision is that much cclib functionality is subsumed by a package that compiles to a portable library exposing a C API, so that interfacing across languages becomes reasonable. In a perfect world we would no longer parse output files meant for human eyes as a form of data interchange, but it's not clear when that will stop.
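
As a rough illustration of the CPython route, a minimal sketch might wrap cclib in a small command-line converter that prints Chemical JSON or CML to stdout for another process to read. It assumes cclib is installed and relies on `ccread` and `ccwrite` from `cclib.io`; the names `build_parser`, `convert`, and `main` are made up for this example.

```python
import argparse
import sys


def build_parser():
    """Command-line interface for the hypothetical converter script."""
    parser = argparse.ArgumentParser(
        description="Parse a QC output with cclib and print an interchange format."
    )
    parser.add_argument("logfile", help="path to a quantum chemistry output file")
    parser.add_argument(
        "--format",
        choices=("cjson", "cml"),
        default="cjson",
        help="serialization written to stdout",
    )
    return parser


def convert(logfile, fmt):
    """Return the parsed data serialized as a string in the requested format."""
    # Imported lazily so that --help works even when cclib is not installed.
    from cclib.io import ccread, ccwrite

    data = ccread(logfile)
    if data is None:
        raise ValueError(f"cclib could not identify or parse {logfile!r}")
    # returnstr=True asks ccwrite for the text instead of writing a file.
    return ccwrite(data, outputtype=fmt, returnstr=True)


def main(argv=None):
    """Entry point: parse arguments, convert, and write the result to stdout."""
    args = build_parser().parse_args(argv)
    sys.stdout.write(convert(args.logfile, args.format))
    return 0
```

A Java process could launch a script that calls `main()` (for instance with `ProcessBuilder`) and read the serialized result from its standard output; whether that is practical for ioChem-BD is not something this sketch settles.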

cclib & jumbo #1114