Multiple conformations in a single file?

leeping commented 6 years ago

As a force field developer and quantum chemistry user, I often find myself working with collections of structures (conformations) and associated energies. This could be useful for torsion drives in 1D and 2D, as well as reaction energies / minimum energy paths. Often I am also interested in running the same quantum chemistry method on the whole set of conformations. Thus, I think it would be very helpful if the schema could support this.

dgasmith commented 6 years ago

I believe this is part of the change where the base schema starts with a list and looks like the following:

[
    "spec_version",
    {
      ...input_one
    },
    {
      ...input_two
    },
    ...
]

While I do like this, my main concern is getting the various QM programs to actually execute it. Can @saromleang or @loriab weigh in?

davidlmobley commented 6 years ago

Totally agree with @leeping here. @vtlim in my group also uses this aspect a lot, and @bannanc likely will as well.

dgasmith commented 6 years ago

I think Psi4 could natively support this, but other codes would require calling wrappers, which might as well be moved to other program layers rather than being baked into the spec itself. Can I get other QM devs to weigh in here?

wadejong commented 6 years ago

NWChem has Python in the input deck that can be used to loop over configurations.

loriab commented 6 years ago

I once held the view that whatever a QC input file could support, a job schema should support. I've since retreated to the view that a single schema job should cover what is, quantum-chemically, a single logical unit: a whole SAPT is one job, but a CCSD followed by a CISD is two jobs, even if the SCF is shared between them. That keeps the job spec schema from getting too combinatorial (loop over these molecules, doing these methods, at all these basis sets, and at each of this list of torsion angles). Psi could do that job, but I'm reluctant to see it do-able in the job schema immediately facing a QC program. Better that it be driven by the next layer up in the workflow.

saromleang commented 6 years ago

I don't believe GAMESS is set up to take in a batch of inputs and process them. A lot of work would need to be done within GAMESS to allow this (not saying that there isn't any benefit to it).

leeping commented 6 years ago

A set of configurations with the same atomic symbols / charge / multiplicity / method is a common kind of calculation; it's also a convenient unit of data to include in a database, because the user is likely going to request the entire set rather than just one configuration. That was my starting point for requesting this feature.

I appreciate the concerns of the QM developers. It's a major task to make a quantum chemistry code support this kind of batch processing if it doesn't already. My hope is that requesting a feature in the schema is not equivalent to requesting this feature in all QM codes.

Maybe the "set of multiple conformations" should have a variable name other than "geometry", such as:

json_molecule["geometries"] = [[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 2]]

If "geometries" is provided, then "geometry" should not be provided. That way, the QM codes that support batch processing will loop over the configurations, and those that don't can simply throw an error. What do you think?
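A minimal sketch of that convention, assuming the molecule is a plain Python dict loaded from JSON. The helper names `iter_geometries` and `run_single_only` are hypothetical and not part of any schema or QM code; only the `"geometry"` / `"geometries"` keys come from the proposal above.

```python
def iter_geometries(molecule):
    """Yield each flat [x1, y1, z1, x2, ...] coordinate list from a molecule dict.

    Exactly one of "geometry" (single) or "geometries" (batch) must be present.
    """
    has_single = "geometry" in molecule
    has_many = "geometries" in molecule
    if has_single == has_many:
        raise ValueError("provide exactly one of 'geometry' or 'geometries'")
    if has_single:
        yield molecule["geometry"]
    else:
        yield from molecule["geometries"]


def run_single_only(molecule):
    """Stand-in for a code without batch support: reject "geometries" outright."""
    if "geometries" in molecule:
        raise NotImplementedError("this program does not support multiple geometries")
    return molecule["geometry"]
```

A batch-capable code would loop over `iter_geometries(json_molecule)`, while a code without batch support would behave like `run_single_only` and fail early.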

loriab commented 6 years ago

Can multiple job spec documents differing in geometry serve the same role? It's considerable redundancy for the workflows you're planning, but it's pretty modest compared to the output and cost of QC calculations. So long as the runtime database is indexable by molecule identity, the records should be readily grouped. Conformations can then be associated even if they came from different input files.
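To illustrate the grouping idea, here is a rough sketch that collects separate job documents by a molecule identity key. The field names (`"molecule"`, `"symbols"`, `"charge"`, `"multiplicity"`) and the defaults are assumptions made for the example, and the key used here (atomic symbols, charge, multiplicity) is a simplification: it would not distinguish isomers, so a real database would likely need a stronger identity.

```python
from collections import defaultdict


def molecule_identity(job):
    """Build a hashable key from assumed molecule fields (illustrative only)."""
    mol = job["molecule"]
    return (
        tuple(mol["symbols"]),
        mol.get("charge", 0),        # assumed default: neutral
        mol.get("multiplicity", 1),  # assumed default: singlet
    )


def group_by_molecule(jobs):
    """Group job documents that share the same molecule identity key."""
    groups = defaultdict(list)
    for job in jobs:
        groups[molecule_identity(job)].append(job)
    return dict(groups)
```

With something like this, conformations submitted in different input files still end up in one group, in the order they were supplied.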

leeping commented 6 years ago

I think multiple job spec documents could serve the same role, much as a stack of loose-leaf pages can serve the same role as a book. It's mainly a matter of organization and convenience, and having the technology to bind the book can save a lot of time.

langner commented 6 years ago

I tend to agree with @loriab - it will be much easier to implement a simple schema that covers a single unit of computation. But how do we intend to deal with multiple conformations when they occur in single jobs, for example geometry optimization? Surely the output should be in one file. Perhaps we could extend the design for these types of cases so they would support more generic cases?

leeping commented 6 years ago

I certainly understand @loriab 's concern that a single conformation makes the most sense as a single unit of computation. On the other hand, an array of single-point calculations (sharing atomic symbols / method / charge / multiplicity) is becoming increasingly common and important. There is currently no easy and standard way to manage these arrays, leading to a lot of overhead in doing this research.

I'm mainly asking for this feature as an organizational tool, which would enable us to store the data in one file, have our data-processing programs process a single file instead of looping over multiple ones, store an array of single-point calculations as one entry in a database, and refer to the whole array using one key.

It would also be great if QM codes could support running an array of single-point calculations as a feature, but that's not what I'm directly interested in. I would implement this behavior in the codes I contribute to, but I wouldn't go as far as to request it in every single code. A small script could be provided to split the job array file into multiple single units, or multiple outputs may be combined into one file.
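The split/combine script mentioned above could look roughly like the following. This is only a sketch: the `"molecule"` and `"geometries"` keys follow the earlier suggestion in this thread, while the function names and the `"results"` wrapper key are made up for illustration.

```python
import copy


def split_job_array(job):
    """Split one job holding molecule["geometries"] into single-geometry jobs."""
    molecule = job["molecule"]
    # Everything except the geometry list is shared by all conformations.
    shared = {k: v for k, v in molecule.items() if k != "geometries"}
    rest = {k: v for k, v in job.items() if k != "molecule"}
    singles = []
    for geometry in molecule["geometries"]:
        single = copy.deepcopy(rest)
        single["molecule"] = dict(copy.deepcopy(shared), geometry=list(geometry))
        singles.append(single)
    return singles


def combine_outputs(outputs):
    """Merge per-geometry outputs, in input order, into one document."""
    return {"results": list(outputs)}
```

Each split job carries a plain `"geometry"`, so a code that only understands single units never sees the array form.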

More generally, if we request such "organizational features" in the schema, is it equivalent to requesting the same kind of organization in the QM codes that use it?

vtlim commented 6 years ago

This feature would definitely be useful for me as well -- running the same geometry optimization scheme on a large array of conformations (10s to 1000s of conformations per molecule). If I want to perform additional optimizations or visualizations, that requires me to iterate through the conformations' directories a number of times. Supporting multiple conformations would make it easier for processing and maintenance.

jchodera commented 6 years ago

It would indeed be extremely useful for JSON files intended to specify input for quantum chemical calculations to list a number of configurations on which the same operation is to be performed. If this is difficult for all programs to support natively, could this just be added as a separate Tier of spec support? It would seem like a simple Python driver would be sufficient to then act as a harness for codes that conform to a lower level Tier of the spec.

Another relevant question: If the JSON spec would also provide a way to associate the output with the inputs, how would the mapping from calculation outputs to input configuration be managed? And if several configurations were produced by a single input (e.g., for geometry optimization), we also have to worry about the association between each input configuration and many output configurations (as well as, for example, other associated properties with every output configuration).
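One way such a driver might keep outputs tied to inputs is to tag every record with the index of the configuration that produced it. The sketch below assumes a hypothetical `run_single` callable that wraps a program supporting only the single-configuration level of the spec; the record keys (`"input_index"`, `"output"`, `"success"`, `"error"`) are illustrative, not proposed schema fields.

```python
def run_batch(configurations, run_single):
    """Run run_single on each configuration and record which input produced what.

    run_single is any callable taking one geometry and returning its output,
    which may itself hold several geometries (e.g. an optimization trajectory).
    """
    records = []
    for index, geometry in enumerate(configurations):
        record = {"input_index": index, "input_geometry": geometry}
        try:
            record["output"] = run_single(geometry)
            record["success"] = True
        except Exception as exc:  # keep going; one failure should not lose the rest
            record["success"] = False
            record["error"] = str(exc)
        records.append(record)
    return records
```

Because each output is nested under its own input index, a geometry optimization that emits many configurations still maps back to exactly one input, and a failed configuration is recorded rather than aborting the whole batch.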

dgasmith commented 6 years ago

I think this goes back to what the scope of the schema project really is. Folks I have talked to who have implemented successful schemas have indicated that projects which are narrow in scope have a much (much!) higher chance of being successful. Having this spec as a simple "API" for single QC applications seems complex enough (to me).

A few downsides to implementing "workflows" in the spec:

What happens if the QC program crashes on computation number 49/50?
Putting multiple computations through a single QC program makes parallelizing over these computations harder.
How do we link multiple inputs and outputs and what happens with nested outputs? (as John mentioned)
We create further divisions in the spec between QM programs that support this and those that do not.

I do understand that this ability is very useful; however, I believe it would best take place at a higher workflow level. As John mentioned, could we move this to a higher-level "workflow" schema?

I'm happy to be convinced otherwise here; I mostly worry that increasing the scope and complexity of this project pushes its first version date even further out (if it arrives at all).

davidlmobley commented 6 years ago

I think I'm inclined to agree that, while useful, this should be punted to further down the line or a higher workflow level.

leeping commented 6 years ago

Sounds good - I agree it's best to focus on making something narrow in scope that works. A higher-level workflow schema would be a good way forward.

MolSSI / QCSchema

Multiple conformations in a single file? #34