Re-construct database files from primary data

davidlmobley commented 9 years ago

This is implied by several other issues, but ought to exist as its own issue.

To complete this issue, we must first resolve Issue #14, #13 , and #19.

This issue must be completed before we can resolve Issues #15 and #16 .

Resolving this issue will also provide the best resolution of the issue with water molecules in topology files (# to be inserted here).

kyleabeauchamp commented 9 years ago

I'm not claiming that my code is ready for prime time yet, but I do have a somewhat automated pipeline for building the AMBER files using OpenEye and Antechamber.

https://github.com/choderalab/gbff/blob/master/code/rebuild_freesolv.py

jchodera commented 9 years ago

I think we decided to not use gaff2xml for this purpose until we have a stable public API, so my plan was to ingest the dependencies into a standalone script for FreeSolv until gaff2xml (or some other tool) stabilized and we could reduce complexity by using these external functions.

kyleabeauchamp commented 9 years ago

Agreed. But we may still want to check that the standalone script does the same thing as the latest gaff2xml functions. A lot of the code we have floating around differs in the various steps used to prepare ligands, which leads to different charges. In particular, the setSampleHydrogens.

Another possibilty is for me to roll out a release of gaff2xml. Even if the API is not stable, we can still just specific a hard-coded version number for installation via conda. I'm happy with whatever here.

davidlmobley commented 9 years ago

What's the issue with setSampleHydrogens? Is it an only an issue with ambiguity about protonation state/tautomer state? The compounds here in general are quite straightforward at this point; there are only a handful where there would be any concern at all about tautomer state.

John, what's your preference on the approaches Kyle suggested?

Also, I should point out that Chris Bayly just pointed me to the documentation on how to generate canonical charges using his multiconformer procedure - https://docs.eyesopen.com/toolkits/cookbook/python/modeling/am1-bcc.html. This is what we should be using for charging.

On Sat, Apr 18, 2015 at 10:17 AM, Kyle Beauchamp notifications@github.com wrote:

Agreed. But we may still want to check that the standalone script does the same thing as the latest gaff2xml functions. A lot of the code we have floating around differs in the various steps used to prepare ligands, which leads to different charges. In particular, the setSampleHydrogens.

Another possibilty is for me to roll out a release of gaff2xml. Even if the API is not stable, we can still just specific a hard-coded version number for installation via conda. I'm happy with whatever here.

— Reply to this email directly or view it on GitHub https://github.com/choderalab/FreeSolv/issues/20#issuecomment-94184398.

David Mobley dmobley@gmail.com 949-385-2436

jchodera commented 9 years ago

Christopher Bayly's canonical AM1-BCC recipe, which we have now adopted, uses

    omega.SetSampleHydrogens(True)

which @kyleabeauchamp has found leads to modest charge differences for ethanol due to placement of the proton.

@kyleabeauchamp has also found that

omega.SetRMSThreshold(rms)

seems to behave independently of the specified rms value, and omega.SetRMSThreshold(0) behaves completely differently from omitting a call to this method altogether. We're not yet sure if this is a bug. Using or omitting this line leads to charge differences for ethyl acrylate.

I think we can either use a specific gaff2xml version to regenerate these or ingest the methods that we are using into scripts here---either method will work from a technical standpoint.

@kyleabeauchamp makes a good point about testing, however---gaff2xml currently has an automated testing framework to ensure correct behavior of its methods, while this project does not yet have automated testing framework.

The biggest question might therefore be "What kind of testing do we need to ensure correctness of FreeSolv?". It seems that not much in the way of testing process existed previously (which is why all of these errors have come to light). Perhaps we want to figure out how we want things to be tested first, and that will make the path forward clearer?

davidlmobley commented 9 years ago

What is being done about the omega.SetRMSThreshold issue you mentioned? Have you talked with support & Bayly?

Can you elaborate on how gaff2xml is tested and how that might be extended to FreeSolv?

There is no obvious way that I can tell to check that the topology/coordinate files generated are in fact correct, though we can certainly check a number of side issues:

Do the number of water molecules/water atoms in the topology file and coordinate file match, and match what's expected?
Do the charges in the topology file match the expected charges, within rounding?
Do the atom types in the topology file match those in the AMBER-atom-typed .mol2 file from antechamber?
others?
Do the number of atoms in the topology file match what's expected?

You could also do more complicated testing, such as:

If you turn the resulting topology file back into an OE molecule (which would require writing a parser) do you get the same isomeric SMILES as the original molecule?

On Mon, Apr 20, 2015 at 10:02 AM, John Chodera notifications@github.com wrote:

Christopher Bayly's canonical AM1-BCC recipe http://docs.eyesopen.com/toolkits/cookbook/python/modeling/am1-bcc.html, which we have now adopted, uses
omega.SetSampleHydrogens(True)
which @kyleabeauchamp https://github.com/kyleabeauchamp has found leads to modest charge differences for ethanol due to placement of the proton.

@kyleabeauchamp https://github.com/kyleabeauchamp has also found that

omega.SetRMSThreshold(rms)

seems to behave independently of the specified rms value, and omega.SetRMSThreshold(0) behaves completely differently from omitting a call to this method altogether. We're not yet sure if this is a bug. Using or omitting this line leads to charge differences for ethyl acrylate.

I think we can either use a specific gaff2xml version to regenerate these or ingest the methods that we are using into scripts here---either method will work from a technical standpoint.

@kyleabeauchamp https://github.com/kyleabeauchamp makes a good point about testing, however---gaff2xml currently has an automated testing framework to ensure correct behavior of its methods, while this project does not yet have automated testing framework.

The biggest question might therefore be "What kind of testing do we need to ensure correctness of FreeSolv?". It seems that not much in the way of testing process existed previously (which is why all of these errors have come to light). Perhaps we want to figure out how we want things to be tested first, and that will make the path forward clearer?

— Reply to this email directly or view it on GitHub https://github.com/choderalab/FreeSolv/issues/20#issuecomment-94509994.

David Mobley dmobley@gmail.com 949-385-2436

jchodera commented 9 years ago

What is being done about the omega.SetRMSThreshold issue you mentioned? Have you talked with support & Bayly?

@kyleabeauchamp emailed Christopher Bayly on 11 Apr 2015, and I've just pinged him again today. We can email OpenEye support for clarification if that doesn't work.

Can you elaborate on how gaff2xml is tested and how that might be extended to FreeSolv?

There is a suite of nosetests used to test individual functions and run some datasets through them: https://github.com/choderalab/gaff2xml/tree/master/gaff2xml/tests

There is no obvious way that I can tell to check that the topology/coordinate files generated are in fact correct,

Supposing technical constraints were not a problem, what would the true test of "correctness" be? Would it be identical free energies of hydration in each program (gromacs, AMBER, etc)? Identical potential energies for various conformations?

we can certainly check a number of side issues:

All of these things (many of which caused trouble in the past) would be good to test.

If you turn the resulting topology file back into an OE molecule (which would require writing a parser) do you get the same isomeric SMILES as the original molecule?

@swails has a neat tool called ParmEd that is very much getting close to being able to do a lot of these tests, so it's possible we can utilize it in a testing framework.

davidlmobley commented 9 years ago

There is no obvious way that I can tell to check that the topology/coordinate files generated are in fact correct,

Supposing technical constraints were not a problem, what would the true test of "correctness" be? Would it be identical free energies of hydration in each program (gromacs, AMBER, etc)? Identical potential energies for various conformations?

When I said "correct", what I had in mind was actually a different issue - that the topology in question contains the intended set of force field parameters for the intended molecule with the intended protonation/tautomer state. For the sake of terminology, let's call the issue you're raising "consistent" rather than correct. That is, we would like FreeSolv to contain topology/coordinate files which are both correct (containing the intended parameters for the intended molecule as just described) AND consistent (across file formats). I think "correctness" is probably harder (in theory) to test than "consistent", and I don't know how to check it in general except the specific aspects I already mentioned.

Consistency in principle ought to be able to be verified by identical free energies of hydration and identical potential energies. However, this is really still a research issue (i.e. we have a separate project trying to get reproducibility of hydration free energies across several different codes, and it is not straightforward. This also looks some at potential energies, and a similar thing is true there). Potential energies are certainly the easier of the two and probably ought to be our benchmark here, at least for reference configurations between AMBER and GROMACS files.

David Mobley dmobley@gmail.com 949-385-2436

jchodera commented 9 years ago

I think "correctness" is probably harder (in theory) to test than "consistent", and I don't know how to check it in general except the specific aspects I already mentioned.

Excellent points.

I think it might be practical to check consistency via some potential energy comparisons (especially with tools the OpenMM app and its file readers or ParmEd to help), and correctness via Topology -> OpenEye OEMol -> canonical SMILES, at least for now.

davidlmobley commented 9 years ago

Yes, I think that ought to be the goal.

On Mon, Apr 20, 2015 at 11:14 AM, John Chodera notifications@github.com wrote:

I think "correctness" is probably harder (in theory) to test than "consistent", and I don't know how to check it in general except the specific aspects I already mentioned.

Excellent points.

I think it might be practical to check consistency via some potential energy comparisons (especially with tools the OpenMM app and its file readers or ParmEd to help), and correctness via Topology -> OpenEye OEMol -> canonical SMILES, at least for now.

— Reply to this email directly or view it on GitHub https://github.com/choderalab/FreeSolv/issues/20#issuecomment-94527997.

David Mobley dmobley@gmail.com 949-385-2436

davidlmobley commented 7 years ago

This is basically resolved by #28 but can be re-opened later if there is anything we need to reawaken here. A lot of the energetic consistency stuff has been handled elsewhere via ParmEd and in tests for openmoltools.

I intend to work with Michael Shirts separately to get in input files in other formats such as DESMOND.

MobleyLab / FreeSolv

Re-construct database files from primary data #20