MobleyLab / FreeSolv

Experimental and calculated small molecule hydration free energies
http://www.escholarship.org/uc/item/6sd403pz

Strip water molecules from all topology/coordinate files in current database #21

Closed by davidlmobley 7 years ago

davidlmobley commented 9 years ago

Because of prior manual curation, not all topology and coordinate files contain water molecules. Additionally, I just learned from Sereina Riniker (e-mail excerpt below) that some of them contain TIP4P-Ew water molecules rather than TIP3P. Again, this is a result of the topology/coordinate files having been gathered manually (in some cases by students). The best long-term solution is to regenerate all topology/coordinate files from the original source data (Issue #20), but an interim solution is simply to strip all water molecules from the existing topology/coordinate files.

Riniker's e-mail said this, in part: "Regarding the [input files] I noticed two things which I thought you might like to know if you do not already. In the most recent version v0.31, I encountered 78 molecules where the GROMACS coordinate file .gro does not contain the solvent coordinates. In addition, there are 23 molecules where the solvent model in the coordinate file is not TIP3P (it contains 4 coordinates per solvent molecule). I attach the list of molecule numbers in case you would like to have a look at them."

The compound ID numbers for setups with TIP4P are: 1323538 1728386 186894 1873346 1875719 1923244 2005792 2049967 20524 2068538 2178600 2972906 3053621 3727287 3738859 4035953 511661 5157661 525934 5449201 8427539 9055303 9979854

And those for setups with no water are: 1034539 1160109 1469079 172879 1893815 1905088 1944394 2126135 2316618 242480 2484519 2492140 2613240 2636578 2659552 2844990 2845466 2850833 2960202 2972345 3040612 3083321 3211679 3265457 3269819 3359593 3515580 3686115 3802803 3976574 4149784 4371692 4479135 4587267 4603202 4613090 4678740 4689084 486214 4936555 5003962 5006685 5282042 5371840 5456566 5510474 5538249 5561855 5616693 5917842 6102880 6190089 6195751 6198745 628951 6359156 667278 6688723 6935906 7239499 7417968 7676709 7913234 8052240 819018 8208692 8311303 8337722 8823527 8827942 8883511 9257453 9510785 9653690 9717937 9741965 9821936 9897248
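The check Riniker describes (no solvent coordinates, or four coordinates per solvent molecule instead of three) can be sketched roughly as follows. This is a minimal illustration, not the script anyone in this thread actually used: it assumes standard fixed-width `.gro` files and assumes that water residues carry one of the names in `WATER_RESNAMES`, which is a guess that should be adjusted to the files at hand.

```python
from collections import Counter

# Assumption: residue names that denote water in these files.
WATER_RESNAMES = {"SOL", "HOH", "WAT", "TIP3", "TIP4"}


def water_sites_per_molecule(gro_text):
    """Count water molecules in a .gro file, keyed by atoms per molecule."""
    lines = gro_text.splitlines()
    n_atoms = int(lines[1])                # line 2 holds the atom count
    atom_lines = lines[2:2 + n_atoms]      # the final line is the box

    sizes = Counter()
    prev_key, count = None, 0
    for line in atom_lines:
        # Columns 1-5: residue number, columns 6-10: residue name.
        key = (line[0:5], line[5:10].strip())
        if key != prev_key:
            if prev_key is not None and prev_key[1] in WATER_RESNAMES:
                sizes[count] += 1
            prev_key, count = key, 0
        count += 1
    if prev_key is not None and prev_key[1] in WATER_RESNAMES:
        sizes[count] += 1
    return sizes


def classify(gro_text):
    """Label a .gro file by the kind of water it contains."""
    sizes = set(water_sites_per_molecule(gro_text))
    if not sizes:
        return "no water"
    if sizes == {3}:
        return "3-site water"
    if sizes == {4}:
        return "4-site water"
    return "mixed or unrecognized"
```

Running `classify` over each `.gro` file in the database should, under these assumptions, reproduce lists like the two above; a four-site result only shows that the model is not TIP3P, not which four-site model it is.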

jchodera commented 9 years ago

Which topology/coordinate files in particular are of interest? The Amber ones?

I might have some time to make progress on this today.

davidlmobley commented 9 years ago

This should be the GROMACS ones, as I always solvated things after converting to GROMACS.

jchodera commented 9 years ago

Is a workflow in which we first solvate in AMBER tleap and then use acpype to convert to GROMACS acceptable, or would that generate undesirable topology files?

Also, if there's already an issue on the preferred way to generate these files, my apologies; feel free to just post a pointer.

davidlmobley commented 9 years ago

I have not validated whether acpype handles box conversions properly. (At one point in the past, it did not.) So normally I just prep the molecule itself in AMBER and then solvate in GROMACS. Do you know whether it handles them correctly now?

(We should create an issue on GitHub to lay out the protocol for re-generating everything from the source data. I'm working on figuring out who in my lab can go ahead and do this, but as noted that's a separate issue - the most immediate solution is just to strip the waters.)

jchodera commented 9 years ago

I don't think we can invest any time in trying to fix up manually curated files with throwaway scripts. If we do put time into this, it has to be to establish automated pipelines that build this from the ground up.

Setting up a workflow that creates unsolvated and solvated AMBER prmtop/inpcrd files and converts them to GROMACS via acpype would be pretty easy if we find this acceptable for now. There are other options too, such as using OpenMM to solvate and write a PDB file and then converting directly to AMBER and GROMACS, but that might be a bit trickier right now. Eventually, these protocols can be reworked to use tools like gaff2xml once the public API is stable.

Info on acpype testing is here: https://code.google.com/p/acpype/wiki/TestingAcpypeAmb2gmx
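The AMBER-then-acpype workflow described above might look roughly like the sketch below. It only builds the tleap input text and the acpype command line; it does not run either tool. The file names, the `_vacuum`/`_solvated` suffixes, the 12 Å padding, and the `leaprc` file names are all assumptions for illustration (the `leaprc` names in particular differ between AmberTools versions), not the protocol the database actually used.

```python
def tleap_input(mol2_path, frcmod_path, prefix, padding=12.0):
    """Return tleap input that writes unsolvated and solvated AMBER files.

    Assumptions: GAFF parameters, a TIP3P box, and leaprc names that
    match the installed AmberTools release.
    """
    return "\n".join([
        "source leaprc.gaff",
        "source leaprc.water.tip3p",      # assumed; version dependent
        f"loadamberparams {frcmod_path}",
        f"mol = loadmol2 {mol2_path}",
        # Save the unsolvated system first, then solvate the same unit.
        f"saveamberparm mol {prefix}_vacuum.prmtop {prefix}_vacuum.inpcrd",
        f"solvatebox mol TIP3PBOX {padding:.1f}",
        f"saveamberparm mol {prefix}_solvated.prmtop {prefix}_solvated.inpcrd",
        "quit",
    ]) + "\n"


def acpype_command(prefix):
    """Return the argv list for converting an AMBER pair to GROMACS."""
    return ["acpype", "-p", f"{prefix}.prmtop", "-x", f"{prefix}.inpcrd"]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(tleap_input("mol.mol2", "mol.frcmod", "mol"))
    print(" ".join(acpype_command("mol_solvated")))
```

Whether acpype converts the box correctly would still need to be validated, as discussed earlier in this thread.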

jchodera commented 9 years ago

See #22

davidlmobley commented 9 years ago

For now, we absolutely ought to be doing the same thing we (in my group) have always done for these, which is to create AMBER files exactly as you describe and convert to GROMACS. If you think you have time to do so today, that's awesome. Otherwise I can put a student on it shortly.

(And, if my student for some reason takes a while to get this done, I'm not ruling out that I will whip out a one-off script to just quickly strip the waters so at least everything is consistent - since effectively that's what Sereina is having to do right now anyway.)
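A one-off water-stripping script of the kind mentioned here could be sketched as below. This is an illustration under stated assumptions, not the script that was eventually used: it assumes fixed-width `.gro` files, assumes water is identified by the names in `WATER_NAMES` (as residue names in the `.gro` file and as molecule names in the topology), and only removes the water entry from `[ molecules ]`, leaving any water `moleculetype` definition or `#include` in the topology untouched.

```python
import re

# Assumption: names that denote water in both .gro and .top files.
WATER_NAMES = {"SOL", "HOH", "WAT"}
SECTION = re.compile(r"^\s*\[\s*(\w+)\s*\]")


def strip_water_gro(gro_text):
    """Return .gro text with water atoms removed and atoms renumbered."""
    lines = gro_text.splitlines()
    n_atoms = int(lines[1])
    atoms = lines[2:2 + n_atoms]
    tail = lines[2 + n_atoms:]             # box vector line
    # Columns 6-10 hold the residue name.
    kept = [a for a in atoms if a[5:10].strip() not in WATER_NAMES]
    # Columns 16-20 hold the atom number; .gro numbering wraps at 100000.
    renumbered = [
        a[:15] + "%5d" % ((i + 1) % 100000) + a[20:]
        for i, a in enumerate(kept)
    ]
    return "\n".join([lines[0], "%5d" % len(kept)] + renumbered + tail) + "\n"


def strip_water_top(top_text):
    """Return .top text with water entries dropped from [ molecules ]."""
    out = []
    section = None
    for line in top_text.splitlines():
        match = SECTION.match(line)
        if match:
            section = match.group(1).lower()
        elif section == "molecules":
            fields = line.split(";")[0].split()   # ignore comments
            if fields and fields[0] in WATER_NAMES:
                continue
        out.append(line)
    return "\n".join(out) + "\n"
```

Applying both functions to each file pair keeps the coordinate file and the `[ molecules ]` counts consistent with each other.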

davidlmobley commented 7 years ago

This was resolved by the full rebuild of the database for version 0.5, in #28.