Add CD benchmarks to manuscript

jchodera commented 7 years ago

It looks like a lot of those mol2 files are post-antechamber with GAFF atom types. Since these are non-compliant mol2 files that cannot be used as standard Tripos mol2 files, don't you want to put those in a special directory that isn't named something like mol2/ and put lots of warnings in the README? It would also be great to make sure the Tripos-format mol2 files are included for each of those.

jchodera commented 7 years ago

Also, if those antechamber-format mol2 files are intended for use with a very specific version of gaff.dat, you may want to include that version here.

500+ files is also a lot. Many of these formats (mol2, SDF) support multi-molecule files where all the ligands can be compiled into a single file. That may be helpful here!

nhenriksen commented 7 years ago

The original CB7 and OA files were added by @Janeyin600, so I followed her format for the CDs. I agree that the current mol2 format is not ideal for universal usage. I like the idea of putting them all into a single file.

The GAFF version is mentioned in the README (GAFF v1.7).

davidlmobley commented 7 years ago

Thanks, @nhenriksen ! I'm super swamped right now because of a couple of deadlines early next week, but hopefully can review this in detail next week.

In terms of mol2 files -- I think I agree it would be good if we provided GAFF and "standard" mol2 files of everything. OpenEye tools can do this conversion easily. If you're on board with that, I could add a script for it to the repo and probably handle conversions of everything (or you can do it if you guys have your OE license renewed yet). Thoughts?

nhenriksen commented 7 years ago

@davidlmobley I think I can get this done today.

nhenriksen commented 7 years ago

I've redone all the CD benchmark files. All the mol2 files use SYBYL atom types now. They are all generated from the original bound structures (secondary orientation), which means that all guest coordinates fit into the host coordinates from a single file (that way the aligning procedure that @andrrizzi used is no longer necessary). I've also generated multi-molecule versions for the .mol2, .sdf, and .pdb formats.

Before making commits, I just want to follow-up on some items:

How do we feel about using multi-molecule files instead of the one-guest-per-file approach? On reflection, although it is cleaner for archival purposes, the multi-molecule format might be more annoying for users to parse. I currently am in favor of one-guest-per-file approach, but don't feel super strongly about it.
Do we need copies of the .mol2 files with GAFF atom types? The new files I've generated have SYBYL. If someone really wants GAFF types they could just run the .mol2 files through antechamber.
If you recall, the guests in these benchmarks have effectively two binding orientations that don't really interchange on the simulation timescale. Now that the host and guest .mol2 coordinates all represent the secondary (wider end) conformation, it might bias a users expectation about the orientation. I plan to highlight this in the README.
I've placed CONECT lines in the host PDB files. It's necessary to get some viewers to connect the first and last residues into a ring. Let me know if there are any foreseeable issues with CONECT lines.

davidlmobley commented 7 years ago

@nhenriksen

How do we feel about using multi-molecule files instead of the one-guest-per-file approach? On reflection, although it is cleaner for archival purposes, the multi-molecule format might be more annoying for users to parse. I currently am in favor of one-guest-per-file approach, but don't feel super strongly about it.

I think @jchodera has a bit of an aversion to "lots of files in the repo", but I agree with you that for most "normal simulation" users, multi-molecule mol2 files are not useful -- they basically mean you either need to rely on a chemistry toolkit for parsing them (OpenEye software, RDKit, etc.) or write your own tool to separate them out into individual files. Thus, my preference is actually to have the individual files, OR provide a tool which can break them into these, otherwise we're just asking other people to do the work so that our repo can stay tidy. Sorry, I should have answered that aspect before.

Do we need copies of the .mol2 files with GAFF atom types? The new files I've generated have SYBYL. If someone really wants GAFF types they could just run the .mol2 files through antechamber.

I don't think we need GAFF types, no.

If you recall, the guests in these benchmarks have effectively two binding orientations that don't really interchange on the simulation timescale. Now that the host and guest .mol2 coordinates all represent the secondary (wider end) conformation, it might bias a users expectation about the orientation. I plan to highlight this in the README.

Sounds great.

I've placed CONECT lines in the host PDB files. It's necessary to get some viewers to connect the first and last residues into a ring. Let me know if there are any foreseeable issues with CONECT lines.

They should be there! So I'm all for having them. :) I don't think our tools will correctly process the PDB files without them.

nhenriksen commented 7 years ago

I've updated the input files and READMEs for the host-guest benchmark files. (For some reason I screwed up the commit process and it did it twice ... not sure how that was possible.) Updates include:

New .mol2, .pdb, and .sdf files for the CD benchmarks. These are now in a proper bound state conformation to a single host molecule. I've added header/remark lines in these files identifying them.
Updated the README files in cd-set1 and cd-set2 to include host and guest names along with a spot for reference computational studies. There are few more papers I can add besides mine, but will do in the future.
Changed the .sd extension to be .sdf as I think that is more standard ...?

Things to be addressed in the future:

The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works.
The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host. They are close, but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.
The BRD4 benchmark .sd files do not appear to be valid .mol or .sdf format. I'm actually not sure who uses this format, but at least Chimera won't open it. These guidelines seem to indicate that something bad happened when Germano converted with Babel. But I'm not familiar enough to really know.

davidlmobley commented 7 years ago

@GHeinzelmann - can you see Niel's note above? Specifically:

The BRD4 benchmark .sd files do not appear to be valid .mol or .sdf format. I'm actually not sure who uses this format, but at least Chimera won't open it. These guidelines seem to indicate that something bad happened when Germano converted with Babel. But I'm not familiar enough to really know.

@nhenriksen - thanks for all the work. I'll review as soon as I can. At the moment I'm slightly swamped dealing with SAMPL so I'm not going to get to it as soon as I hoped to (I'd hoped to have reviewed it already!). It looks like you're being quite thorough though.

I can create issues for these:

Things to be addressed in the future:

The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works.

The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host. They are close, but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.

I think I can fix the second of these, at least for the GDCC case, by docking into the host using the code I just figured out. Alternatively your strategy of extracting from the equilibrated files should work. Do you have thoughts on which is preferable? @Janeyin600 might also want to weigh in?

nhenriksen commented 7 years ago

@davidlmobley Whenever you get to reviewing is fine, no urgency on my part.

If you want to get the CB7/GDCC guests docked with your code, that would be great!

GHeinzelmann commented 7 years ago

Hi all. About the sd files, I have never used them either, and they did not open on VMD or chimera when I tried it here. The only reason I included them is that this format is present in the guest-host input files (in the SDfiles folder, which I renamed sd to follow Niel's changes in his branch).

I just saw that Niel has changed them to .sdf format, so I'll do the same. Best,

Germano

nhenriksen commented 7 years ago

@GHeinzelmann I used antechamber to make the .sdf for the CDs, but I just tried on your protein .mol2 and it created something that couldn't be read by Chimera. I don't think that format is used by very many people, but I do remember someone used it for SAMPL5 so I think that's why @Janeyin600 included them.

GHeinzelmann commented 7 years ago

OK so I checked a few things and found the problems and the solutions for this part (Niel's comments and responses below):

- The BRD4 benchmark .sd files do not appear to be valid .mol or .sdf format. I'm actually not sure who uses this format, but at least Chimera won't open it. These guidelines seem to indicate that something bad happened when Germano converted with Babel. But I'm not familiar enough to really know.

All the ligand files are in all in a valid .sdf format, and they were converted correctly. The only problem is the file names, since chimera does not recognize the extension .sd automatically. Once the file names are changed to .sdf instead of .sd, chimera will open them normally.

- I used antechamber to make the .sdf for the CDs, but I just tried on your protein .mol2 and it created something that couldn't be read by Chimera. I don't think that format is used by very many people, but I do remember someone used it for SAMPL5 so I think that's why @Janeyin600 included them.

The protein .mol2 file is OK, but the sdf format converted by openbabel (from the .mol2 file) really does not open on chimera, and it looks different from the ligand ones. So the file that has to be fixed is the 4lyi.sd one, which I can take care of. I'll do the changes in my branch and open a pull request. Best,

Germano

davidlmobley commented 7 years ago

So, @nhenriksen and @GHeinzelmann - I wonder if a more general solution to this problem would be to just designate one particular set of files as "source files" and provide a script that will re-generate all of the other files from these source files. For example, we could designate .mol2 files as source files and then I could provide a script that will re-generate all of the .sdf files from these using the OpenEye toolkits (or the other way around). Thoughts? That would avoid the current problem where we seem to have heterogeneous sdf files manually created from various different sources. :)

I could probably also make the same tool automatically make 2D depictions of everything just to make it easy to make sure all the molecules are what they are supposed to be. :)

In terms of SDF vs mol2: Both formats have pros and cons. SDF is nice because it allows custom, XML-like tags so they can include lots of additional information such as info on the origin of the files, etc. But mol2 is nice because it (a) includes bond order, and (b) can include partial charge, neither of which are in SDF unless they are embedded in custom tags.

nhenriksen commented 7 years ago

An automated approach would be great!

GHeinzelmann commented 7 years ago

I think that is a good idea, and also have the same procedure for the preparation of files in all cases. For BRD4 all the ligand .mol2 files were generated by antechamber (with partial charges), but I'll also do that for the protein file 4lyi.mol2, which was previously prepared using chimera. Perhaps we could use those as a starting point and then find provide scripts to convert them to .sdf and pdb? Could also be the other way around, starting with the .sdf ones.

GHeinzelmann commented 7 years ago

Quick update on generating apo BRD4 structure mol2 file. It seems that antechamber does not support molecules with more than 10 residues for generating charges/mol2 files, so although it works for the CD hosts it will not work in the case of the apo BRD4. The option in that case is to use chimera, which is what I did before, or another program such as OpenEye.

davidlmobley commented 7 years ago

@GHeinzelmann - can we spin this off to a separate issue if it needs further discussion? I don't actually think it's going to be useful to have a mol2 file of the protein (I think they are useful primarily for the case of non-polymers) and certainly no one is going to want to compute partial charges for it using Antechamber (the charge calculation won't converge).

I apologize that my language above was unclear. I was really talking about having a script that uses source files (mol2 files) for all "small molecules" here, which would include guests. But in my view we should just keep proteins as is, in PDB only.

GHeinzelmann commented 7 years ago

OK, so we can leave the apo protein files (4lyi) only in the prmtop-coords and pdb folders, and remove it from the .sd and .mol2 ones, leaving only the ligands for these two last folders. The rest is already OK, except for the folder name and file extensions in the case of the sd files, which should be changed to sdf instead of sd (like @nhenriksen has done). Once the extension name is changed from .sd to .sdf, chimera is able to read them automatically, since they are already in sdf format (converted using babel).

I see that this pull request is from Niel's branch, so what is the best way for me to include these changes, should I open a new one? It is very quick to do, just not sure on the best way to proceed in that regard.

davidlmobley commented 7 years ago

Yes, @GHeinzelmann - please open a new PR. Think of a PR as a "logical unit"; this PR is about incorporating Niel's cyclodextrin changes into the paper and the repo, so if you're changing something different, it should be in a different PR, from a different branch.

GHeinzelmann commented 7 years ago

OK, I'll open the pull request, make the changes and check if everything looks OK now.

nhenriksen commented 7 years ago

@GHeinzelmann noticed an issue with SYBYL atom types for some of his ligands, which also happens to be an issue for my guests. It appears that cpptraj is getting the bond and atom types wrong for carboxylates and some amides. But given the limitations of SYBYL types, I'm curious if anyone has input on how bond order and atom type should be assigned for carboxylates. To illustrate the issue:

The problem arises because we basically think of the carboxylate oxygens in a resonance hybrid bond ... but if we give them both a bond order of 2, then the central carbon has 5 bonds to it. I think the antechamber SYBYL approach is correct, but it results in only 3 bonds to the central carbon. That doesn't really cause me any problems, but I'm curious if that is a problem for openeye stuff, or if there's any other comments about it.

davidlmobley commented 7 years ago

Can you give me a tarball with files having all of these (or whatever subset you have) and I'll process them with OEChem and see what happens? Superficially I think the top two are OK and the bottom three are problematic, but it's POSSIBLE that OpenEye would rectify this somehow. (Or if they are already in the repo, just tell me which files to use...)

I likely wont' get to look at this for a bit though as I'm still dealing with those deadlines.

nhenriksen commented 7 years ago

I don't have files for the first two, they're just intended to represent the conventional 2D structures that we might see in a paper.

But here are the other three examples. carboxylates.tar.gz

jchodera commented 7 years ago

Where did the molecules originally come from before they were converted to GAFF?

In the worst-case scenario, we could simply regenerate them from canonical isomeric explicit-hydrogen SMILES strings, which should uniquely represent the intended molecules and their protonation states.

nhenriksen commented 7 years ago

They originally were just PDBs that I generated from SMILES strings. Then I converted to .mol2 with GAFF types using antechamber. I'm pretty sure the "antechamber SYBYL" version above is correct (and cpptraj is getting it wrong) ... so I wanted to point out that resonance hybrid type bonds could occasionally be tricky for transferring between force fields/chemistry toolkits.

GHeinzelmann commented 7 years ago

In the BRD4 case they were taken from the crystal structures, except Compound 8a which does not have the bound complex so I downloaded the .sdf from PubChem. They were then converted to .mol2 using GAFF.

GHeinzelmann commented 7 years ago

The protonation states were obtained according to the experimental papers, in my case I used openbabel since it agreed with the paper structures.

davidlmobley commented 7 years ago

Also, I forgot this comment:

Your README.md files in the cd-set1 and cd-set2 directories are beautiful and sure to be very helpful. Do you want to also provide isomeric SMILES in these? I'm OK if not, but I'm thinking that will increase the likelihood that people will actually do something with them (makes it easier to generate input files)

davidlmobley commented 7 years ago

BTW I think we are almost ready to merge, so once you get these edits in (plus my PR) we should go ahead and send the LaTeX/PDF to Mike so he can edit.

@GHeinzelmann - please also review. :)

nhenriksen commented 7 years ago

A few clarifications ...

Changes suggested:

The updated Figure 1 removes TEMOA and adds beta-CD, though the benchmark includes both beta-CD and gamma-CD. I think my preference would be to just expand the figure (making it two rows perhaps) to keep TEMOA and add beta and gamma-CD

The figure is already 3 rows due to the various views and takes up slightly more than half the page. My thinking was that due to the similarity of OA/TEMOA and α-CD/β-CD, we could just use a single representative. But assuming we do want to display all 5 hosts, shall I aim for a full page figure then?

I'm thinking it would be convenient for Fig. 1 to say which is the "wider end" vs the "narrower end"; I didn't notice they were asymmetric at first from just the text.

Good idea.

"The behavior of phosphate ions, particularly the HPO_4^{2-} species, can be sensitive to the particular choice of water model..." seems to need a reference.

I don't have a reference for that because it's unpublished work. We could add an "(unpublished)" to it?

The SMILES for the aromatic guests use different kekulization than those in the other tables; e.g. benzoic acid shows up in Table IV and Table VI with different SMILES. If you guys have the OpenEye tools again I can easily generate SMILES for you, or you can do it, but I'd prefer to treat them consistently. (To say it another way: The aromatic rings get "aromatic carbon" (small c1ccc(cc1)) notation in all my SMILES strings, whereas yours use alternating singles and doubles (C1=CC=C(C=C1)) notation.

Good point, I'll redo those.

Questions:

"Despite the flexibility of CD hosts, the small size of the guests combined with long timescale simulations enable by GPUs should allow adequate convergence of these thermodynamic values." Would it be accurate to say, "appears to allow adequate convergence..." and give a reference to your study? This would be a little stronger of a statement.

Sure, that works.

"Since the cavity is both narrow and hydrophobic, interchange between the two orientations is usually slow and requires release into the solvent and subsequent rebinding"; can you say anything about the timescales for this from your experience or prior work, e.g. "These events may take as long as XYZ microseconds for some guests" or similar?

It's hard to say, since I haven't done any long time-scale on/off type simulations. For strong binders, I don't see any unbinding on the microsecond timescale, so interchange would definitely exceed that due to all the additional time it takes to diffuse around the box once it does successfully unbind. I could say something like "Sampling these events is expected to require microsecond-scale simulations or longer."

Your tables use "mol/kg" units for buffer. I realize this is probably what the experiments were reported with, but it's inconvenient for computational folks and could introduce a source of error in setup (if everyone is converting). Maybe we want to put "(approximately XYZ mM)" in parens to help people out?

Actually I think molality is nicer to deal with than molarity, since it won't depend on volume/temp./pressure. But I agree that it is less common. At lower concentrations, it will have essentially the same molarity as molality since 1 kg of the solution will be about 1 Liter. I couldn't find any exact data for density of 50 mM sodium phosphate, but I did find it for potassium phosphate, which shows that it is very close to 1 g/cm^3. So yes, I think we could say "(approximately 50 mM)" in the table footnotes.

Your README.md files in the cd-set1 and cd-set2 directories are beautiful and sure to be very helpful. Do you want to also provide isomeric SMILES in these? I'm OK if not, but I'm thinking that will increase the likelihood that people will actually do something with them (makes it easier to generate input files)

Yes, I can add them ... although the table might get squeezed for space. I'll try it out.

davidlmobley commented 7 years ago

@nhenriksen

The figure is already 3 rows due to the various views and takes up slightly more than half the page. My thinking was that due to the similarity of OA/TEMOA and α-CD/β-CD, we could just use a single representative. But assuming we do want to display all 5 hosts, shall I aim for a full page figure then?

True, it was getting back. But, we're electronic-only now and don't have to worry about page limits, so, why not full-page? Or we can split it into three separate figures, one for each category of host, if having a full-page figure bothers you. :) I'd rather be able to see all three of them, myself -- otherwise first thing I'll do wonder what the other ones look like and wish for a figure. Ha.

I don't have a reference for that because it's unpublished work. We could add an "(unpublished)" to it?

Sure, cite yourself...? I think we just need to say how we know.

I could say something like "Sampling these events is expected to require microsecond-scale simulations or longer."

That sounds good to me; you could even do better and say "... because previous work has not observed binding and unbinding events in microsecond-length simulations [ref your paper]."

Otherwise, all sounds good.

nhenriksen commented 7 years ago

@davidlmobley I've addressed your requested changes in the latest commit. I had to shrink the images in Fig 1 to get them to fit. The figure might work better on a landscape page rather than portrait, but that might be more trouble than it's worth to do. All the .png's are in the figure_source_files directory if you have ideas for different formats.

davidlmobley commented 7 years ago

@nhenriksen - Thanks! Hopefully I'll review late today. I'll go ahead and send TeX/PDF to Mike, CC you.

davidlmobley commented 7 years ago

Also, random comments on your last commit:

The Zhang 2016 paper has a bunch more of the same guests, but they were all treated as neutral, whereas the experiment had the guests charged. I've omitted them to avoid confusion.

Gosh, that's a huge issue. Yes, definitely want to avoid confusion.

Also, I just realized that names like "4-methylphenylacetate" may have caused confusion in the literature. Rekharsky 1997 used that name to refer to the charged version of 4-methylphenylacetic acid, but both Wickstrom 2013 or Zhang 2016 interpreted it as the structure that is more commonly known as "p-tolyl acetate" which is totally different. This occurred for several different related compounds. I've left those off the list as well, because clearly the numbers won't be comparable.

This is part of why I don't like to use compound names as identifiers ever -- using a compound name is begging a human to interpret it (partly because automated name parsers sometimes make mistakes, and/or can't interpret names that humans can), and once you get a human involved they will misinterpret things. Hence, SMILES and/or PubChem compound IDs...

I'm starting to work with some people on some best practices documents for MD simulations. I should probably mention the "please don't use compound names" thing in those. :)

MobleyLab / benchmarksets

Add CD benchmarks to manuscript #47