alchemistry / fileformat

File formats for free energy calculations, molecular simulations, etc.
Other
2 stars 2 forks source link

Requirements: Chemoinformatics #9

Open avirshup opened 8 years ago

avirshup commented 8 years ago

Lightly edited discussion, migrated from slack:

@avirshup:

[...] we're trying to limit the scope to 3D geometry-based computational chemistry. So we'd like this file format to support MD, docking, quantum chemistry, etc. But we don't want to replicate PDB or mmCIF files (they store a lot of experimental crystallographic data that's not computationally relevant); nor do we want to get too far into chemoinformatics (which really requires a robust graph theory foundation, like OEChem has)

@davidlmobley:

[...] can you expand a little on what you mean? I have a little bit of paranoia about what happens if one does NOT think of molecules as connected graphs (as is done in PDB files) because this tends to lead to an underemphasis on connectivity, bond information, etc. that tends to cause downstream problems in small molecule simulations. One way I could interpret what you're saying is that you're focused on that same type of info (where bonds are basically incidental and don't get much attention), though another way I could interpret is that you're not that focused on enabling cheminformatics but you're still going to pay considerable attention to bonds, etc.

@jchodera:

[...] I think it is fine to rely on toolkits like openeye and RDKit for those capabilities, but essential that the "topology" data structure have enough information to initialize molecular objects in these toolkits. That's why we need a minimum amount of information about elements, connectivity (if known), and bond orders (and possibly net formal charge?) in order to instantiate molecule representations in those toolkits.

@avirshup:

[...] I 100% agree that you absolutely want to store bond information and have a complete, connected chemical graph, unlike PDB files. As a design principle for MDT, however, I'm trying to make sure we don't end up replicating RDKit or OpenBabel [...] And @john.chodera, yes, definitely: we should be storing enough information to build an RDKit or OEChem object [...] From experience with chemoinformatics, though, we want to avoid having to encode anything like hybridization assignment, or Kekulé structures, or SSSR information, or (especially) implicit hydrogens; results for all of these can vary HUGELY between different informatics programs

@jchodera:

I think you want the option of having extra information like that added without breaking things (eg if we want to include a specific representation of an OpenEye molecule) but we want to just store the requisite minimum in general, which requires a broadly compatible notion of bond order. Other tool providers may want to handle the JSON representation of their own internal data structures.

@avirshup

Yeah, definitely agree that someone should be able to store implicit hydrogen or SSSR information in the file if they want. But I would argue that, because implementations and interpretations vary so greatly, you would not want it to be part of a base standard

avirshup commented 8 years ago

This may be as simple as needing to store:

And having a convention about implicit hydrogens (my preference: no implicit hydrogens allowed; missing hydrogens are fine)

egonw commented 8 years ago

Strong suggestion: use explicit hydrogens or explicit atom types which define explicit hydrogen counts. For bond orders, make a clear choice, e.g. only integer (formal) bond orders. Also, don't expect any tool to know what you mean with "aromatic" (they don't and you will loose interoperability when you start using this term).

avirshup commented 8 years ago

@egonw - I basically 100% agree with all of this.

To do due diligence, though: besides aromaticity (which we can deal with via Kekulization), has anyone run across any other use cases for fractional bond orders?

avirshup commented 8 years ago

ParmEd, for instance, uses 1.5 for aromatic and 1.25 for amide. Not sure how this works in actual use cases though:

In [5]: parmed.Bond?
Init signature: parmed.Bond(self, atom1, atom2, type=None, order=1.0)
Docstring:     
[...]
order : float, optional
    The bond order of this bond. Bonds are classified as follows:
        1.0 -- single bond
        2.0 -- double bond
        3.0 -- triple bond
        1.5 -- aromatic bond
        1.25 -- amide bond
    Default is 1.0
davidlmobley commented 8 years ago

I'll get back to the bigger threads here soon, but the bond order thing piqued my interest and connects with something we've been working on, @avirshup . In our new Open Forcefield initiative SMIRFF forcefield format, we just implement support for partial bond orders such as Wiberg bond orders. These are floats; our interest in them is that they allow the potential for interpolation between single/aromatic/double depending on conjugation, etc. (It's worth noting that the AMBER parameters for, say, a carbon-carbon aromatic bond in benzene fall linearly in between the parameters for a single and a double bond so interpolating based on a partial bond order of 1.5 give you essentially exactly the "right" parameters for that specific case.

There's a lot of basic science to be done on this, but the idea of allowing for this is that it allows the surrounding chemistry (which impacts the partial bond order) to potentially impact the bond or torsional parameters that are assigned in a sensible way without having explicitly specify parameters for every bond order which might be of interest -- you just hit the "normal" ones and let the fractional bond order do the rest of the work via interpolation.

davidlmobley commented 8 years ago

So, um, I'd advocate for supporting fractional bond orders, honestly -- if for no other reason than that we have major plans involving fractional bond orders. :)

avirshup commented 8 years ago

@davidlmobley - that sounds really cool!

Could this all be addressed with explicitly alchemical data structures? These would be objects that break all sorts of normal chemical rules - allowing for non-integer atomic numbers, non-integer bond orders, atom types that are linear combinations of other atom types, etc.

EDIT: sorry, got hung up on the "interpolation" thing, when this is actually much more interesting than that - see below

davidlmobley commented 8 years ago

@avirshup -- this is actually a separate thing from free energy calculations entirely. The point is that, in "real chemistry" due to conjugation, etc., bond orders (i.e. Wiberg bond orders) often end up being non-integer due to electrons getting pulled here and there. That is, bonds don't end up being strictly single or double or whatever.

This is related to the issue where nitrogen with three connections can be tetrahedral or planar depending on what's connected to it, or even in between tetrahedral and planar with the right environment. Existing forcefields basically just have to bin everything into "these atom types are tetrahedral" and "these are planar". The though we're operating on here is that building a bit more chemistry -- i.e. via things like partial bond order -- can allow us to catch the "in between cases" just via interpolation.

Relating to alchemical calculations -- I agree there could be interest in defining atom types that are explicitly alchemical (in between other types, etc.), though that's really more an issue of force field parameters than of anything else.

However, that's not what I was talking about here -- here I was talking about tracking information that we will need for molecules in order to be able to assign our force field parameters to the molecules themselves. The one we're thinking about the most at the moment is the partial bond order... :)

davidlmobley commented 8 years ago

I have to run off, but in case I still didn't explain clearly enough, you might just check out this comment as it illustrates a specific example for the case of carbon-carbon bonds like I mentioned above:

https://github.com/open-forcefield-group/smarty/pull/134#issuecomment-244495965

I can also point you to an IPython notebook to play with this if you get interested.

avirshup commented 8 years ago

@davidlmobley: Gotcha. When you've got time, yeah, I'd love to see a notebook! And, scientifically, it sounds seems like it could be an interesting intersection of chemical graph theory, force fields, and quantum chemistry - if you have a recommended reference, would definitely be interested in reading up on it.

davidlmobley commented 8 years ago

@avirshup -- here's a notebook, though you'd need an OpenEye license (and the October betas of OpenEye tools) installed to be able to actually fiddle with it: https://github.com/open-forcefield-group/smarty/blob/master/examples/partial_bondorder/test_partialbondorder.ipynb

I don't have a reference yet. We're writing this up starting around now, but it's basically something we came up with Christopher Bayly over the summer that I think has been stuck in his head for a decade or two and is finally coming out... :)

davidlmobley commented 8 years ago

This may be as simple as needing to store:

  • bond orders

and connectivity! (Though it's not clear to me how you would denote bond orders if you didn't have this)

  • atomic numbers
  • formal charges
  • And having a convention about implicit hydrogens (my preference: no implicit hydrogens allowed; missing hydrogens are fine)

I think I agree with this, @avirshup , except that extensions probably are presently needed and it also needs to be easily extendable (via something like SD tags used in SDF files) to include any other type of info someone would want to carry along. Obvious candidates for right now would be partial charges and partial bond orders; perhaps partial charges should be provided always but be empty if no charges are available (as in .mol2 files)

You probably also want to allow people to carry around atom types and atom names if they want to...