MobleyLab / Lomap

Alchemical mutation scoring map
MIT License
37 stars 17 forks source link

MCS atom mappings and hydrogens #9

Closed cgorgulla closed 7 years ago

cgorgulla commented 7 years ago

For my applications I need the mappings between the atom indices of the pairs of molecules which are connected in the in network graph. I.e. a list of pairs of integers. For instance, suppose there are two molecules each having 3 atoms, and the two molecules share two atoms, I need a mapping of the form

1 1
2 3

where the first column contains the indices of the "common" atoms of the first molecule and the second column contains the indices of the common atoms in the second molecule.

Your code seems to be able to provide this information, but due to the yet minimal documentation (and lack of time to study the source code thoroughly) I just added a quick hack which gave me the information I was looking for (which might not be the most elegant way). I got the information (written to files) by adding the last three of the following lines to the code in the dbmol.py:

# Maximum Common Subgraph (MCS) calculation    
MC = mcs.MCS(moli, molj, options=self.options)
with open("mcs_mapping_" + str(i) + "_" + str(j), "w") as mappingFile:
    for item in MC._MCS__map_moli_molj:
        mappingFile.write(str(item[0]) + " " + str(item[1]) + "\n")

Now the problem is that this outputs only the MCS mappings without the hydrogen atoms, but I also need the hydrogens included in mappings. How can this be done?

In the file dbmol.py, line 310, is also a small bug/typo:
loggign.warning(30*'-') which results the termination of the program when the warning is going to be issued.

davidlmobley commented 7 years ago

@nividic - this seems like something we should be fixing if at all possible. Certainly the bug/typo should be fixed, as well as outputting as much as we can about the mappings.

It strikes me that we should be able to also get the hydrogen mappings from the heavy atom mappings/connectivity, even if we haven't used RDKit to get the mappings of these. But we can discuss in person if needed. Ideally we could handle this whole issue.

nividic commented 7 years ago

This is not easy because the previous implementation and the new one are strongly based on this assumption during the MCS calculation and scoring. I can suggest to define a new function where the MCS is re-calculated from scratch. It should work like that:

mcs_map = lomap.MCS.getMapping(mola, molb, hydrogens=TRUE)

In the current code is:

MC = lomap.MCS(mola,molb)
mcs_map = MC.getMap()
davidlmobley commented 7 years ago

@nividic - ah, using heavy atom count for scoring but then doing a more complete MCS mapping for final output sounds like it should work well. Can you do that?

nividic commented 7 years ago

Can you be more specific?

cgorgulla commented 7 years ago

@nividic, @davidlmobley: Thank you very much for addressing the requests which I have made, I highly appreciate it.

The above mentioned approach sounds reasonable to me. Another possibility which came to my mind was to simply include all atom types in the MCS mapping also during scoring. Or is there a special reason why previously only the heavy atoms were considered in the MCS mapping/scoring?

halx commented 7 years ago

I don't know what the original motivations was but eventually the aim is to find those pairs that can be most efficiently (computationally, statistically) transformed. This implies that any suitable form of similarity can be chosen, meaning that it doesn't have to be a MCSS although it seems very reasonable that this is a good way forward (maximum match=minimal dummy atoms). The hydrogens don't really matter for that. They just make the molecular graph larger which may mean considerably longer runtimes or even problems. Remember that the time complexity of the MCS algorithm is NP complete.

On 17 February 2017 at 14:07, Christoph Gorgulla notifications@github.com wrote:

@nividic https://github.com/nividic, @davidlmobley https://github.com/davidlmobley: Thank you very much for addressing the requests which I have made, I highly appreciate it.

The above mentioned approach sounds reasonable to me. Another possibility which came to my mind was to simply include all atom types in the MCS mapping also during scoring. Or is there a special reason why previously only the heavy atoms were considered in the MCS mapping/scoring?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MobleyLab/Lomap/issues/9#issuecomment-280658492, or mute the thread https://github.com/notifications/unsubscribe-auth/AH3aWiJjH85eu-vsUZdev2k4hqSsUmLMks5rdamNgaJpZM4L-LSV .

davidlmobley commented 7 years ago

There were two reasons (that I remember off the top of my head) for not including the hydrogens in the mapping originally, @cgorgulla : a) Conventional wisdom was that the "difficulty" of relative free energy calculations has to do with the number of heavy atoms changed, not changes in the number of protons, so these were thought to be unimportant or less important for scoring the difficulty of transformations b) Including hydrogens initially would make the scoring much much slower, as Hannes notes.

As Hannes also notes, a very interesting avenue of future research is to explore other types of scoring rather than MCS. Personally, I'm inclined to think that the "difficulty" of relative free energy calculations will have more to do with changes in shape/chemistry, dipole moment, and things like locations of hydrogen bond donors and acceptors than it will with concepts like the number of heavy atoms. So I'd like to see a future study explore whether it's more efficient to create maps based on these kinds of ideas rather than MCS.

Of course, one still needs to generate a topological mapping of atoms to be mutated anyway, so topological similarity will still have some importance even if another similarity measure is better in the end.

davidlmobley commented 7 years ago

@nividic - you got this in in one of your February updates, right? If so please mark as closed and indicate how these can be accessed.

nividic commented 7 years ago

This problem has been fixed. A getMapping() function has been added where the keyword hydrogen=True can be used to have the mcs mapping including the hydrogens e.g:

MC = lomap.MCS.getMapping(mol1, mol2, hydrogens=True, fname='mcs.png')

where mol1 and mol2 are two rdkit molecules

cgorgulla commented 7 years ago

@davidlmobley @halx : I still need to thank you both very much for your explanations.

And also again many thanks to @nividic for the implementation of the new features.