MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.31k stars 647 forks source link

HydrogenBondAnalysis: KeyError for heavy atoms when selection is updated #1687

Open dnlbauer opened 6 years ago

dnlbauer commented 6 years ago

Expected behaviour

When I want to get hydrogen bonding frequencies (h.count_by_type()) in combination with the update_selections flag, it should built the proper frequencies table.

Actual behaviour

Under some circumstances, if update_selection is set to true, a given index might not be present in self._s2_donors when the heavy atom lookup table is built. This leads to a crash when the heavy atoms are generated because the key is not present in the lookup table. This happens mostly when the selection that is set to update includes solvent or other rapid exchanging residues.

Code to reproduce the behaviour

import MDAnalysis as mda
u = mda.Universe(top, trj)

# this works
h = mda.analysis.hbonds.HydrogenBondAnalysis(u, "resid 1-5", "resname SOL", update_selection1=False, update_selection2=False)
h.run()
h.generate_table()
frequencies = h.count_by_type()

# this fails
h = mda.analysis.hbonds.HydrogenBondAnalysis(u, "resid 1-5", "resname SOL", update_selection1=False, update_selection2=True)
h.run()
h.generate_table()
frequencies = h.count_by_type()

Traceback (most recent call last):

  File "<ipython-input-7-438a429831da>", line 16, in <module>
    frequencies = h.count_by_type()

  File "/home/bauer/.local/lib/python3.6/site-packages/MDAnalysis/analysis/hbonds/hbond_analysis.py", line 1149, in count_by_type
    r.donor_heavy_atom[:] = [h2donor[idx] for idx in r.donor_index]

  File "/home/bauer/.local/lib/python3.6/site-packages/MDAnalysis/analysis/hbonds/hbond_analysis.py", line 1149, in <listcomp>
    r.donor_heavy_atom[:] = [h2donor[idx] for idx in r.donor_index]

KeyError: 29971

Currently version of MDAnalysis:

import sys
import MDAnalysis as md
md.__version__
# '0.16.0'
sys.version
# '3.6.2 (default, Jul 20 2017, 03:52:27) \n[GCC 7.1.1 20170630]'
orbeckst commented 6 years ago

@danijoo are you able to reproduce the error with the latest release 0.16.2, too?

(It is generally safe to upgrade within the minor release number and there were various bug fixes 0.16.0 --> 0.16.2 and there was a fix for HBanalysis although it is likely not fixing your problem, but still, we need to work from the latest release.)

Are you able to reproduce the error with one of the existing trajectories, e.g.,

from MDAnalysis.tests.datafiles import TPR, XTC

If yes, then it is a lot easier for a developer to have a look at it. I would consider providing a repoducible failure the biggest contribution a user can make to fixing a bug. My experience is that with a reproducible failure, developers are much more likely to "just have a quick look" and then actually get to fixing it. On the other hand, finding a test/failure first is probably the number one reason why issues don't get addressed promptly.

orbeckst commented 6 years ago

ping @danijoo – it would really help if you could provide more information and address the questions. Thanks.

aakognole commented 5 years ago

@orbeckst - I am having this same issue as I am trying it with RNA molecule adding new donors and acceptors.

for resid in rna.residues.resids:
    h = MDAnalysis.analysis.hbonds.HydrogenBondAnalysis(u, 'nucleic and resid %d' % (resid), 'nucleic and not resid %d' % (resid), update_selection1=False, update_selection2=False, distance=2.5, angle=120.0, distance_type="hydrogen", donors=("ADE N6", "URA N3", "GUA N1", "GUA N2", "CYT N4", "ADE O2'", "URA O2'", "GUA O2'", "CYT O2'"), acceptors=("ADE N1", "ADE N7", "ADE N3", "URA O4", "URA O2", "GUA O6", "GUA N7", "GUA N3", "CYT N3", "CYT O2"))
    h.run()
    h.generate_table()
    print h.table
    hblist = np.array(h.count_by_type())

The code works fine for some residues and then gives the error for count_by_type() call:

...
Traceback (most recent call last):
  File "h-bond-analysis.py", line 28, in <module>
    hblist = np.array(h.count_by_type())
  File "/home/akognole/modules/miniconda2/lib/python2.7/site-packages/MDAnalysis/analysis/hbonds/hbond_analysis.py", line 1301, in count_by_type
    r.donor_heavy_atom[:] = [h2donor[idx] for idx in r.donor_index]
KeyError: 1168

based on h.table output, "1168" appears to be the index for a hydrogen H3 connected to heavy atom N3 ( 0. , 1168, 526, u'URA', 37, u'H3', u'ADE', 17, u'N3', 2.1326344 , 158.65791505)

I tried your test datafiles, and they worked fine until for the last GLY residue I got error saying "No acceptors in selection 1" which should just continue with a warning maybe.

But, the index error above is something different. Any idea what could be wrong?

aakognole commented 5 years ago

I guess it is an error caused by some inconsistency in topology.

actually, having a really big trajectory file, I did not want to run the analysis over all the frames and there is no option to skip frames, so I wrote a dcd file skipping frames and selecting only RNA. Before that I also wrote a pdb file to use as a topology and read those into an Universe to use for hbond analysis. This was giving the error.

rna = u.select_atoms("nucleic")
#rna.write("nucleic.pdb")
#with MDAnalysis.Writer("nucleic.dcd", rna.n_atoms) as W:
#    for ts in u.trajectory[0::50]:
#        W.write(rna)
#u = mda.Universe("nucleic.pdb", "nucleic.dcd")

When I switched back to full trajectory and psf it worked. Thanks!

orbeckst commented 5 years ago

and there is no option to skip frames

There is: in the HydrogenBondAnalysis.run() method:

h.run(step=20)
aakognole commented 5 years ago

I guess I missed it. Thanks!

aakognole commented 5 years ago

There is: in the HydrogenBondAnalysis.run() method:

I was just trying this, but it does not seem to skip the frames. I also tried it with your test data files which has 10 frames. I tried step=2 and step=5. I got the table showing the h-bonds in all the frames.

orbeckst commented 5 years ago

Sorry, if this is not working as documented then it's a bug. Can you please open a new issue with this bug? Thanks!

-- Oliver Beckstein email: orbeckst@gmail.com

Am Oct 23, 2018 um 09:08 schrieb Abhishek Kognole notifications@github.com:

There is: in the HydrogenBondAnalysis.run() method:

I was just trying this, but it does not seem to skip the frames. I also tried it with your test data files which has 10 frames. I tried step=2 and step=5. I got the table showing the h-bonds in all the frames.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

orbeckst commented 5 years ago

I can reproduce this error with the test data files on the latest develop (0.19.1-dev)

import MDAnalysis as mda; from MDAnalysis.tests.datafiles import TPR, XTC
import MDAnalysis.analysis.hbonds
u = mda.Universe(TPR, XTC)
h = MDAnalysis.analysis.hbonds.HydrogenBondAnalysis(u, 'protein', 'resname SOL')
h.run()
h.count_by_type()

raises KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-248a3b77edc0> in <module>
      4 h = MDAnalysis.analysis.hbonds.HydrogenBondAnalysis(u, 'protein', 'resname SOL')
      5 h.run()
----> 6 h.count_by_type()

/Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/mdanalysis/package/MDAnalysis/analysis/hbonds/hbond_analysis.py in count_by_type(self)
   1277         # patch in donor heavy atom names (replaces '?' in the key)
   1278         h2donor = self._donor_lookup_table_byindex()
-> 1279         r.donor_heavy_atom[:] = [h2donor[idx] for idx in r.donor_index]
   1280
   1281         return r

/Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/mdanalysis/package/MDAnalysis/analysis/hbonds/hbond_analysis.py in <listcomp>(.0)
   1277         # patch in donor heavy atom names (replaces '?' in the key)
   1278         h2donor = self._donor_lookup_table_byindex()
-> 1279         r.donor_heavy_atom[:] = [h2donor[idx] for idx in r.donor_index]
   1280
   1281         return r

KeyError: 3414
orbeckst commented 5 years ago

@richardjgowers this is not a good bug to have hanging around for the workshop...

richardjgowers commented 5 years ago

@orbeckst I wasn't planning on using this class, I've never understood how it works. It looks like it builds the lookup table for donors once and when you update selections this becomes wrong (after a single frame)

orbeckst commented 5 years ago

Fair enough, but even if you're not using, someone might to do a problem during the "open session".

It is quite possible that this is broken with updating groups.

orbeckst commented 5 years ago

Yes, it is broken, and I documented it a long time ago ...

Assumptions: selections have not changed (because we are simply looking at the last content of the donors and donor hydrogen lists)

https://github.com/MDAnalysis/mdanalysis/blob/c9a37e3a494f741014cf3fedf1fa2606c7b43a24/package/MDAnalysis/analysis/hbonds/hbond_analysis.py#L1395-L1426

orbeckst commented 5 years ago

Admittedly, the groups in my example https://github.com/MDAnalysis/mdanalysis/issues/1687#issuecomment-433207623 are not even changing – they are just protein and resname SOL so I am a bit confused why that fails.

orbeckst commented 5 years ago
  1. A possible fix is to rewrite _donor_lookup_table_byindex and _donor_lookup_table_byres so that they take a list of all donor H atom indices that are of interest and then completely re-derive the donor heavy atoms, using the inverse of _get_bonded_hydrogens_dist. (We can just fail for _get_bonded_hydrogens_list because that will go away.)

  2. Or we just add the donor heavy atoms from self._s1_donors and self._s2_donors to the internal h._timeseries and use those instead of all this cr*&!

I am infinitely in favor of 2.

orbeckst commented 5 years ago

On a related note, we should probably try to find hydrogens via bonds if we have them, as the first choice before _get_bonded_hydrogens_dist.

richardjgowers commented 5 years ago

Yeah I opened #2120 for the lifetime tool (which expects the AGs to be made a priori), could be possible to share