MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.28k stars 646 forks source link

TPRParser should not index from 0 #2364

Closed lilyminium closed 3 years ago

lilyminium commented 4 years ago

Expected behavior

My expected behaviour if is that if I select a residue in gmx make_ndx by residue number 1, and I select a residue with u.select_atoms('resid 1'), they give me the same residue.

Actual behavior

I would get different residues, because GROMACS indexes from 1, but TPRParser indexes from 0. TPR files may index from 0, but they are typically created from human-readable .top or .itp files that are indexed from 1.

Code to reproduce the behavior

>>> from MDAnalysis.tests.datafiles import TPR
>>> tpr = mda.Universe(TPR)
>>> tpr.atoms.ids
array([    0,     1,     2, ..., 47678, 47679, 47680])
>>> tpr.atoms.resids
array([    0,     0,     0, ..., 11299, 11300, 11301])

Currently version of MDAnalysis

jbarnoud commented 4 years ago

This should be trivial to fix, but I am worried it will break many user scripts.

lilyminium commented 4 years ago

That's an important consideration, but for consistency there should be a meaningful distinction between ids/resids (read from the file, indexed from 1) and indices/resindices (derived, indexed from 0). Warnings?

zemanj commented 4 years ago

As @jbarnoud already said, simply changing the behavior will probably get many users into trouble. Even with a warning it might be more annoying than useful for most users.

but for consistency...

That's a valid point. Nevertheless, I can't think of any good reasoning for using 1-based indexing apart from the fact that Gromacs does it. How many other formats do we have that support resids? If there is at least one more that uses zero-based indexing, I strongly suppose we keep the current behavior. From my point of view, I don't want MDA to behave like I would expect from a particular format. I want it to behave the same way no matter what kind of format I feed to it.

lilyminium commented 4 years ago

How many other formats do we have that support resids? If there is at least one more that uses zero-based indexing,

Results vary.

>>> import MDAnalysis as mda
>>> mda.__version__
'0.20.1'
>>> from MDAnalysis.tests.datafiles import *

PSF and Desmond have atom ids from 0, but resids from 1. No formats other than TPRParser index resid from 0

>>> psf = mda.Universe(PSF)
>>> psf.atoms.ids
array([   0,    1,    2, ..., 3338, 3339, 3340])
>>> psf.atoms.resids
array([  1,   1,   1, ..., 214, 214, 214])

The PSFParser reads the file and subtracts 1 for the atom id, rather than just indexing found atoms. (Why?) It also reads the file for resid but does not subtract 1.

>>> dms = mda.Universe(DMS)
>>> dms.atoms.ids
array([   0,    1,    2, ..., 3338, 3339, 3340])
>>> dms.atoms.resids
array([  1,   1,   1, ..., 214, 214, 214])

The DMSParser queries the Desmond database for atom information. Desmond indexes from 0 (17.1.1). Apparently it indexes resids from 1, I guess.

Everything else indexes ids and resids from 1, one way or another

crd

>>> crd = mda.Universe(CRD)
>>> crd.atoms.ids
array([   1,    2,    3, ..., 3339, 3340, 3341])
>>> crd.atoms.resids
array([  1,   1,   1, ..., 214, 214, 214])

pdb

>>> pdb = mda.Universe(PDB)
>>> pdb.atoms.ids
array([    1,     2,     3, ..., 47679, 47680, 47681])
>>> pdb.atoms.resids
array([    1,     1,     1, ..., 11300, 11301, 11302])

Extended pdb

>>> xpdb = mda.Universe(XPDB_small)
>>> xpdb.atoms.ids
array([1, 2, 3, 4, 5])
>>> xpdb.atoms.resids
array([   1,   10,  100, 1000, 1000])

gro

>>> gro = mda.Universe(GRO)
>>> gro.atoms.ids
array([    1,     2,     3, ..., 47679, 47680, 47681])
>>> gro.atoms.resids
array([    1,     1,     1, ..., 11300, 11301, 11302])

xyz

>>> xyz = mda.Universe(XYZ)
>>> xyz.atoms.ids
array([   1,    2,    3, ..., 1282, 1283, 1284])
>>> xyz.atoms.resids
array([1, 1, 1, ..., 1, 1, 1])

txyz

>>> txyz = mda.Universe(TXYZ)
>>> txyz.atoms.ids
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> txyz.atoms.resids
array([1, 1, 1, 1, 1, 1, 1, 1, 1])

prmtop

>>> prm = mda.Universe(PRM)
>>> prm.atoms.ids[:10]
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
>>> prm.atoms.resids[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

pqr

>>> pqr = mda.Universe(PQR)
>>> pqr.atoms.ids
array([   1,    2,    3, ..., 3339, 3340, 3341])
>>> pqr.atoms.resids
array([  1,   1,   1, ..., 214, 214, 214])

pdbqt

>>> pdbqt = mda.Universe(PDBQT_input)
>>> pdbqt.atoms.ids
array([   1,    2,    3, ..., 1803, 1804, 1805])
>>> pdbqt.atoms.resids
array([ 2,  2,  2, ..., 99, 99, 99])

mol2

>>> mol2 = mda.Universe(mol2_molecules)
>>> mol2.atoms.ids
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> mol2.atoms.resids
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])

mmtf

>>> mmtf = mda.Universe(MMTF)
>>> mmtf.atoms.ids[:5]
array([1, 2, 3, 4, 5])
>>> mmtf.atoms.resids[:5]
array([1, 1, 1, 1, 1])

GAMESS

>>> gms = mda.Universe(GMS_ASYMOPT)
>>> gms.atoms.ids
array([1, 2, 3, 4, 5, 6])
>>> gms.atoms.resids
array([1, 1, 1, 1, 1, 1])

DL Poly

>>> dlpconfig = mda.Universe(DLP_CONFIG, format='config')
>>> dlpconfig.atoms.ids[:5]
array([1, 2, 3, 4, 5])
>>> dlpconfig.atoms.resids[:5]
array([1, 1, 1, 1, 1])
>>> dlphistory = mda.Universe(DLP_HISTORY, format='history')
>>> dlphistory.atoms.ids[:5]
array([1, 2, 3, 4, 5])
>>> dlphistory.atoms.resids[:5]
array([1, 1, 1, 1, 1])

HOOMD XML

>>> hoomdxml.atoms.ids[:5]
>>> hoomdxml = mda.Universe(HoomdXMLdata)
array([1, 2, 3, 4, 5])
>>> hoomdxml.atoms.resids[:5]
array([1, 1, 1, 1, 1])

Thoughts

From my point of view, I don't want MDA to behave like I would expect from a particular format. I want it to behave the same way no matter what kind of format I feed to it.

IMO (as a user) the behaviour of MDAnalysis right now is to read attributes from file formats when it can, which includes ids/resids. This means that the same system in different formats might have different attributes available, which is currently the case. The 'canonical'(?) attributes of indices/resindices wouldn't change across formats.

I am worried it will break many user scripts.

You could also warn the other way -- keep current behaviour and warn new users? We've already had an indexing error tank a project in our group before, so I'm very alert for new ones!

I can note diversions from expected behaviour in the documentation when I've worked out what expected behaviour is.

jbarnoud commented 4 years ago

The resids are meant to be what the users have in their input files. It is how the users refer to their residues other than in MDAnalysis. For instance, if your first residue is numbered 5 in a PDB file, then the resid for that first residue should be 5. If I recall correctly, we have resnum for the 0 indexed, continuous residue serial (to be checked, fuzzy memory here).

Following that logic, it is meaningful to index the TPR from 1 since the GRO and ITP files, the TPR is built from, are indexed from 1.

Though, there are many scripts out there that will be silently broken, and I am afraid a warning will not be enough to prevent wrong analyses to issue. So I would be in favour of warning in the doc that it is 0-indexed. It is indeed important that diversion from the expected behaviour are minimized or at least documented.

zemanj commented 4 years ago

Thanks @lilyminium for the detailed analysis!

So the expected behavior is definitely that TPRParser should index resid from 1.

lilyminium commented 4 years ago

@jbarnoud I was meaning to ask about resnum. It seems like an alias of resid at the moment -- it doesn't seem to behave differently in any instances.

>>> tpr.atoms.resnums
array([    0,     0,     0, ..., 11299, 11300, 11301])
>>> psf.atoms.resnums
array([  1,   1,   1, ..., 214, 214, 214])
>>> dms.atoms.resnums
array([  1,   1,   1, ..., 214, 214, 214])
>>> pdbqt.atoms.resids
array([ 2,  2,  2, ..., 99, 99, 99])
>>> pdbqt.atoms.resnums
array([ 2,  2,  2, ..., 99, 99, 99])

etc.

indices/resindices/segindices are the 0-indexed, continuous counters for atoms/residues/segments. I have an ongoing table of notes at the UserGuide wiki

jbarnoud commented 4 years ago

I may have mistaken resnum for resindex. The doc of the selection language seem to weight for resnum and resindex being synonymous:

resid residue-number-range

resid can take a single residue number or a range of numbers. A range consists of two numbers separated by a colon (inclusive) such as resid 1:5. A residue number (“resid”) is taken directly from the topology.

resnum resnum-number-range

resnum is the canonical residue number; typically it is set to the residue id in the original PDB structure.

orbeckst commented 4 years ago

resid vs resnum

History

At some point in the dim and distance past the intention had been that resnum is whatever the file gives us (just a tag that we carry around) and resid is MDAnalysis own idea of residue numbering, starting at 0 – this is how VMD handles... IIRC(?).

Clearly, that's not what we have.

ix

Since the topology rewrite #363 we really use the .ix indices as the internal numbering (atom.ix, residue.ix, segment.ix for the number, atoms.ix/atoms.indices, atoms.resindices, atoms.segindices). Note:

>>> u.atoms.indices is u.atoms.ix
True
>>> u.residues.resindices is u.residues.ix
True
>>> u.segments.segindices is u.segments.ix
True

but

>>> len(u.atoms.resindices)
47681
>>> len(u.residues.resindices)
11302

(I used

import MDAnalysis as mda; from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.TPR, data.XTC)

for the example above.)

Suggested use (?)

Currently

(Note that for traditional selections we always had bynum which uses 1-based atom indices and index for the "ix" indices. However, we never bother with storing atom numbers from files so bynum is just index + 1.)

What's wrong here?

There are many issues with attributes and their meaning, many of which @lilyminium collected in a gallery of horrors #2362. Many of these also show up when users have issues with the PDB format and related formats, for instance

I think the main problem is that we do not properly distinguish between data that we associate unchanged with particles (I look at them as "tags") and data that MDAnalysis maintains in a consistent form. The latter is generated by the topology parsers and is initially based on the data in the files. But we should then decouple these two instead of doing messy things such as pretending that segid and chainID are the same thing or changing the resid to a 0-based index.

Possible solution: "Tags"/Attributes and MDAnalysis indices

The "new" topology system makes it super-easy (relatively speaking) to associate arbitrary attributes with atoms. It would be helpful to have a list of common attributes that most people expect and would like to access under the same name, regardless of file format. We would then teach our parsers to populate the "common" attributes with the closest equivalent, possibly keeping the native names in custom attributes as well, and maintain the MDAnalysis view (atomindex, residueindex, segmentindex, mass, charge, ... anything else?).

Unfortunately, expectations are likely different in different domains. For biomolecular simulations something like

The "common" attributes would have keywords in the selection language. The custom attributes would just be accessible as attributes for pandas-style boolean selections.

orbeckst commented 4 years ago

TPR parser

And from @lilyminium 's summary above, I also think that the TPR parser should use 1-based resids.

jbarnoud commented 4 years ago

We would then teach our parsers to populate the "common" attributes with the closest equivalent

Do you mean guessed by default, or populate what you know? Because the guesses are really only pertinent for fully atomistic simulations, and can break fairly easily beyond the most common molecules.

KCLPhysics commented 4 years ago

It sounds like there are quite a few issues with the way attributes are handled that could be resolved by making a number of common attributes consistently available. In terms of TPR resids being 1-based, we have a couple of thoughts to add.

Having used gro or pdb files for most of our work, when switching to tpr files to have access to partial charges, we did not expect that the resids would change. We also did not think to check the documentation for the tpr parser as we had no reason to believe indices would be different across the two file formats. So, in essence, there is already a silent break. There may well be users out there who have not yet realised that tpr files have 0-based resids. Further, by not changing the resids to be 1-based, we will require every future user of the tpr parser to read the documentation to discover this unexpected behaviour.

It therefore seems like a good idea to move to 1-based indices for tpr files. One way to do this, without requiring too many changes to current scripts, would be to require an index argument to be passed to mda.Universe() when using tpr files. Current users who do not want their scripts to break could specify index=0, whilst we could recommend new users set index=1. In future releases, index=1 should be made the default value, or this argument deprecated altogether.

@bieniekmateusz and @p-j-smith

orbeckst commented 4 years ago

We would then teach our parsers to populate the "common" attributes with the closest equivalent

Do you mean guessed by default, or populate what you know? Because the guesses are really only pertinent for fully atomistic simulations, and can break fairly easily beyond the most common molecules.

I primarily mean "things that we get from the file but might have different names in different file formats" such as using the PDB ChainID as a value for the MDA segment identifier or making PDB's resSeq available as resid. There are also gray areas. For XYZ we have no numbering information whatsoever and so make up the atom IDs as the sequence from 1 to N_atoms.

Guessing is always a problem. The purist's approach is of course to just not guess anything and throw an error; or in our case, do not add certain attributes or methods (e.g., no center_of_mass() if there are no masses) or leave values empty (or set them arbitrarily, such as adding all atoms in an XYZ topology as a single residue with resid "1").

The convenience of guessing can be high, though, although this needs to be improved, and only be done by explicit user settings, see issue #2348.

orbeckst commented 4 years ago

@lilyminium collected common attributes in https://github.com/MDAnalysis/UserGuide/wiki/Topology-attributes

orbeckst commented 3 years ago

2.0.0 is going to be a good place to fix this issue, with the ugly user-alerting stuff in a 1.1.0 release.

My view after rereading the whole discussion is:

We should revisit the purpose of resnum which is probably not used a lot. We could make resnum the attribute that we extract from the file but perhaps dropping it completely and rather informing users about the internal indices might be clearer and reduce clutter/confusion.

lilyminium commented 3 years ago

We could make resnum the attribute that we extract from the file

resnum is currently synonymous with resid everywhere but atom selection language^ so making it mean something different for the TPRParser could be much more confusing than recommending resindices.

^ In current develop, you can select "resid 20A:23B" but not "resnum 20A:23B". "resid 20:23" should return the same selection as "resnum 20:23" but "resid 23:20" does not return the empty AtomGroup that "resnum 23:20" does.

orbeckst commented 3 years ago

So, scrap the suggestion we could repurpose resnum. Perhaps just ditch it – we can deprecated it in a 1.1.

IAlibay commented 3 years ago

I'm modifying the milestone here to 1.0.2 so that we can make a decision on whether we want to go ahead with the deprecation for this now so that we can fix this in 2.0. We need to agree on this soon as we'd like to get 1.0.2 out ASAP.

orbeckst commented 3 years ago

I am in favor of

EDIT: Let's wrap up MDA 1.x with this release. If we add kwargs to Universe then according to semver we need to call it 1.1.0, though.

orbeckst commented 3 years ago

(The purist in me still does not particularly like that we are changing resid from what's inside the file. However, I feel that resid is so commonly used by users without thinking that it's an internal attribute that we have to make the change. If we want to keep track of internal values then we should come up with a bunch of attributes that make this clear, e.g. tpr_resid or pdb_resid or just a raw namespace so that raw.resid is what "resid" is in the TPR.)

lilyminium commented 3 years ago

However, I feel that resid is so commonly used by users without thinking that it's an internal attribute that we have to make the change.

I think many users don't expect this because a) tpr files are not human readable and b) gro files, also supported by gromacs and human readable, index the exact same system from 1. In favour of supporting indexing from 1, but maybe printing warnings that we'll do so in 2.0 in 1.0.2, and printing warnings in 2.0 that we have now done this.

Edit: ... as you already said in https://github.com/MDAnalysis/mdanalysis/issues/2364#issuecomment-695039389

orbeckst commented 3 years ago

I think many users don't expect this [...]

  1. Sorry, just to clarify: By "this" you mean "keeping TPR resid 0-based as in the internals of the TPR format"?
  2. And I read your comment in total that you are supporting making resid 1-based.
  3. Finally, you think that we do not need the tpr_resid_one_based=True|False in 1.1.0 and that a warning of impending doom is sufficient?
lilyminium commented 3 years ago

@orbeckst

  1. Sorry, yes the unexpected "this" is the currently 0-based resids.
  2. Yes
  3. I thought we were just adding a deprecation warning in 1.0.x and actually doing the change in 2.0?

deprecating resnum in 1.0.2 for removal in 2.0

Also, re: deprecating resnum -- it actually does now select different information to resid in a selection string, because resids incorporate altlocs and resnums do not. I think you could replicate resnum selection by just not specifying the altLoc in the selection string of resid, although I'm not too clear on how resid currently works with empty altLocs (@richardjgowers might know off the top of his head?). If not, this could be an argument to keep resnum.

jbarnoud commented 3 years ago

I like the tpr_resid_one_based=True|False. It means users can prepare for the deprecation.

xiki-tempula commented 3 years ago

To add to this whole complexity. The residues id in the gromacs gro file could start from a number that is not one. For example, if I have a protein where the N-terminus is not there and the crystal structure starts from residue 20. The gro file could start with resid 20 but the gromacs tpr residue id will always start from 0.

orbeckst commented 3 years ago

The master (deprecation warning + new kwarg) version PR #3153 was merged; this issue will close with the develop PR #3152 that changes the default behavior.