Closed lilyminium closed 3 years ago
This should be trivial to fix, but I am worried it will break many user scripts.
That's an important consideration, but for consistency there should be a meaningful distinction between ids/resids (read from the file, indexed from 1) and indices/resindices (derived, indexed from 0). Warnings?
As @jbarnoud already said, simply changing the behavior will probably get many users into trouble. Even with a warning it might be more annoying than useful for most users.
but for consistency...
That's a valid point. Nevertheless, I can't think of any good reasoning for using 1-based indexing apart from the fact that Gromacs does it. How many other formats do we have that support resids? If there is at least one more that uses zero-based indexing, I strongly suppose we keep the current behavior. From my point of view, I don't want MDA to behave like I would expect from a particular format. I want it to behave the same way no matter what kind of format I feed to it.
How many other formats do we have that support resids? If there is at least one more that uses zero-based indexing,
Results vary.
>>> import MDAnalysis as mda
>>> mda.__version__
'0.20.1'
>>> from MDAnalysis.tests.datafiles import *
PSF and Desmond have atom ids from 0, but resids from 1. No formats other than TPRParser index resid from 0
>>> psf = mda.Universe(PSF)
>>> psf.atoms.ids
array([ 0, 1, 2, ..., 3338, 3339, 3340])
>>> psf.atoms.resids
array([ 1, 1, 1, ..., 214, 214, 214])
The PSFParser reads the file and subtracts 1 for the atom id, rather than just indexing found atoms. (Why?) It also reads the file for resid but does not subtract 1.
>>> dms = mda.Universe(DMS)
>>> dms.atoms.ids
array([ 0, 1, 2, ..., 3338, 3339, 3340])
>>> dms.atoms.resids
array([ 1, 1, 1, ..., 214, 214, 214])
The DMSParser queries the Desmond database for atom information. Desmond indexes from 0 (17.1.1). Apparently it indexes resids from 1, I guess.
Everything else indexes ids and resids from 1, one way or another
crd
>>> crd = mda.Universe(CRD)
>>> crd.atoms.ids
array([ 1, 2, 3, ..., 3339, 3340, 3341])
>>> crd.atoms.resids
array([ 1, 1, 1, ..., 214, 214, 214])
pdb
>>> pdb = mda.Universe(PDB)
>>> pdb.atoms.ids
array([ 1, 2, 3, ..., 47679, 47680, 47681])
>>> pdb.atoms.resids
array([ 1, 1, 1, ..., 11300, 11301, 11302])
Extended pdb
>>> xpdb = mda.Universe(XPDB_small)
>>> xpdb.atoms.ids
array([1, 2, 3, 4, 5])
>>> xpdb.atoms.resids
array([ 1, 10, 100, 1000, 1000])
gro
>>> gro = mda.Universe(GRO)
>>> gro.atoms.ids
array([ 1, 2, 3, ..., 47679, 47680, 47681])
>>> gro.atoms.resids
array([ 1, 1, 1, ..., 11300, 11301, 11302])
xyz
>>> xyz = mda.Universe(XYZ)
>>> xyz.atoms.ids
array([ 1, 2, 3, ..., 1282, 1283, 1284])
>>> xyz.atoms.resids
array([1, 1, 1, ..., 1, 1, 1])
txyz
>>> txyz = mda.Universe(TXYZ)
>>> txyz.atoms.ids
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> txyz.atoms.resids
array([1, 1, 1, 1, 1, 1, 1, 1, 1])
prmtop
>>> prm = mda.Universe(PRM)
>>> prm.atoms.ids[:10]
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> prm.atoms.resids[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
pqr
>>> pqr = mda.Universe(PQR)
>>> pqr.atoms.ids
array([ 1, 2, 3, ..., 3339, 3340, 3341])
>>> pqr.atoms.resids
array([ 1, 1, 1, ..., 214, 214, 214])
pdbqt
>>> pdbqt = mda.Universe(PDBQT_input)
>>> pdbqt.atoms.ids
array([ 1, 2, 3, ..., 1803, 1804, 1805])
>>> pdbqt.atoms.resids
array([ 2, 2, 2, ..., 99, 99, 99])
mol2
>>> mol2 = mda.Universe(mol2_molecules)
>>> mol2.atoms.ids
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> mol2.atoms.resids
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1])
mmtf
>>> mmtf = mda.Universe(MMTF)
>>> mmtf.atoms.ids[:5]
array([1, 2, 3, 4, 5])
>>> mmtf.atoms.resids[:5]
array([1, 1, 1, 1, 1])
GAMESS
>>> gms = mda.Universe(GMS_ASYMOPT)
>>> gms.atoms.ids
array([1, 2, 3, 4, 5, 6])
>>> gms.atoms.resids
array([1, 1, 1, 1, 1, 1])
DL Poly
>>> dlpconfig = mda.Universe(DLP_CONFIG, format='config')
>>> dlpconfig.atoms.ids[:5]
array([1, 2, 3, 4, 5])
>>> dlpconfig.atoms.resids[:5]
array([1, 1, 1, 1, 1])
>>> dlphistory = mda.Universe(DLP_HISTORY, format='history')
>>> dlphistory.atoms.ids[:5]
array([1, 2, 3, 4, 5])
>>> dlphistory.atoms.resids[:5]
array([1, 1, 1, 1, 1])
HOOMD XML
>>> hoomdxml.atoms.ids[:5]
>>> hoomdxml = mda.Universe(HoomdXMLdata)
array([1, 2, 3, 4, 5])
>>> hoomdxml.atoms.resids[:5]
array([1, 1, 1, 1, 1])
Thoughts
From my point of view, I don't want MDA to behave like I would expect from a particular format. I want it to behave the same way no matter what kind of format I feed to it.
IMO (as a user) the behaviour of MDAnalysis right now is to read attributes from file formats when it can, which includes ids/resids. This means that the same system in different formats might have different attributes available, which is currently the case. The 'canonical'(?) attributes of indices/resindices wouldn't change across formats.
I am worried it will break many user scripts.
You could also warn the other way -- keep current behaviour and warn new users? We've already had an indexing error tank a project in our group before, so I'm very alert for new ones!
I can note diversions from expected behaviour in the documentation when I've worked out what expected behaviour is.
The resids
are meant to be what the users have in their input files. It is how the users refer to their residues other than in MDAnalysis. For instance, if your first residue is numbered 5 in a PDB file, then the resid for that first residue should be 5. If I recall correctly, we have resnum
for the 0 indexed, continuous residue serial (to be checked, fuzzy memory here).
Following that logic, it is meaningful to index the TPR from 1 since the GRO and ITP files, the TPR is built from, are indexed from 1.
Though, there are many scripts out there that will be silently broken, and I am afraid a warning will not be enough to prevent wrong analyses to issue. So I would be in favour of warning in the doc that it is 0-indexed. It is indeed important that diversion from the expected behaviour are minimized or at least documented.
Thanks @lilyminium for the detailed analysis!
So the expected behavior is definitely that TPRParser
should index resid
from 1.
TPRparser
is used and resid
accessed, and provide a switch to deliberately disable the exception. Sounds like a rather dirty and annoying thing, though.resindex
should be used preferredly since that is more portable.@jbarnoud I was meaning to ask about resnum
. It seems like an alias of resid
at the moment -- it doesn't seem to behave differently in any instances.
>>> tpr.atoms.resnums
array([ 0, 0, 0, ..., 11299, 11300, 11301])
>>> psf.atoms.resnums
array([ 1, 1, 1, ..., 214, 214, 214])
>>> dms.atoms.resnums
array([ 1, 1, 1, ..., 214, 214, 214])
>>> pdbqt.atoms.resids
array([ 2, 2, 2, ..., 99, 99, 99])
>>> pdbqt.atoms.resnums
array([ 2, 2, 2, ..., 99, 99, 99])
etc.
indices/resindices/segindices are the 0-indexed, continuous counters for atoms/residues/segments. I have an ongoing table of notes at the UserGuide wiki
I may have mistaken resnum
for resindex
. The doc of the selection language seem to weight for resnum
and resindex
being synonymous:
resid residue-number-range
resid can take a single residue number or a range of numbers. A range consists of two numbers separated by a colon (inclusive) such as resid 1:5. A residue number (“resid”) is taken directly from the topology.
resnum resnum-number-range
resnum is the canonical residue number; typically it is set to the residue id in the original PDB structure.
At some point in the dim and distance past the intention had been that resnum is whatever the file gives us (just a tag that we carry around) and resid is MDAnalysis own idea of residue numbering, starting at 0 – this is how VMD handles... IIRC(?).
Clearly, that's not what we have.
Since the topology rewrite #363 we really use the .ix
indices as the internal numbering (atom.ix
, residue.ix
, segment.ix
for the number, atoms.ix
/atoms.indices
, atoms.resindices
, atoms.segindices
). Note:
>>> u.atoms.indices is u.atoms.ix
True
>>> u.residues.resindices is u.residues.ix
True
>>> u.segments.segindices is u.segments.ix
True
but
>>> len(u.atoms.resindices)
47681
>>> len(u.residues.resindices)
11302
(I used
import MDAnalysis as mda; from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.TPR, data.XTC)
for the example above.)
Currently
select_atoms()
use resid or resnum.(Note that for traditional selections we always had bynum
which uses 1-based atom indices and index
for the "ix" indices. However, we never bother with storing atom numbers from files so bynum
is just index + 1
.)
There are many issues with attributes and their meaning, many of which @lilyminium collected in a gallery of horrors #2362. Many of these also show up when users have issues with the PDB format and related formats, for instance
resid
= resnum
#2362I think the main problem is that we do not properly distinguish between data that we associate unchanged with particles (I look at them as "tags") and data that MDAnalysis maintains in a consistent form. The latter is generated by the topology parsers and is initially based on the data in the files. But we should then decouple these two instead of doing messy things such as pretending that segid and chainID are the same thing or changing the resid to a 0-based index.
The "new" topology system makes it super-easy (relatively speaking) to associate arbitrary attributes with atoms. It would be helpful to have a list of common attributes that most people expect and would like to access under the same name, regardless of file format. We would then teach our parsers to populate the "common" attributes with the closest equivalent, possibly keeping the native names in custom attributes as well, and maintain the MDAnalysis view (atomindex, residueindex, segmentindex, mass, charge, ... anything else?).
Unfortunately, expectations are likely different in different domains. For biomolecular simulations something like
The "common" attributes would have keywords in the selection language. The custom attributes would just be accessible as attributes for pandas-style boolean selections.
And from @lilyminium 's summary above, I also think that the TPR parser should use 1-based resids.
We would then teach our parsers to populate the "common" attributes with the closest equivalent
Do you mean guessed by default, or populate what you know? Because the guesses are really only pertinent for fully atomistic simulations, and can break fairly easily beyond the most common molecules.
It sounds like there are quite a few issues with the way attributes are handled that could be resolved by making a number of common attributes consistently available. In terms of TPR resids being 1-based, we have a couple of thoughts to add.
Having used gro or pdb files for most of our work, when switching to tpr files to have access to partial charges, we did not expect that the resids would change. We also did not think to check the documentation for the tpr parser as we had no reason to believe indices would be different across the two file formats. So, in essence, there is already a silent break. There may well be users out there who have not yet realised that tpr files have 0-based resids. Further, by not changing the resids to be 1-based, we will require every future user of the tpr parser to read the documentation to discover this unexpected behaviour.
It therefore seems like a good idea to move to 1-based indices for tpr files. One way to do this, without requiring too many changes to current scripts, would be to require an index
argument to be passed to mda.Universe()
when using tpr files. Current users who do not want their scripts to break could specify index=0
, whilst we could recommend new users set index=1
. In future releases, index=1
should be made the default value, or this argument deprecated altogether.
@bieniekmateusz and @p-j-smith
We would then teach our parsers to populate the "common" attributes with the closest equivalent
Do you mean guessed by default, or populate what you know? Because the guesses are really only pertinent for fully atomistic simulations, and can break fairly easily beyond the most common molecules.
I primarily mean "things that we get from the file but might have different names in different file formats" such as using the PDB ChainID as a value for the MDA segment identifier or making PDB's resSeq
available as resid. There are also gray areas. For XYZ we have no numbering information whatsoever and so make up the atom IDs as the sequence from 1 to N_atoms.
Guessing is always a problem. The purist's approach is of course to just not guess anything and throw an error; or in our case, do not add certain attributes or methods (e.g., no center_of_mass()
if there are no masses) or leave values empty (or set them arbitrarily, such as adding all atoms in an XYZ topology as a single residue with resid "1").
The convenience of guessing can be high, though, although this needs to be improved, and only be done by explicit user settings, see issue #2348.
@lilyminium collected common attributes in https://github.com/MDAnalysis/UserGuide/wiki/Topology-attributes
2.0.0 is going to be a good place to fix this issue, with the ugly user-alerting stuff in a 1.1.0 release.
My view after rereading the whole discussion is:
gmx dump
).Universe(TPR, tpr_resid_one_based=True|False)
required for parsing TPR files (and raising an exception otherwise) might be sufficient, especially if we can do this in 1.x (we can make a 1.1.0 just for this purpose). Then 2.x could be clean (only supporting 1-based resids).We should revisit the purpose of resnum
which is probably not used a lot. We could make resnum
the attribute that we extract from the file but perhaps dropping it completely and rather informing users about the internal indices might be clearer and reduce clutter/confusion.
We could make resnum the attribute that we extract from the file
resnum
is currently synonymous with resid
everywhere but atom selection language^ so making it mean something different for the TPRParser could be much more confusing than recommending resindices
.
^ In current develop
, you can select "resid 20A:23B" but not "resnum 20A:23B". "resid 20:23" should return the same selection as "resnum 20:23" but "resid 23:20" does not return the empty AtomGroup that "resnum 23:20" does.
So, scrap the suggestion we could repurpose resnum. Perhaps just ditch it – we can deprecated it in a 1.1.
I'm modifying the milestone here to 1.0.2 so that we can make a decision on whether we want to go ahead with the deprecation for this now so that we can fix this in 2.0. We need to agree on this soon as we'd like to get 1.0.2 out ASAP.
I am in favor of
resnum
in 1.0.2 for removal in 2.0tpr_resid_one_based=True|False
hack and warning in a 1.x (we could do a 1.1 just for it – would also make the CHANGELOG very clear and more in-line with semantic versioning) EDIT: Let's wrap up MDA 1.x with this release. If we add kwargs to Universe then according to semver we need to call it 1.1.0, though.
(The purist in me still does not particularly like that we are changing resid from what's inside the file. However, I feel that resid is so commonly used by users without thinking that it's an internal attribute that we have to make the change. If we want to keep track of internal values then we should come up with a bunch of attributes that make this clear, e.g. tpr_resid
or pdb_resid
or just a raw
namespace so that raw.resid
is what "resid" is in the TPR.)
However, I feel that resid is so commonly used by users without thinking that it's an internal attribute that we have to make the change.
I think many users don't expect this because a) tpr files are not human readable and b) gro files, also supported by gromacs and human readable, index the exact same system from 1. In favour of supporting indexing from 1, but maybe printing warnings that we'll do so in 2.0 in 1.0.2, and printing warnings in 2.0 that we have now done this.
Edit: ... as you already said in https://github.com/MDAnalysis/mdanalysis/issues/2364#issuecomment-695039389
I think many users don't expect this [...]
tpr_resid_one_based=True|False
in 1.1.0 and that a warning of impending doom is sufficient? @orbeckst
deprecating
resnum
in 1.0.2 for removal in 2.0
Also, re: deprecating resnum
-- it actually does now select different information to resid
in a selection string, because resids incorporate altlocs and resnums do not. I think you could replicate resnum
selection by just not specifying the altLoc
in the selection string of resid
, although I'm not too clear on how resid currently works with empty altLocs (@richardjgowers might know off the top of his head?). If not, this could be an argument to keep resnum
.
I like the tpr_resid_one_based=True|False
. It means users can prepare for the deprecation.
To add to this whole complexity. The residues id in the gromacs gro file could start from a number that is not one. For example, if I have a protein where the N-terminus is not there and the crystal structure starts from residue 20. The gro file could start with resid 20 but the gromacs tpr residue id will always start from 0.
The master (deprecation warning + new kwarg) version PR #3153 was merged; this issue will close with the develop PR #3152 that changes the default behavior.
Expected behavior
My expected behaviour if is that if I select a residue in
gmx make_ndx
by residue number 1, and I select a residue withu.select_atoms('resid 1')
, they give me the same residue.Actual behavior
I would get different residues, because GROMACS indexes from 1, but TPRParser indexes from 0. TPR files may index from 0, but they are typically created from human-readable .top or .itp files that are indexed from 1.
Code to reproduce the behavior
Currently version of MDAnalysis
python -c "import MDAnalysis as mda; print(mda.__version__)"
) 0.20.1python -V
)? 3.7.3