MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.32k stars 652 forks source link

Consolidate TopologyAttrs with consistent meanings #2362

Open lilyminium opened 5 years ago

lilyminium commented 5 years ago

TopologyAttrs are a bit of a mess right now, and the documentation is not up-to-date. This is a list of some inconsistencies I found.

atomiccharge

The GAMESS parser uses the TopologyAttr atomiccharge for the atomic number.

>>> from MDAnalysis.tests.datafiles import *
>>> gms = mda.Universe(GMS_ASYMOPT)
>>> gms.atoms
<AtomGroup with 6 atoms>
>>> g = gms.atoms[0]
>>> g.atomiccharge
8
>>> g.name
'O'
>>> g.type
'O'
>>> g.element
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 3566, in __getattr__
    cls=self.__class__.__name__, attr=attr))
AttributeError: Atom has no attribute element

This is technically correct, but I think most people would expect the atomiccharge to refer to a partial atomic charge. A more informational name might be atomicnumber.

atomnum

The Desmond DMS parser uses the TopologyAttr atomnum. The Desmond User Guide (section 17.1.1) asserts that 'anum' refers to the atomic number of a particle, but this seems implausible:

>>> from MDAnalysis.tests.datafiles import DMS
>>> dms = mda.Universe(DMS)
>>> dms.atoms.names
array(['N', 'HT1', 'HT2', ..., 'C', 'OT1', 'OT2'], dtype=object)
>>> dms.atoms.atomnums
array([0, 0, 0, ..., 0, 0, 0], dtype=int32)

element vs type

A non-exhaustive search indicates that only the prmtop TOPParser implements the element TopologyAttr.

>>> prm = mda.Universe(PRM)
>>> prm.atoms[:6].names
array(['N', 'H1', 'H2', 'H3', 'CA', 'HA'], dtype=object)
>>> prm.atoms[:6].types
array(['N3', 'H', 'H', 'H', 'CT', 'HP'], dtype=object)
>>> prm.atoms[:6].elements
array(['N', 'H', 'H', 'H', 'C', 'H'], dtype=object)

I looked at formats like GAMESS and XYZ that as far as I know explicitly state element, as well as formats like PDB that either read an element column or guess it from the name. type alternately refers to elements:

>>> pdb = mda.Universe(PDB)
>>> pdb.atoms.types
array(['N', 'H', 'H', ..., 'NA', 'NA', 'NA'], dtype=object)
>>> pdb.atoms.names
array(['N', 'H1', 'H2', ..., 'NA', 'NA', 'NA'], dtype=object)

or to force field / Autodock atom types:

>>> pdbqt = mda.Universe(PDBQT_input)
>>> pdbqt.atoms.types
array(['N', 'HD', 'HD', ..., 'C', 'C', 'OA'], dtype=object)
>>> pdbqt.atoms.names
array(['N', 'HN1', 'HN2', ..., 'CA', 'C', 'O'], dtype=object)
>>> pdbqt.atoms.masses
array([14.007,  0.   ,  0.   , ..., 12.011, 12.011,  0.   ])
>>> pdbqt.atoms.elements
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/core/groups.py", line 2278, in __getattr__
    cls=self.__class__.__name__, attr=attr))
AttributeError: AtomGroup has no attribute elements

My opinion is also that the 0 mass for the Hs and O should have raised a warning.

resid vs resnum

I can't see a difference between resnum and resid.

>>> crd = mda.Universe(CRD)
>>> crd.atoms.resnums
array([  1,   1,   1, ..., 214, 214, 214])
>>> crd.atoms.resids
array([  1,   1,   1, ..., 214, 214, 214])
>>> tpr = mda.Universe(TPR)
>>> tpr.atoms.resnums
array([    0,     0,     0, ..., 11299, 11300, 11301])
>>> tpr.atoms.resids
array([    0,     0,     0, ..., 11299, 11300, 11301])

GROMACS indexes from 1 in user-readable files, as well.

chainIDs

The Desmond parser appears to be the only one to implement the chainID TopologyAttr.

>>> dms.atoms.chainIDs
array(['X', 'X', 'X', ..., 'X', 'X', 'X'], dtype=object)

Comments in the code suggest that all other parsers turn chains into segments.

tempfactors vs bfactors

These parsers have tempfactors but not bfactors:

MMTFParser has bfactors but not tempfactors.

>>> mmtf.atoms[:6].bfactors
array([ 9.48, 10.88, 10.88, 10.88, 10.88,  9.48])

TPRParser has neither bfactors nor tempfactors, despite the documentation thinking it does.

This impacts on selection language. Selecting same bfactor as *selection* only works for MMTFs because selection language doesn't recognise the tempfactor attribute.

ids

I don't understand the meaning of atom IDs for file formats like XYZ. TPR atom IDs also start from 0, despite GROMACS indexing from 1.

>>> tpr = mda.Universe(TPR)
>>> tpr.atoms.ids
array([    0,     1,     2, ..., 47678, 47679, 47680])

masses

Failing to guess masses gives masses of 0. This is done atom-by-atom, so users may only check the end of an array like the one below, not realising that they do not have appropriate masses elsewhere. This seems like a bad idea.

>>> mol2.atoms.masses
array([ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   , 18.998,  0.   , 35.45 ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.008,  1.008,
        1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,
        1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,  1.008,
        1.008])

resnames

The GSDParser has numbers as resnames. Should this be the case?

>>> gsd = mda.Universe(GSD)
>>> gsd.atoms.resnames
array([0, 1, 2, ..., 647, 647, 647], dtype=object)
>>> type(gsd.atoms[0].resname)
<class 'int'>
orbeckst commented 5 years ago

sigh

Thanks for documenting this. It is really helpful that you're taking a global view.

lilyminium commented 5 years ago
>>> gsd = mda.Universe(GSD)
>>> gsd.atoms.resnames
array([0, 1, 2, ..., 647, 647, 647], dtype=object)
>>> gsd.select_atoms("resname 0")
<AtomGroup with 0 atoms>
>>> gsd.select_atoms("resname 2")
<AtomGroup with 0 atoms>

^ A consequence of integer resnames.

richardjgowers commented 5 years ago

Ew gross. So one option is to make some “reserved” topogyattrs statically typed.

On Wed, Oct 23, 2019 at 16:07, Lily Wang notifications@github.com wrote:

gsd = mda.Universe(GSD) gsd.atoms.resnames array([0, 1, 2, ..., 647, 647, 647], dtype=object) gsd.select_atoms("resname 0") <AtomGroup with 0 atoms> gsd.select_atoms("resname 2") <AtomGroup with 0 atoms>

^ A consequence of integer resnames.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2362?email_source=notifications&email_token=ACGSGB4W4MMEQQEIFIH4ZULQQBSCBA5CNFSM4I7MLWZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECBYFAQ#issuecomment-545489538, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB2PUVNK6Y2OZIYAWSLQQBSCBANCNFSM4I7MLWZA .

richardjgowers commented 2 years ago

This would be cool to address in 3.0 as it "breaks" things, but it needs fixing (ping #3800)

orbeckst commented 7 months ago

Consolidating attributes looks like a project to be put on the 3.0 roadmap as it is related to

With PR #3753 almost done, we will have a new guesser infrastructure. We should then clean-up our attributes.

orbeckst commented 7 months ago

And by "putting on the roadmap" I mean "We need to figure out how we will make it actually happen." — something for the next dev meeting.