Open RMeli opened 5 years ago
Thanks for the useful summary.
We need to better define which quantities are "tags" that we just read from the file (e.g., icode is used in this way, resnum ought to behave in this way but I am not sure if it always does) and which ones have a universal meaning in MDAnalysis that is independent from the original file format. I think we were moving towards making the internal "ix" indices (such as atoms.ix
and residues.ix
) also properly available for the user, but at the moment, they are not accessible via the selection language. That leaves the selections with quantities that can have subtly different meanings in different file formats.
Summary of my ramblings: Make a suggestion how you think it should work, ideally with examples how it will work for data coming from different file formats (PDB, GRO, PSF, TPR, ...).
I completely understand that different formats have different needs (and agree that the documentation could be clearer about universal quantities and tags). The only thing I found quite strange is that I can select a single residue with a particular resid
but then the resid
attribute of that selection is different:
import MDAnalysis as mda
u = mda.Universe("test.pdb")
ag = u.select_atoms("resid 30A")
print(ag)
print(ag.resids)
<AtomGroup [<Atom 39: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 40: H of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 41: CA of type C of resname ASN, resid 30 and segid A and altLoc >, ..., <Atom 50: ND2 of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 51: 1HD2 of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 52: 2HD2 of type H of resname ASN, resid 30 and segid A and altLoc >]>
[30 30 30 30 30 30 30 30 30 30 30 30 30 30]
As you can see, the residue 30A has been selected correctly, but then the resids
of the atom group are different. Interestingly enough (since I didn't dig into the source code), the information is not completely lost:
ag.select_atoms("resid 30A")
<AtomGroup with 14 atoms>
But this is even more confusing, since looking at print(ag)
one could be tempted to do the following:
ag.select_atoms("resid 30")
<AtomGroup with 0 atoms>
I guess my point is that the fact that the resid
used for selection and the one printed out do not match is confusing and potentially error prone.
My suggestion would be to store full residue information in resid
. Such information should allow to uniquely identify a single residue. This will of course depend on the format (I'm not knowledgeable in many of them...); for some formats this will simply be the equal to resnum
but for other formats it could be slightly different. For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)
Hey @RMeli I'm the one responsible for most of this behaviour, but I'm not really an expert on all these corner cases, so I'm happy to be corrected! Btw if you do something like select_atoms('resid 30A-30C')
it should select the range too.
WRT resnum - I don't actually know what this is, it's something that existed before my time on MDA, and I kept around for backwards compat reasons... Maybe we should remove it (ie alias it to resid).
Currently ICodes are treated as a completely separate attribute, so as independents as charge. Reading this has made me think maybe we should instead lump it together with Resids...
So currently resids are created by the file parsing creating one of these TopologyAttr
objects:
https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/topologyattrs.py#L1322
And passing it to the Universe. Universe
then monkeypatches AtomGroup
s to let them know that they have resids. (This is why if you load an XYZ file, the AtomGroups won't even have a .charge
attribute). A Group
can then query the TopologyAttr
object via the get_x
method.
But maybe PDB files (and anything else that provides both Resid and ICode) should instead create a single ResidAndICode
attribute, which then could return 30A
as .resid
attribute. So the TopologyAttr
would instead hold 2 numpy arrays.
So things that should get changed
Resid
attribute to also handle icodes since they're linked.If you want to work on any of those I'm happy to help you with it.
Hi @richardjgowers. I just tested
u.select_atoms("resid 30-30B")
and it works as expected (even if the first residue has no insertion code), which is great! I'm more and more convinced that resid
s behave as expected, but the residue.resid
attribute is simply not a faithful representation.
I didn't know resnum
was there only for backward compatibility. I thought the idea was that resid
is an unique residue identifier while resnum
is the residue number (which is some formats, such as PDB, does not uniquely identify a residue). What do you think of having them both, but with a clear distinction? resid
could be defined as an unique identifier of a single residue (or a group of mutually exclusive residues, if there are alternative locations...) and resnum
as the residue number (equal to resid
in some formats). resnum
could be useful to select all the residues with same residue number but with different (or no) insertion code.
The downside of this approach is that resid
could be a string or a number, depending on the format; how do you feel about that?
I would be happy to work on this improvement with some guidance, but at the moment I'm quite busy with the last GSoC leg with another organisation and therefore it will have to wait until September.
As a weird hack you could change the get
method of resids to check if there's icodes and return a version with icodes appended. It gets messy because then the return type has to be strings, whereas I think a lot of people might expect an integer array. This is going to break a lot of code where people are doing arithmetic with their resids.
FWIW resids are far from unique, we often see files where these have looped around.
I think what you're describing for resnum
is what we currently have for resindex
. Ie a number generated by MDAnalysis which enumerates the residues we think the topology represents.
Yes, I think such a change would break a lot of code and could be introduced only in a major revision. However, to fix the broken codes it would be sufficient to switch from resid
to resnum
when storing such attributes (during selection they work as expected).
I don't need to add the change in an hacky way at the moment since I'm only using select_atoms
which appears to work correctly from my tests (reported here). However, problems will arise if one extracts resid
s for later use and therefore I wanted to raise the problem.
If I understand correctly, resindex
is an internal representation and therefore is different from resnum
in the sense that resnum
should still correspond to the numbering in the original file. But my understanding or assumptions could be completely wrong!
How does VMD handle resid vs resnum?
EDIT: And PyMOL, Chimera, ... ?
https://github.com/MDAnalysis/mdanalysis/issues/2308#issuecomment-517639072
For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)
The latter is not actually true: you also need a chain identifier because you can have the same resid in multiple chains. Furthermore, everything goes to hell in a hand basket when we deal with PDB files from simulations where resID wraps around after 9999, as @richardjgowers alluded to in https://github.com/MDAnalysis/mdanalysis/issues/2308#issuecomment-517699188.
MDAnalysis has to deal with the problem that formats like PDB are used in different contexts and everbody would like their files to behave the way that they expect. What's difficult for us is figuring out what these expectations are (especially when they differ from the published standard). Therefore the points that you're raising are very valuable!
Yes, there can be multiple chains of course. And now that you point this out, I find the wording on the PDB Format Specification somewhat misleading...
I completely understand that there are many different contexts for PDB files alone, this is why I opened this discussion as Question and not as Issue. ; )
I wanted to raise the awareness that the following code is somewhat problematic:
ATOM 428 2HD2 LEU A 30 23.678 -17.592 2.996 1.00 0.00 H
ATOM 429 3HD2 LEU A 30 22.002 -17.119 3.368 1.00 0.00 H
ATOM 430 N ASN A 30A 24.832 -12.436 1.742 1.00 21.44 N
ATOM 431 H ASN A 30A 24.010 -12.091 1.207 1.00 0.00 H
ATOM 432 CA ASN A 30B 25.424 -11.537 2.735 1.00 21.84 C
ATOM 433 HA ASN A 30B 26.512 -11.534 2.668 1.00 0.00 H
import MDAnalysis as mda
u = mda.Universe("test.pdb")
ag = u.select_atoms("resid 30A")
print(ag)
resids = set(ag.resids)
for resid in resids:
agw = u.select_atoms(f"resid {resid}")
print(agw)
<AtomGroup [<Atom 3: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 4: H of type H of resname ASN, resid 30 and segid A and altLoc >]>
<AtomGroup [<Atom 1: 2HD2 of type H of resname LEU, resid 30 and segid A and altLoc >, <Atom 2: 3HD2 of type H of resname LEU, resid 30 and segid A and altLoc >]>
and find out if this behaviour is expected/wanted or not.
Maybe it's just a matter of updating the documentation to make clear that resid
can be used to select residues with insertion codes but the resid
attribute does not contain that information and it has to be accessed explicitly with icode
?
@RMeli thanks – not much I could do since then but I am adding this issue to a growing list of interconnected problems. Your comments are much appreciated, sorry that we haven't been doing anything since.
No pronlem @orbeckst, I've been following a bit the related issue and completely understand that there is much to be done and little time. I too have been quite busy lately, but I'll be happy to help where I can!
Replying to @RMeli https://github.com/MDAnalysis/mdanalysis/issues/2672#issuecomment-619344384 here because it seems more relevant:
I think suddenly making resid
string-type would break a lot of code out there, including mine. 🙃
If a new topology attribute is added, it's probably best that the special resid-icode selection get moved to that token to avoid the mismatch here between the selection string and the actual attribute.
Would it ever be an issue if the new ResLabels didn't match str(Resids)+(Icodes)? Currently all other topology attributes are independent of each other. For example someone tweaks their icodes attribute and expects that to show up in the reslabels and selection results.
Miscellaneous notes:
resnum
is an alias for resid
in (iirc) all topology readers. .resnum
or .resnums
shows up in the MDAnalysis codebase is once in a docstring in analysis.rms
, and then in some tests. Every other reference to numeric residue labels uses .resids
. Changing the type and content of this attribute should really be considered very carefully.I agree changing resid
to a string is not a viable option (and will have consequences on performance). The problem I was rising here is that resid
works correctly on selection, but the attribute stored does not match the selection (resid 1A
will correctly select residue 1A
, but the resid
attribute will only contain 1
).
I think issues #2669 and #2672 can be solved by using resindices
internally, but the problem of resid
not being bijective remains.
PS: I usually compare my selections with PyMol, which seems to act as I expect.
Ah, we may have been talking past each other a little. From my perspective then the resid
attribute values are fine as is. Instead, it's ResidSelection
that has unexpected behaviour, and selecting "resid 1A"
should give an error; you should be required to select "resid 1 and icode A"
for the 1A residue. Of course this will annoy anyone who uses the resid shorthand...
Yes, I think I might be interpreting resid
in a different way... =)
I interpret it as it is used in select_atoms
where if you have residues 1
, 1A
and 1B
the selection resid 1
will only select 1
(in contrast, resnum 1
will select all three). That is, for me resid
should be an unique identifier.
They way you interpret resid
is the way I interpret resnum
; selecting resnum 1A
will indeed throw a SelectionError: Selection failed: 'Failed to parse number: 1A'
while resnum 1 and icode A
works as expected.
More general question: do we have a comparison of how different commonly used (bio) comp chem software does selections? What’s common, whats’s different?
IIRC there was also a list of topology attributes on the wiki (and probably now in the User Guide?) that explained what they were supposed to mean. Eg resnum vs resid. It would be important to get the semantics defined for once and all.
@orbeckst Below I compare MDAnalysis, PyMol and ProDy. I will try to add other comparisons in the coming days, but I already outlined the problems I see with MDAnalysis.
I edited earlier today the description of this issue to note that @lilyminium already added information about using insertion codes with the resid
selection in the User Guide - Simple Selections, which partially address the problem.
In the User Guide - Canonical Attributes it is clearly stated that resid
is the residue number and resnum
is an alias for resid
.
ATOM 1 N GLY A 1 14.785 -12.720 25.303 1.00 21.06 N
ATOM 2 H GLY A 1 15.188 -13.267 24.516 1.00 0.00 H
ATOM 3 CA GLY A 1 13.751 -13.323 26.187 1.00 21.45 C
ATOM 4 HA1 GLY A 1 13.969 -13.130 27.237 1.00 0.00 H
ATOM 5 HA2 GLY A 1 12.758 -12.944 25.947 1.00 0.00 H
ATOM 6 C GLY A 1 13.869 -14.830 25.859 1.00 21.97 C
ATOM 7 O GLY A 1 14.414 -15.179 24.801 1.00 22.20 O
ATOM 8 H LYS A 1A 25.735 -14.238 12.187 1.00 28.77 N
ATOM 9 A LYS A 1A 25.698 -14.467 13.201 1.00 0.00 H
ATOM 10 CA LYS A 1A 25.215 -12.920 11.783 1.00 28.80 C
ATOM 11 HA LYS A 1A 25.025 -12.972 10.711 1.00 0.00 H
ATOM 12 C LYS A 1A 26.223 -11.798 12.006 1.00 29.74 C
ATOM 13 O LYS A 1A 26.730 -11.602 13.135 1.00 28.72 O
ATOM 14 CB LYS A 1A 23.937 -12.618 12.497 1.00 30.64 C
ATOM 15 HB1 LYS A 1A 23.285 -13.448 12.225 1.00 0.00 H
ATOM 16 HB2 LYS A 1A 24.203 -12.697 13.551 1.00 0.00 H
ATOM 17 CG LYS A 1A 23.015 -11.397 12.444 1.00 31.59 C
ATOM 18 HG1 LYS A 1A 23.519 -10.518 12.845 1.00 0.00 H
ATOM 19 HG2 LYS A 1A 22.700 -11.198 11.420 1.00 0.00 H
ATOM 20 CD LYS A 1A 21.796 -11.779 13.326 1.00 31.47 C
ATOM 21 HD1 LYS A 1A 21.317 -12.644 12.868 1.00 0.00 H
ATOM 22 HD2 LYS A 1A 22.176 -12.055 14.310 1.00 0.00 H
ATOM 23 CE LYS A 1A 20.754 -10.743 13.523 1.00 29.40 C
ATOM 24 HE1 LYS A 1A 20.275 -10.535 12.566 1.00 0.00 H
ATOM 25 HE2 LYS A 1A 21.224 -9.834 13.898 1.00 0.00 H
ATOM 26 NZ LYS A 1A 19.715 -11.194 14.500 1.00 30.69 N
ATOM 27 HZ1 LYS A 1A 19.259 -12.059 14.145 1.00 0.00 H
ATOM 28 HZ2 LYS A 1A 20.165 -11.389 15.417 1.00 0.00 H
ATOM 29 HZ3 LYS A 1A 19.001 -10.446 14.615 1.00 0.00 H
ATOM 1 N GLN A 1B 30.861 12.781 18.536 1.00 44.73 N
ATOM 2 H GLN A 1B 30.423 12.110 19.199 1.00 0.00 H
ATOM 3 CA GLN A 1B 32.085 12.372 17.843 1.00 45.43 C
ATOM 4 HA GLN A 1B 32.589 13.273 17.493 1.00 0.00 H
ATOM 5 C GLN A 1B 31.893 11.457 16.657 1.00 46.22 C
ATOM 6 O GLN A 1B 31.622 11.923 15.525 1.00 46.88 O
TER
HETATM 1 C LIG 1 17.735 -17.178 22.612 1.00 0.00 C
HETATM 2 C LIG 1 18.543 -17.350 21.462 1.00 0.00 C
HETATM 3 C LIG 1 19.877 -17.046 21.558 1.00 0.00 C
HETATM 4 C LIG 1 20.401 -16.566 22.766 1.00 0.00 C
HETATM 5 C LIG 1 19.613 -16.358 23.893 1.00 0.00 C
HETATM 6 C LIG 1 20.084 -15.791 25.046 1.00 0.00 C
HETATM 7 C LIG 1 19.241 -15.568 26.114 1.00 0.00 C
HETATM 8 C LIG 1 17.893 -15.971 26.028 1.00 0.00 C
HETATM 9 C LIG 1 17.370 -16.510 24.879 1.00 0.00 C
HETATM 10 C LIG 1 18.205 -16.689 23.809 1.00 0.00 C
HETATM 11 N LIG 1 21.383 -15.285 25.104 1.00 0.00 N
HETATM 12 C LIG 1 21.898 -14.217 24.214 1.00 0.00 C
HETATM 13 C LIG 1 22.635 -16.213 25.496 1.00 0.00 C
HETATM 14 S LIG 1 15.909 -17.581 22.417 1.00 0.00 S
HETATM 15 O LIG 1 15.988 -18.013 21.010 1.00 0.00 O
HETATM 16 O LIG 1 14.998 -18.191 23.402 1.00 0.00 O
HETATM 17 N LIG 1 14.893 -15.970 22.485 1.00 0.00 N
HETATM 18 C LIG 1 15.193 -15.114 21.296 1.00 0.00 C
HETATM 19 C LIG 1 16.307 -14.167 21.752 1.00 0.00 C
HETATM 20 O LIG 1 16.212 -13.781 22.914 1.00 0.00 O
HETATM 21 C LIG 1 13.942 -14.354 20.879 1.00 0.00 C
HETATM 22 C LIG 1 13.232 -13.686 22.058 1.00 0.00 C
HETATM 23 C LIG 1 11.892 -13.133 21.595 1.00 0.00 C
HETATM 24 N LIG 1 11.218 -12.647 22.780 1.00 0.00 N
HETATM 25 C LIG 1 11.457 -11.554 23.483 1.00 0.00 C
HETATM 26 N LIG 1 12.359 -10.674 23.045 1.00 0.00 N
HETATM 27 N LIG 1 10.723 -11.336 24.566 1.00 0.00 N1+
HETATM 28 N LIG 1 17.280 -13.692 20.830 1.00 0.00 N
HETATM 29 C LIG 1 17.043 -14.033 19.327 1.00 0.00 C
HETATM 30 C LIG 1 18.421 -14.472 18.809 1.00 0.00 C
HETATM 31 C LIG 1 19.610 -13.417 19.194 1.00 0.00 C
HETATM 32 C LIG 1 19.696 -13.083 20.765 1.00 0.00 C
HETATM 33 C LIG 1 18.351 -12.680 21.310 1.00 0.00 C
HETATM 34 C LIG 1 16.593 -13.014 18.550 1.00 0.00 C
HETATM 35 C LIG 1 15.941 -13.425 17.179 1.00 0.00 C
HETATM 36 S LIG 1 15.783 -14.904 14.458 1.00 0.00 S
HETATM 37 O LIG 1 17.488 -11.922 16.287 1.00 0.00 O
HETATM 38 C LIG 1 17.059 -15.583 13.508 1.00 0.00 C
HETATM 39 C LIG 1 16.864 -13.556 14.739 1.00 0.00 C
HETATM 40 C LIG 1 16.825 -12.903 16.107 1.00 0.00 C
HETATM 41 C LIG 1 18.146 -14.734 13.451 1.00 0.00 C
HETATM 42 N LIG 1 18.049 -13.554 14.106 1.00 0.00 N
END
u.select_atoms("resid 1")
selects A/GLY/1
and /LIG/1
u.select_atoms("resid 1A")
selects only A/LYS/1A
u.select_atoms("resnum 1")
selects everythingu.select_atoms("resnum 1A")
fails with SelectionError
u.select_atoms("byres (around 2 resname LIG)")
selects everything
(around 2 resname LIG)
u.select_atoms("same residue as (around 2 resname LIG)")
selects everything
(around 2 resname LIG)
u.select_atoms("same resid as (around 2 resname LIG)")
selects everything
(around 2 resname LIG)
resid 1
only selects A/GLY/1
and /LIG/1
(around 2 resname LIG)
is to select atoms near /LIG/1
u.select_atoms("same resnum as (around 2 resname LIG)")
selects everythingu.select_atoms("resid 1")
selects A/GLY/1
and /LIG/1
u.select_atoms("resid 1A")
selects only A/LYS/1A
u.select_atoms("resnum 1")
selects everythingu.select_atoms("resnum 1A")
fails with SelectionError
u.select_atoms("byres (around 2 resname LIG)")
selects A/GLY/1
and A/LYS/1A
(around 2 resname LIG)
u.select_atoms("same residue as (around 2 resname LIG)")
selects A/GLY/1
and A/LYS/1A
(around 2 resname LIG)
u.select_atoms("same resid as (around 2 resname LIG)")
selects everything
(around 2 resname LIG)
resid 1
only selects A/GLY/1
and /LIG/1
(around 2 resname LIG)
is to select atoms near /LIG/1
u.select_atoms("same resnum as (around 2 resname LIG)")
selects everythingresid
behaves differently in u.select_atoms("resid 1")
(where resid
takes the insertion codes into account) and u.select_atoms("same resid as (around 2 resname LIG)")
(where insertion codes are ignored).
In u.select_atoms("resid 1")
, resid
is not interchangeable with resnum
; in u.select_atoms("same resid as (around 2 resname LIG)")
, resid
is interchangeable with resnum
.
In [3]: s1 = u.select_atoms("same resnum as (around 2 resname LIG)")
In [4]: s2 = u.select_atoms("same resid as (around 2 resname LIG)")
In [5]: s1 == s2
Out[5]: True
In [6]: s3 = u.select_atoms("resid 1")
In [7]: s4 = u.select_atoms("resnum 1")
In [8]: s3 == s4
Out[8]: False
select residue 1
selects both A/GLY/1
and /LIG/1
A/GLY/1
and /LIG/1
have the same residue number an no insertion codeA/LYS/1A
and A/GLN/1B
are not selectedselect residue 1A
selects only A/LYS/1A
select byres resname LIG around 2.0
selects A/GLY/1
and A/LYS/1A
/LIG/1
ProDy has an interesting take:
p.select("resnum 1")
selects everything
resnum
in MDanalysisp.select("resnum 1A")
selects only A/LYS/1A
p.select("resnum 1_")
selects A/GLY/1
and /LIG/1
_
represent explicitly the absence of insertion codep.select('same residue as exwithin 2.0 of resname LIG')
selects A/GLY/1
and A/LYS/1A
p.select('same residue as within 2.0 of resname LIG')
selects A/GLY/1
, A/LYS/1A
and /LIG/1
Question: is there a way to select an empty insertion code using the icode
selection keyword in MDAnalysis?
@RMeli no, I don't think there's a way to select an empty string. Thanks for the thorough summary of the issue!
Edit: I didn't see residue 1B in the test file there. The potential residues for selection in every point below are [Gly1, Lys1A, Lig1].
To add (all but mdtraj are from skimming google, so feel free to correct):
"select residue 1"
selects everything"select resid 2"
selects the ligand"resid 1"
and "resseq 1 and icode ' '"
(note space) selects all but 1A"resseq 1"
selects all"resid 1A"
selects 1Aatom.select(resno=[1])
selects allatom.select(resno=[1], insert=['A'])
selects 1Aatom.select(resno=[1], insert=[''])
selects all but 1A' '
(spaces), not ''
.Bonus
The keyword resid
variously refers to:
Summing up
It sounds like we need a special dtype for the resid array that then allows comparison and ordering of both integers (26) and strings (“26A”). All working on resids get coerced to the special dtype so it works as expected and all resids include icodes by default. I’ll see if I can make the struct in numpy...
On Sun, Apr 26, 2020 at 08:32, Rocco Meli notifications@github.com wrote:
Reopened #2308 https://github.com/MDAnalysis/mdanalysis/issues/2308.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2308#event-3272649246, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB4QEDWHQIZ2KDL6DJLROPPPXANCNFSM4IIWJPPQ .
I using a data set of ~16k PDB files and many of them contain insertion codes (see PDB File Format). I found that some codes just ignore (or forget) them and therefore I'm being extra careful whit my analysis. I noticed that MDAnalysis also shows some strange behavior.
Consider the following PDB file, where there are two different residues with the same residue number but with and without insertion code:
If I load this PDB file
all the three residues are correctly identified:
We see that
resid
andresnum
have the same value. However, a selection usingresid
andresnum
gives different results:The
resid
keyword only selects the residue 30 without insertion code, while theresnum
keyword select both residues with residue number 30. This is because,resid
appears to be sensitive to insertion codes:which is a nice and short alternative to
The alternative is particularly useful with f-strings:
resid {res.num}{res.icode}
woks even ificode == ""
.The questions I have the following:
res.resid
return the full residue name (residue number and insertion code)?Personally, I find odd that
u.select_atoms("resid {res.resid}")
might or might not return the correct residue when if insertion codes are involved. I would be happy to provide a fix if this is the case.Additionally, in the MDAnalysis Documentation - Selection Commands there is no mention of theicode
keyword and theresid
is described as a single residue number or a range of numbers (without mention to the insertion code functionality).Should the documentation be clearer about the difference betweenresid
andresnum
, and explicitly mention theicode
selection keyword as well?Again, I would be happy to provide an improvement.EDIT 25/04/2020: This is still the case in the MDAnalysis Documentation - Selection Commands, but the new MDAnalysis User Guide now mentions insertion codes explicitly. Thanks @lilyminium !
Credits: Useful discussions with @IAlibay.