MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.28k stars 642 forks source link

inconsistent handling of PDB Insertion Codes and resid #2308

Open RMeli opened 5 years ago

RMeli commented 5 years ago

I using a data set of ~16k PDB files and many of them contain insertion codes (see PDB File Format). I found that some codes just ignore (or forget) them and therefore I'm being extra careful whit my analysis. I noticed that MDAnalysis also shows some strange behavior.

Consider the following PDB file, where there are two different residues with the same residue number but with and without insertion code:

ATOM    392  N   LEU A  29      26.918 -15.013  -3.000  1.00 18.00           N
ATOM    393  H   LEU A  29      26.291 -15.692  -3.477  1.00  0.00           H
ATOM    394  CA  LEU A  29      26.474 -13.636  -2.851  1.00 17.76           C
ATOM    395  HA  LEU A  29      27.360 -13.024  -2.682  1.00  0.00           H
ATOM    396  C   LEU A  29      25.563 -13.475  -1.650  1.00 17.80           C
ATOM    397  O   LEU A  29      25.026 -12.400  -1.444  1.00 17.77           O
ATOM    398  CB  LEU A  29      25.771 -13.127  -4.152  1.00 16.97           C
ATOM    399  HB1 LEU A  29      24.803 -13.622  -4.236  1.00  0.00           H
ATOM    400  HB2 LEU A  29      25.621 -12.051  -4.064  1.00  0.00           H
ATOM    401  CG  LEU A  29      26.567 -13.401  -5.431  1.00 17.75           C
ATOM    402  HG  LEU A  29      26.698 -14.483  -5.455  1.00  0.00           H
ATOM    403  CD1 LEU A  29      25.796 -12.941  -6.688  1.00 16.58           C
ATOM    404 1HD1 LEU A  29      24.849 -13.478  -6.749  1.00  0.00           H
ATOM    405 2HD1 LEU A  29      25.604 -11.870  -6.623  1.00  0.00           H
ATOM    406 3HD1 LEU A  29      26.392 -13.152  -7.576  1.00  0.00           H
ATOM    407  CD2 LEU A  29      27.971 -12.768  -5.434  1.00 17.24           C
ATOM    408 1HD2 LEU A  29      27.881 -11.686  -5.333  1.00  0.00           H
ATOM    409 2HD2 LEU A  29      28.549 -13.165  -4.600  1.00  0.00           H
ATOM    410 3HD2 LEU A  29      28.473 -13.006  -6.372  1.00  0.00           H
ATOM    411  N   LEU A  30      25.385 -14.539  -0.852  1.00 19.19           N
ATOM    412  H   LEU A  30      25.849 -15.442  -1.076  1.00  0.00           H
ATOM    413  CA  LEU A  30      24.520 -14.423   0.357  1.00 19.64           C
ATOM    414  HA  LEU A  30      23.626 -13.869   0.070  1.00  0.00           H
ATOM    415  C   LEU A  30      25.272 -13.661   1.477  1.00 20.27           C
ATOM    416  O   LEU A  30      26.235 -14.195   2.067  1.00 21.08           O
ATOM    417  CB  LEU A  30      24.083 -15.795   0.874  1.00 19.89           C
ATOM    418  HB1 LEU A  30      23.672 -16.353   0.033  1.00  0.00           H
ATOM    419  HB2 LEU A  30      24.968 -16.307   1.251  1.00  0.00           H
ATOM    420  CG  LEU A  30      23.021 -15.790   2.004  1.00 20.22           C
ATOM    421  HG  LEU A  30      23.443 -15.166   2.792  1.00  0.00           H
ATOM    422  CD1 LEU A  30      21.673 -15.190   1.518  1.00 23.01           C
ATOM    423 1HD1 LEU A  30      21.832 -14.163   1.189  1.00  0.00           H
ATOM    424 2HD1 LEU A  30      21.291 -15.784   0.688  1.00  0.00           H
ATOM    425 3HD1 LEU A  30      20.954 -15.204   2.337  1.00  0.00           H
ATOM    426  CD2 LEU A  30      22.754 -17.189   2.582  1.00 22.25           C
ATOM    427 1HD2 LEU A  30      22.393 -17.845   1.790  1.00  0.00           H
ATOM    428 2HD2 LEU A  30      23.678 -17.592   2.996  1.00  0.00           H
ATOM    429 3HD2 LEU A  30      22.002 -17.119   3.368  1.00  0.00           H
ATOM    430  N   ASN A  30A     24.832 -12.436   1.742  1.00 21.44           N
ATOM    431  H   ASN A  30A     24.010 -12.091   1.207  1.00  0.00           H
ATOM    432  CA  ASN A  30A     25.424 -11.537   2.735  1.00 21.84           C
ATOM    433  HA  ASN A  30A     26.512 -11.534   2.668  1.00  0.00           H
ATOM    434  C   ASN A  30A     24.963 -12.022   4.124  1.00 22.73           C
ATOM    435  O   ASN A  30A     23.765 -12.131   4.358  1.00 23.27           O
ATOM    436  CB  ASN A  30A     24.896 -10.119   2.493  1.00 21.37           C
ATOM    437  HB1 ASN A  30A     25.152  -9.818   1.477  1.00  0.00           H
ATOM    438  HB2 ASN A  30A     23.812 -10.124   2.607  1.00  0.00           H
ATOM    439  CG  ASN A  30A     25.477  -9.123   3.452  1.00 23.63           C
ATOM    440  OD1 ASN A  30A     24.901  -8.853   4.524  1.00 25.25           O
ATOM    441  ND2 ASN A  30A     26.651  -8.605   3.113  1.00 19.86           N
ATOM    442 1HD2 ASN A  30A     27.131  -7.942   3.754  1.00  0.00           H
ATOM    443 2HD2 ASN A  30A     27.091  -8.862   2.206  1.00  0.00           H

If I load this PDB file

import MDAnalysis as mda
u = mda.Universe("test.pdb")
print(u.residues)
for res in u.residues:
    print(res.resid, res.resnum, res.resname, res.icode)

all the three residues are correctly identified:

<ResidueGroup [<Residue LEU, 29>, <Residue LEU, 30>, <Residue ASN, 30>]>
29 29 LEU
30 30 LEU
30 30 ASN A

We see that resid and resnum have the same value. However, a selection using resid and resnum gives different results:

u.select_atoms("resid 30")
<AtomGroup with 19 atoms>
u.select_atoms("resnum 30")
<AtomGroup with 33 atoms>

The resid keyword only selects the residue 30 without insertion code, while the resnum keyword select both residues with residue number 30. This is because, resid appears to be sensitive to insertion codes:

u.select_atoms("resid 30A")
<AtomGroup with 14 atoms>

which is a nice and short alternative to

u.select_atoms("resnum 30 and icode A")
<AtomGroup with 14 atoms>

The alternative is particularly useful with f-strings: resid {res.num}{res.icode} woks even if icode == "".

The questions I have the following:

Personally, I find odd that u.select_atoms("resid {res.resid}") might or might not return the correct residue when if insertion codes are involved. I would be happy to provide a fix if this is the case.

Additionally, in the MDAnalysis Documentation - Selection Commands there is no mention of the icode keyword and the resid is described as a single residue number or a range of numbers (without mention to the insertion code functionality).

Again, I would be happy to provide an improvement.

EDIT 25/04/2020: This is still the case in the MDAnalysis Documentation - Selection Commands, but the new MDAnalysis User Guide now mentions insertion codes explicitly. Thanks @lilyminium !


Credits: Useful discussions with @IAlibay.

orbeckst commented 5 years ago

Thanks for the useful summary.

We need to better define which quantities are "tags" that we just read from the file (e.g., icode is used in this way, resnum ought to behave in this way but I am not sure if it always does) and which ones have a universal meaning in MDAnalysis that is independent from the original file format. I think we were moving towards making the internal "ix" indices (such as atoms.ix and residues.ix) also properly available for the user, but at the moment, they are not accessible via the selection language. That leaves the selections with quantities that can have subtly different meanings in different file formats.

Summary of my ramblings: Make a suggestion how you think it should work, ideally with examples how it will work for data coming from different file formats (PDB, GRO, PSF, TPR, ...).

RMeli commented 5 years ago

I completely understand that different formats have different needs (and agree that the documentation could be clearer about universal quantities and tags). The only thing I found quite strange is that I can select a single residue with a particular resid but then the resid attribute of that selection is different:

import MDAnalysis as mda

u = mda.Universe("test.pdb")
ag = u.select_atoms("resid 30A")

print(ag)
print(ag.resids)
<AtomGroup [<Atom 39: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 40: H of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 41: CA of type C of resname ASN, resid 30 and segid A and altLoc >, ..., <Atom 50: ND2 of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 51: 1HD2 of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 52: 2HD2 of type H of resname ASN, resid 30 and segid A and altLoc >]>

[30 30 30 30 30 30 30 30 30 30 30 30 30 30]

As you can see, the residue 30A has been selected correctly, but then the resids of the atom group are different. Interestingly enough (since I didn't dig into the source code), the information is not completely lost:

ag.select_atoms("resid 30A")
<AtomGroup with 14 atoms>

But this is even more confusing, since looking at print(ag) one could be tempted to do the following:

ag.select_atoms("resid 30")
<AtomGroup with 0 atoms>

I guess my point is that the fact that the resid used for selection and the one printed out do not match is confusing and potentially error prone.

My suggestion would be to store full residue information in resid. Such information should allow to uniquely identify a single residue. This will of course depend on the format (I'm not knowledgeable in many of them...); for some formats this will simply be the equal to resnum but for other formats it could be slightly different. For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)

richardjgowers commented 5 years ago

Hey @RMeli I'm the one responsible for most of this behaviour, but I'm not really an expert on all these corner cases, so I'm happy to be corrected! Btw if you do something like select_atoms('resid 30A-30C') it should select the range too.

WRT resnum - I don't actually know what this is, it's something that existed before my time on MDA, and I kept around for backwards compat reasons... Maybe we should remove it (ie alias it to resid).

Currently ICodes are treated as a completely separate attribute, so as independents as charge. Reading this has made me think maybe we should instead lump it together with Resids...

So currently resids are created by the file parsing creating one of these TopologyAttr objects:

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/topologyattrs.py#L1322

And passing it to the Universe. Universe then monkeypatches AtomGroups to let them know that they have resids. (This is why if you load an XYZ file, the AtomGroups won't even have a .charge attribute). A Group can then query the TopologyAttr object via the get_x method.

But maybe PDB files (and anything else that provides both Resid and ICode) should instead create a single ResidAndICode attribute, which then could return 30A as .resid attribute. So the TopologyAttr would instead hold 2 numpy arrays.

So things that should get changed

If you want to work on any of those I'm happy to help you with it.

RMeli commented 5 years ago

Hi @richardjgowers. I just tested

u.select_atoms("resid 30-30B")

and it works as expected (even if the first residue has no insertion code), which is great! I'm more and more convinced that resids behave as expected, but the residue.resid attribute is simply not a faithful representation.

I didn't know resnum was there only for backward compatibility. I thought the idea was that resid is an unique residue identifier while resnum is the residue number (which is some formats, such as PDB, does not uniquely identify a residue). What do you think of having them both, but with a clear distinction? resid could be defined as an unique identifier of a single residue (or a group of mutually exclusive residues, if there are alternative locations...) and resnum as the residue number (equal to resid in some formats). resnum could be useful to select all the residues with same residue number but with different (or no) insertion code.

The downside of this approach is that resid could be a string or a number, depending on the format; how do you feel about that?

I would be happy to work on this improvement with some guidance, but at the moment I'm quite busy with the last GSoC leg with another organisation and therefore it will have to wait until September.

richardjgowers commented 5 years ago

As a weird hack you could change the get method of resids to check if there's icodes and return a version with icodes appended. It gets messy because then the return type has to be strings, whereas I think a lot of people might expect an integer array. This is going to break a lot of code where people are doing arithmetic with their resids.

FWIW resids are far from unique, we often see files where these have looped around.

I think what you're describing for resnum is what we currently have for resindex. Ie a number generated by MDAnalysis which enumerates the residues we think the topology represents.

RMeli commented 5 years ago

Yes, I think such a change would break a lot of code and could be introduced only in a major revision. However, to fix the broken codes it would be sufficient to switch from resid to resnum when storing such attributes (during selection they work as expected).

I don't need to add the change in an hacky way at the moment since I'm only using select_atoms which appears to work correctly from my tests (reported here). However, problems will arise if one extracts resids for later use and therefore I wanted to raise the problem.

If I understand correctly, resindex is an internal representation and therefore is different from resnum in the sense that resnum should still correspond to the numbering in the original file. But my understanding or assumptions could be completely wrong!

orbeckst commented 5 years ago

How does VMD handle resid vs resnum?

EDIT: And PyMOL, Chimera, ... ?

orbeckst commented 5 years ago

https://github.com/MDAnalysis/mdanalysis/issues/2308#issuecomment-517639072

For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)

The latter is not actually true: you also need a chain identifier because you can have the same resid in multiple chains. Furthermore, everything goes to hell in a hand basket when we deal with PDB files from simulations where resID wraps around after 9999, as @richardjgowers alluded to in https://github.com/MDAnalysis/mdanalysis/issues/2308#issuecomment-517699188.

MDAnalysis has to deal with the problem that formats like PDB are used in different contexts and everbody would like their files to behave the way that they expect. What's difficult for us is figuring out what these expectations are (especially when they differ from the published standard). Therefore the points that you're raising are very valuable!

RMeli commented 5 years ago

Yes, there can be multiple chains of course. And now that you point this out, I find the wording on the PDB Format Specification somewhat misleading...

I completely understand that there are many different contexts for PDB files alone, this is why I opened this discussion as Question and not as Issue. ; )

I wanted to raise the awareness that the following code is somewhat problematic:

ATOM    428 2HD2 LEU A  30      23.678 -17.592   2.996  1.00  0.00           H
ATOM    429 3HD2 LEU A  30      22.002 -17.119   3.368  1.00  0.00           H
ATOM    430  N   ASN A  30A     24.832 -12.436   1.742  1.00 21.44           N
ATOM    431  H   ASN A  30A     24.010 -12.091   1.207  1.00  0.00           H
ATOM    432  CA  ASN A  30B     25.424 -11.537   2.735  1.00 21.84           C
ATOM    433  HA  ASN A  30B     26.512 -11.534   2.668  1.00  0.00           H
import MDAnalysis as mda

u = mda.Universe("test.pdb")

ag = u.select_atoms("resid 30A")
print(ag)

resids = set(ag.resids)

for resid in resids:
    agw = u.select_atoms(f"resid {resid}")
    print(agw)
<AtomGroup [<Atom 3: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 4: H of type H of resname ASN, resid 30 and segid A and altLoc >]>
<AtomGroup [<Atom 1: 2HD2 of type H of resname LEU, resid 30 and segid A and altLoc >, <Atom 2: 3HD2 of type H of resname LEU, resid 30 and segid A and altLoc >]>

and find out if this behaviour is expected/wanted or not.

Maybe it's just a matter of updating the documentation to make clear that resid can be used to select residues with insertion codes but the resid attribute does not contain that information and it has to be accessed explicitly with icode?

orbeckst commented 4 years ago

@RMeli thanks – not much I could do since then but I am adding this issue to a growing list of interconnected problems. Your comments are much appreciated, sorry that we haven't been doing anything since.

RMeli commented 4 years ago

No pronlem @orbeckst, I've been following a bit the related issue and completely understand that there is much to be done and little time. I too have been quite busy lately, but I'll be happy to help where I can!

lilyminium commented 4 years ago

Replying to @RMeli https://github.com/MDAnalysis/mdanalysis/issues/2672#issuecomment-619344384 here because it seems more relevant:

I think suddenly making resid string-type would break a lot of code out there, including mine. 🙃

If a new topology attribute is added, it's probably best that the special resid-icode selection get moved to that token to avoid the mismatch here between the selection string and the actual attribute.

Would it ever be an issue if the new ResLabels didn't match str(Resids)+(Icodes)? Currently all other topology attributes are independent of each other. For example someone tweaks their icodes attribute and expects that to show up in the reslabels and selection results.

Miscellaneous notes:

RMeli commented 4 years ago

I agree changing resid to a string is not a viable option (and will have consequences on performance). The problem I was rising here is that resid works correctly on selection, but the attribute stored does not match the selection (resid 1A will correctly select residue 1A, but the resid attribute will only contain 1).

I think issues #2669 and #2672 can be solved by using resindices internally, but the problem of resid not being bijective remains.

PS: I usually compare my selections with PyMol, which seems to act as I expect.

lilyminium commented 4 years ago

Ah, we may have been talking past each other a little. From my perspective then the resid attribute values are fine as is. Instead, it's ResidSelection that has unexpected behaviour, and selecting "resid 1A" should give an error; you should be required to select "resid 1 and icode A" for the 1A residue. Of course this will annoy anyone who uses the resid shorthand...

RMeli commented 4 years ago

Yes, I think I might be interpreting resid in a different way... =)

I interpret it as it is used in select_atoms where if you have residues 1, 1A and 1B the selection resid 1 will only select 1 (in contrast, resnum 1 will select all three). That is, for me resid should be an unique identifier.

They way you interpret resid is the way I interpret resnum; selecting resnum 1A will indeed throw a SelectionError: Selection failed: 'Failed to parse number: 1A' while resnum 1 and icode A works as expected.

orbeckst commented 4 years ago

More general question: do we have a comparison of how different commonly used (bio) comp chem software does selections? What’s common, whats’s different?

IIRC there was also a list of topology attributes on the wiki (and probably now in the User Guide?) that explained what they were supposed to mean. Eg resnum vs resid. It would be important to get the semantics defined for once and all.

RMeli commented 4 years ago

@orbeckst Below I compare MDAnalysis, PyMol and ProDy. I will try to add other comparisons in the coming days, but I already outlined the problems I see with MDAnalysis.

I edited earlier today the description of this issue to note that @lilyminium already added information about using insertion codes with the resid selection in the User Guide - Simple Selections, which partially address the problem.

In the User Guide - Canonical Attributes it is clearly stated that resid is the residue number and resnum is an alias for resid.


Test File

ATOM   1     N   GLY A   1      14.785 -12.720  25.303  1.00 21.06           N
ATOM   2     H   GLY A   1      15.188 -13.267  24.516  1.00  0.00           H
ATOM   3     CA  GLY A   1      13.751 -13.323  26.187  1.00 21.45           C
ATOM   4     HA1 GLY A   1      13.969 -13.130  27.237  1.00  0.00           H
ATOM   5     HA2 GLY A   1      12.758 -12.944  25.947  1.00  0.00           H
ATOM   6     C   GLY A   1      13.869 -14.830  25.859  1.00 21.97           C
ATOM   7     O   GLY A   1      14.414 -15.179  24.801  1.00 22.20           O
ATOM   8     H   LYS A   1A     25.735 -14.238  12.187  1.00 28.77           N
ATOM   9     A   LYS A   1A     25.698 -14.467  13.201  1.00  0.00           H
ATOM   10    CA  LYS A   1A     25.215 -12.920  11.783  1.00 28.80           C
ATOM   11    HA  LYS A   1A     25.025 -12.972  10.711  1.00  0.00           H
ATOM   12    C   LYS A   1A     26.223 -11.798  12.006  1.00 29.74           C
ATOM   13    O   LYS A   1A     26.730 -11.602  13.135  1.00 28.72           O
ATOM   14    CB  LYS A   1A     23.937 -12.618  12.497  1.00 30.64           C
ATOM   15    HB1 LYS A   1A     23.285 -13.448  12.225  1.00  0.00           H
ATOM   16    HB2 LYS A   1A     24.203 -12.697  13.551  1.00  0.00           H
ATOM   17    CG  LYS A   1A     23.015 -11.397  12.444  1.00 31.59           C
ATOM   18    HG1 LYS A   1A     23.519 -10.518  12.845  1.00  0.00           H
ATOM   19    HG2 LYS A   1A     22.700 -11.198  11.420  1.00  0.00           H
ATOM   20    CD  LYS A   1A     21.796 -11.779  13.326  1.00 31.47           C
ATOM   21    HD1 LYS A   1A     21.317 -12.644  12.868  1.00  0.00           H
ATOM   22    HD2 LYS A   1A     22.176 -12.055  14.310  1.00  0.00           H
ATOM   23    CE  LYS A   1A     20.754 -10.743  13.523  1.00 29.40           C
ATOM   24    HE1 LYS A   1A     20.275 -10.535  12.566  1.00  0.00           H
ATOM   25    HE2 LYS A   1A     21.224  -9.834  13.898  1.00  0.00           H
ATOM   26    NZ  LYS A   1A     19.715 -11.194  14.500  1.00 30.69           N
ATOM   27    HZ1 LYS A   1A     19.259 -12.059  14.145  1.00  0.00           H
ATOM   28    HZ2 LYS A   1A     20.165 -11.389  15.417  1.00  0.00           H
ATOM   29    HZ3 LYS A   1A     19.001 -10.446  14.615  1.00  0.00           H
ATOM   1     N   GLN A   1B     30.861  12.781  18.536  1.00 44.73           N
ATOM   2     H   GLN A   1B     30.423  12.110  19.199  1.00  0.00           H
ATOM   3     CA  GLN A   1B     32.085  12.372  17.843  1.00 45.43           C
ATOM   4     HA  GLN A   1B     32.589  13.273  17.493  1.00  0.00           H
ATOM   5     C   GLN A   1B     31.893  11.457  16.657  1.00 46.22           C
ATOM   6     O   GLN A   1B     31.622  11.923  15.525  1.00 46.88           O
TER
HETATM    1  C   LIG     1      17.735 -17.178  22.612  1.00  0.00           C  
HETATM    2  C   LIG     1      18.543 -17.350  21.462  1.00  0.00           C  
HETATM    3  C   LIG     1      19.877 -17.046  21.558  1.00  0.00           C  
HETATM    4  C   LIG     1      20.401 -16.566  22.766  1.00  0.00           C  
HETATM    5  C   LIG     1      19.613 -16.358  23.893  1.00  0.00           C  
HETATM    6  C   LIG     1      20.084 -15.791  25.046  1.00  0.00           C  
HETATM    7  C   LIG     1      19.241 -15.568  26.114  1.00  0.00           C  
HETATM    8  C   LIG     1      17.893 -15.971  26.028  1.00  0.00           C  
HETATM    9  C   LIG     1      17.370 -16.510  24.879  1.00  0.00           C  
HETATM   10  C   LIG     1      18.205 -16.689  23.809  1.00  0.00           C  
HETATM   11  N   LIG     1      21.383 -15.285  25.104  1.00  0.00           N  
HETATM   12  C   LIG     1      21.898 -14.217  24.214  1.00  0.00           C  
HETATM   13  C   LIG     1      22.635 -16.213  25.496  1.00  0.00           C  
HETATM   14  S   LIG     1      15.909 -17.581  22.417  1.00  0.00           S  
HETATM   15  O   LIG     1      15.988 -18.013  21.010  1.00  0.00           O  
HETATM   16  O   LIG     1      14.998 -18.191  23.402  1.00  0.00           O  
HETATM   17  N   LIG     1      14.893 -15.970  22.485  1.00  0.00           N  
HETATM   18  C   LIG     1      15.193 -15.114  21.296  1.00  0.00           C  
HETATM   19  C   LIG     1      16.307 -14.167  21.752  1.00  0.00           C  
HETATM   20  O   LIG     1      16.212 -13.781  22.914  1.00  0.00           O  
HETATM   21  C   LIG     1      13.942 -14.354  20.879  1.00  0.00           C  
HETATM   22  C   LIG     1      13.232 -13.686  22.058  1.00  0.00           C  
HETATM   23  C   LIG     1      11.892 -13.133  21.595  1.00  0.00           C  
HETATM   24  N   LIG     1      11.218 -12.647  22.780  1.00  0.00           N  
HETATM   25  C   LIG     1      11.457 -11.554  23.483  1.00  0.00           C  
HETATM   26  N   LIG     1      12.359 -10.674  23.045  1.00  0.00           N  
HETATM   27  N   LIG     1      10.723 -11.336  24.566  1.00  0.00           N1+
HETATM   28  N   LIG     1      17.280 -13.692  20.830  1.00  0.00           N  
HETATM   29  C   LIG     1      17.043 -14.033  19.327  1.00  0.00           C  
HETATM   30  C   LIG     1      18.421 -14.472  18.809  1.00  0.00           C  
HETATM   31  C   LIG     1      19.610 -13.417  19.194  1.00  0.00           C  
HETATM   32  C   LIG     1      19.696 -13.083  20.765  1.00  0.00           C  
HETATM   33  C   LIG     1      18.351 -12.680  21.310  1.00  0.00           C  
HETATM   34  C   LIG     1      16.593 -13.014  18.550  1.00  0.00           C  
HETATM   35  C   LIG     1      15.941 -13.425  17.179  1.00  0.00           C  
HETATM   36  S   LIG     1      15.783 -14.904  14.458  1.00  0.00           S  
HETATM   37  O   LIG     1      17.488 -11.922  16.287  1.00  0.00           O  
HETATM   38  C   LIG     1      17.059 -15.583  13.508  1.00  0.00           C  
HETATM   39  C   LIG     1      16.864 -13.556  14.739  1.00  0.00           C  
HETATM   40  C   LIG     1      16.825 -12.903  16.107  1.00  0.00           C  
HETATM   41  C   LIG     1      18.146 -14.734  13.451  1.00  0.00           C  
HETATM   42  N   LIG     1      18.049 -13.554  14.106  1.00  0.00           N  
END

MDAnalysis

Stable

PR #2673

Problem?

resid behaves differently in u.select_atoms("resid 1") (where resid takes the insertion codes into account) and u.select_atoms("same resid as (around 2 resname LIG)") (where insertion codes are ignored).

In u.select_atoms("resid 1"), resid is not interchangeable with resnum; in u.select_atoms("same resid as (around 2 resname LIG)"), resid is interchangeable with resnum.

In [3]: s1 = u.select_atoms("same resnum as (around 2 resname LIG)")

In [4]: s2 = u.select_atoms("same resid as (around 2 resname LIG)")

In [5]: s1 == s2
Out[5]: True

In [6]: s3 = u.select_atoms("resid 1")

In [7]: s4 = u.select_atoms("resnum 1")

In [8]: s3 == s4
Out[8]: False

PyMol

ProDy

ProDy has an interesting take:

Question: is there a way to select an empty insertion code using the icode selection keyword in MDAnalysis?

lilyminium commented 4 years ago

@RMeli no, I don't think there's a way to select an empty string. Thanks for the thorough summary of the issue!

Edit: I didn't see residue 1B in the test file there. The potential residues for selection in every point below are [Gly1, Lys1A, Lig1].

To add (all but mdtraj are from skimming google, so feel free to correct):

Bonus The keyword resid variously refers to:

Summing up

richardjgowers commented 4 years ago

It sounds like we need a special dtype for the resid array that then allows comparison and ordering of both integers (26) and strings (“26A”). All working on resids get coerced to the special dtype so it works as expected and all resids include icodes by default. I’ll see if I can make the struct in numpy...

On Sun, Apr 26, 2020 at 08:32, Rocco Meli notifications@github.com wrote:

Reopened #2308 https://github.com/MDAnalysis/mdanalysis/issues/2308.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2308#event-3272649246, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB4QEDWHQIZ2KDL6DJLROPPPXANCNFSM4IIWJPPQ .