biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/
Other
4.31k stars 1.74k forks source link

PDB: Calling __getitem__ with ints should work for non amino acids #2348

Open Tillsten opened 4 years ago

Tillsten commented 4 years ago

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows:

3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)] CPython Windows-10-10.0.14393-SP0 1.75

Expected behaviour

Assuming I have PDB chain with a non-amino acid at resseq=100. If I call chain[100] I would like to get that residue. Instead I get a KeyError, since the internally used def _translate_id(self, id) generates the wrong ID.

peterjc commented 4 years ago

Do you have a proposal which would fix this? Also a specific example, ideally from the PDB, would be very useful - ideally a small one if we were to use it as a test case.

Tillsten commented 4 years ago

The easy solution would be just to modifiy def _translate_id(self, id) to just loop through the whole chain if the inital try fails. Not tested modication:

    def _translate_id(self, id):        
        if isinstance(id, int):
            trans_id = (" ", id, " ")
            if trans_id in self:
                return trans_id
            else:
                for r in self:
                    if r.id[1] == id:
                        return r.id                        
        return id
JoaoRodrigues commented 4 years ago

Modified aminoacids are usually defined as HETATM (e.g. PCA in 5uiw), so the code is doing the right thing: you'd have to write chain[('H', 100, '')] for residue 1. The chain[100] shortcut is only there because most residues are canonical amino acids and both the hetatm and insertion code of the id tuple are blank, so it would be super annoying to type chain[('', 100, '')].

Further, the proposed solution raises a problem when two residues share the same number.

Do you have an example where a modified amino acid is defined as ATOM?

Tillsten commented 4 years ago

My point is that the shortcut works for ATOMs but not for HETATM. Supplying the full id works but is quite unpractical, as it is for normal amino acids. IMO If they have as resseq one should also be able to select them by that. This is also the way all molecular viewers I know work.

I don't what happens when two normal residues have the same ID but copying that behavior for HETATM should not be difficult.

If you still think the current behavior is as intended, feel free to close the issue.

JoaoRodrigues commented 4 years ago

Revisiting this. We could add a shortcut to try and find a HETATM if the ATOM is not found, so something like:

    def _translate_id(self, id):        
        if isinstance(id, int):
            # Try ATOM first
            trans_id = (" ", id, " ")
            if trans_id in self:
                return trans_id

            # Try HETATM
            trans_id = ("H", id, " ")
            if trans_id in self:
                return trans_id
        return id

This is not perfect (you cannot use it to select water molecules for instance) and in case there is a conflict (HETATM and ATOM with the same resid) it defaults to the ATOM, so it might be confusing sometimes. But it would be more practical in many situations. I checked a few PDB files manually and there wasn't overlap in numbering.