aqlaboratory / proteinnet

Standardized data set for machine learning of protein structure
MIT License
867 stars 132 forks source link

General discrepancies between ProteinNet and mmCIF/PDB files (biopython) #16

Closed thavlik closed 4 years ago

thavlik commented 4 years ago

I have some code that fetches mmCIF files for each entry in CASP11 using BioPython. A substantial proportion of examples fail the various checks, though some pass. Many of them are missing the corresponding model, and most that have the given model disagree on primary sequence length. Perhaps there is an obvious explanation for this, and I simply overlooked it. Subtracting one from model_id allows most of the models to be resolved, but many of the primary sequences have significant length mismatch. Most of the files only have the one model, so it's unclear what is exactly is going on here.

#!/usr/bin/python

# imports
import sys
import re

# Constants
NUM_DIMENSIONS = 3

# Functions for conversion from Mathematica protein files to TFRecords
_aa_dict = {
    'A': '0',
    'C': '1',
    'D': '2',
    'E': '3',
    'F': '4',
    'G': '5',
    'H': '6',
    'I': '7',
    'K': '8',
    'L': '9',
    'M': '10',
    'N': '11',
    'P': '12',
    'Q': '13',
    'R': '14',
    'S': '15',
    'T': '16',
    'V': '17',
    'W': '18',
    'Y': '19'
}
_dssp_dict = {
    'L': '0',
    'H': '1',
    'B': '2',
    'E': '3',
    'G': '4',
    'I': '5',
    'T': '6',
    'S': '7'
}
_mask_dict = {'-': '0', '+': '1'}

class switch(object):
    """Switch statement for Python, based on recipe from Python Cookbook."""
    def __init__(self, value):
        self.value = value
        self.fall = False

    def __iter__(self):
        """Return the match method once, then stop"""
        yield self.match

    def match(self, *args):
        """Indicate whether or not to enter a case suite"""
        if self.fall or not args:
            return True
        elif self.value in args:  # changed for v1.5
            self.fall = True
            return True
        else:
            return False

def letter_to_num(string, dict_):
    """ Convert string of letters to list of ints """
    patt = re.compile('[' + ''.join(dict_.keys()) + ']')
    num_string = patt.sub(lambda m: dict_[m.group(0)] + ' ', string)
    num = [int(i) for i in num_string.split()]
    return num

def read_record(file_, num_evo_entries):
    """ Read a Mathematica protein record from file and convert into dict. """

    dict_ = {}

    while True:
        next_line = file_.readline()
        for case in switch(next_line):
            if case('[ID]' + '\n'):
                id_ = file_.readline()[:-1]
                dict_.update({'id': id_})
            elif case('[PRIMARY]' + '\n'):
                primary = letter_to_num(file_.readline()[:-1], _aa_dict)
                dict_.update({'primary': primary})
            elif case('[EVOLUTIONARY]' + '\n'):
                evolutionary = []
                for residue in range(num_evo_entries):
                    evolutionary.append(
                        [float(step) for step in file_.readline().split()])
                dict_.update({'evolutionary': evolutionary})
            elif case('[SECONDARY]' + '\n'):
                secondary = letter_to_num(file_.readline()[:-1], _dssp_dict)
                dict_.update({'secondary': secondary})
            elif case('[TERTIARY]' + '\n'):
                tertiary = []
                for axis in range(NUM_DIMENSIONS):
                    tertiary.append(
                        [float(coord) for coord in file_.readline().split()])
                dict_.update({'tertiary': tertiary})
            elif case('[MASK]' + '\n'):
                mask = letter_to_num(file_.readline()[:-1], _mask_dict)
                dict_.update({'mask': mask})
            elif case('\n'):
                return dict_
            elif case(''):
                return None

if __name__ == '__main__':
    from Bio.PDB.MMCIFParser import MMCIFParser
    from Bio.PDB.PDBParser import PDBParser
    from Bio.PDB import PDBIO, PDBList
    from Bio.PDB.Structure import Structure, Entity
    from Bio.PDB.Model import Model
    from Bio.PDB.Residue import Residue
    from Bio.PDB.Chain import Chain
    from Bio.PDB.Atom import Atom
    import numpy as np

    input_path = 'D:/casp11/training_30'
    print(f'Reading data from {input_path}')
    num_evo_entries = int(sys.argv[2]) if len(
        sys.argv) == 3 else 20  # default number of evo entries

    input_file = open(input_path, 'r')
    pdbl = PDBList()

    while True:
        record = read_record(input_file, num_evo_entries)
        if record is not None:
            id = record["id"]
            primary = record['primary']
            primary_len = len(primary)
            parts = id.split('_')
            if len(parts) != 3:
                # https://github.com/aqlaboratory/proteinnet/issues/1#issuecomment-375270286
                continue
            pdb_id = parts[0]
            path = pdbl.retrieve_pdb_file(pdb_id, pdir='pdb/', file_format='mmCif')
            parser = MMCIFParser()
            structure = parser.get_structure(pdb_id, path)
            model_id = int(parts[1])
            assert model_id >= 0
            chain_id = parts[2]
            if not model_id in structure.child_dict:
                print(f'{pdb_id} lacks model {model_id}, models: {structure.child_dict}')
                continue
            model = structure.child_dict[model_id]
            assert isinstance(model, Model)
            chain = model.child_dict[chain_id]
            if not chain_id in model.child_dict:
                print(f'{pdb_id} model {model_id} lacks chain {chain_id}, chains: {model.child_dict}')
                continue
            chain_len = len(chain.child_list)
            assert isinstance(chain, Chain)
            if chain_len != primary_len:
                print(f'{pdb_id} chain ({chain_len}) and primary ({primary_len}) lengths mismatch')
            else:
                print(f'{pdb_id} is all good!')
        else:
            input_file.close()
            break

Sampled output:

Reading data from D:/casp11/training_30
Structure exists: 'pdb/2yo0.cif'
2YO0 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4jrn.cif'
4JRN lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4hxf.cif'
4HXF lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2l0e.cif'
2L0E is all good!
Structure exists: 'pdb/2kxi.cif'
2KXI is all good!
Structure exists: 'pdb/2o57.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 8744.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 9031.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 9318.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 9580.
  warnings.warn(
2O57 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4lhr.cif'
4LHR lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3zrg.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1020.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1024.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1027.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1097.
  warnings.warn(
3ZRG lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2k0m.cif'
2K0M is all good!
Structure exists: 'pdb/4hhx.cif'
4HHX lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4ld3.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1362.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1382.
  warnings.warn(
4LD3 lacks model 2, models: {0: <Model id=0>}
Structure exists: 'pdb/3wo6.cif'
3WO6 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2kkp.cif'
2KKP is all good!
Structure exists: 'pdb/4jja.cif'
4JJA lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4h5s.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1565.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1733.
  warnings.warn(
4H5S lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4aez.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 16578.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 16692.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 16725.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 16758.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 16819.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 16820.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain G is discontinuous at line 16868.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain H is discontinuous at line 16889.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 16895.
  warnings.warn(
4AEZ lacks model 3, models: {0: <Model id=0>}
Structure exists: 'pdb/4gdo.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3247.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3295.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3336.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 3362.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 3401.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 3403.
  warnings.warn(
4GDO lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3up6.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 4854.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 4927.
  warnings.warn(
3UP6 lacks model 1, models: {0: <Model id=0>}
Downloading PDB structure '3UKW'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3473.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3717.
  warnings.warn(
3UKW lacks model 2, models: {0: <Model id=0>}
Downloading PDB structure '3SEO'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3510.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3511.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3519.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3539.
  warnings.warn(
3SEO lacks model 1, models: {0: <Model id=0>}

Related issue https://github.com/aqlaboratory/proteinnet/issues/13

Thanks for all the good work.

lucidrains commented 4 years ago

@thavlik did you ever find out what was going on here?

thavlik commented 4 years ago

I have not. Right now I'm preferring the information in the PDB files over what is supplied by ProteinNet.

lucidrains commented 4 years ago

@thavlik oh noes :( is there a nice place where you can download all the PDBs used by Casp12 in one compressed file?

alquraishi commented 4 years ago

@lucidrains if you'd like to download the original raw PDBs for CASP12 (just what CASP released, i.e. the ProteinNet test set), you can grab them from the CASP site itself, here.

@thavlik without seeing specific examples of the discrepancy (i.e. a sequence you're getting versus what ProteinNet has), it's difficult to tell what the issue is. My best guess is that you're taking the sequence of the resolved residues in a PDB file--i.e. you're concatenating a bunch of discontiguous segments. This would be incorrect and non-physical. Virtually all PDB files have missing residues and so the sequence in the PDB file has gaps in it. ProteinNet entries have the full sequences, which is the physical object that actually folds, along with mask fields that indicate which residues are missing from the structure, so that you can correctly map the sequence to the structure and also account for the missing residues in the loss function, etc. I am not 100% sure this is what's going on, but that's my best guess like I said. If you can paste a specific sequence from ProteinNet and your pipeline it would be easier to compare the two, and see if you are in fact concatenating discontiguous segments.

thavlik commented 4 years ago

@thavlik without seeing specific examples of the discrepancy (i.e. a sequence you're getting versus what ProteinNet has), it's difficult to tell what the issue is. My best guess is that you're taking the sequence of the resolved residues in a PDB file--i.e. you're concatenating a bunch of discontiguous segments. This would be incorrect and non-physical.

Code with example output is in the original issue. Some structures agree (e.g. 2KKP, 2L0E) but some (e.g. 4LHR) do not have the model specified by ProteinNet. In the examples specifying model id=1, is this referencing a modified version of ?

You bring up a very good point about naively concatenating the residues in the PDB file together. I think we've just been operating on the assumption that a specific ID should have immediate structural correspondence between the training example and PDB file. I will investigate more this weekend.

Thanks again

thavlik commented 4 years ago

In the examples specifying model id=1, is this referencing a modified version of ?

After more digging, this turned out to be correct. Many PDBs have waters and other garbage residues included in Model0ChainA. The ProteinNet record will reference a Model1ChainA, which does not exist in the PDB. It can be derived by normalizing Model0ChainA according to the primary sequence / mask. Here is a concrete example with 4JRN_1_A - it is the PDB primary sequence, followed by ProteinNet's primary sequence and mask:

SELVFEKADSGCVIGKRILAHMQELENSERLDRILTVAAWPPDVPKRFVSVTTGETRTLVRGAPLGSGGFATVYEATDVETNEELAVKVFMSEKEPTDETMLDLQRESSCYRNFSLAKTAKDAQESCRFMVPSDVVMLEGQPASTEVVIGLTTRWVPNYFLLMMRAEADMSKVISWVFGDASVNKSEFGLVVRMYLSSQAIKLVANVQAQGIVHTDIKPANFLLLKDGRLFLGDFGTYRINNSVGRGTPGYEPPERPGITYTFPTDAWQLGITLYCIWCKERPTPADGIWDYLHFADCPSTPELVQDLIRSLLNRDPQKRMLPLQALETAAFKEMDSVVKGAAQNFEQQ

GAHMSELVFEKADSGCVIGKRILAHMQEQIGQPQALENSERLDRILTVAAWPPDVPKRFVSVTTGETRTLVRGAPLGSGGFATVYEATDVETNEELAVKVFMSEKEPTDETMLDLQRESSCYRNFSLAKTAKDAQESCRFMVPSDVVMLEGQPASTEVVIGLTTRWVPNYFLLMMRAEADMSKVISWVFGDASVNKSEFGLVVRMYLSSQAIKLVANVQAQGIVHTDIKPANFLLLKDGRLFLGDFGTYRINNSVGRAIGTPGYEPPERPFQATGITYTFPTDAWQLGITLYCIWCKERPTPADGIWDYLHFADCPSTPELVQDLIRSLLNRDPQKRMLPLQALETAAFKEMDSVVKGAAQNFEQQEHLHTE
----++++++++++++++++++++++++-------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--+++++++++++----++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------

Put these all on one line in a text editor, visualize the relationship. The length of the normalized chain should be equal to the number of + signs in the mask.

HUGE thank you, @alquraishi. ProteinNet is a significant contribution to the community. I'll open another issue if there is any additional confusion.

lucidrains commented 4 years ago

@thavlik @alquraishi thank you both!