Closed aggreen closed 7 years ago
Grrrrrr this seems to be another round of bizarre SIFTS entries (see the csv file).
pdb_id,pdb_chain,uniprot_ac,resseq_start,resseq_end,coord_start,coord_end,uniprot_start,uniprot_end 5o2u,A,P12493,1,500,None,None,1,500 5o2u,C,P12493,1,500,None,None,1,500
Through mapping (the PDB chain object is queried through seqres numbering), this leads to an empty chain, which in turn creates an empty distance map, leading to the broken indexing situation.
From the look of it, something is broken big-style in the SIFTS-provided pdb_chain_uniprot.csv files that form the basis of the mapping. The previous version from April looks like it was affected too, whereas the original version when I wrote this code doesn't contain a single "None". 65% of entries in the current file contain a None.
Anna, can you please write to the SIFTS people and check what is going on? It looks like a major issue on their side.
I'll also think about how to make the code more robust against issues like this later this week, but ultimately without a proper mapping table, there is no way out of this.
(+1 for nice error log!)
Interesting. Benni and I noticed this problem but we assumed that because there were Nones in the file from April that it was a normal behavior of the database (especially given that this error doesn't seem to have happened yet). I will email the SIFTS people now and CC you.
Anna
On Wed, Aug 9, 2017 at 3:21 PM, thomashopf notifications@github.com wrote:
Grrrrrr this seems to be another round of bizarre SIFTS entries (see the csv file).
pdb_id,pdb_chain,uniprot_ac,resseq_start,resseqend,coord start,coord_end,uniprot_start,uniprot_end 5o2u,A,P12493,1,500,None,None,1,500 5o2u,C,P12493,1,500,None,None,1,500
Through mapping (the PDB chain object is queried through seqres numbering), this leads to an empty chain, which in turn creates an empty distance map, leading to the broken indexing situation.
From the look of it, something is broken big-style in the SIFTS-provided pdb_chain_uniprot.csv files that form the basis of the mapping. The previous version from April looks like it was affected too, whereas the original version when I wrote this code doesn't contain a single "None". 65% of entries in the current file contain a None.
Anna, can you please write to the SIFTS people and check what is going on? It looks like a major issue on their side.
I'll also think about how to make the code more robust against issues like this later this week, but ultimately without a proper mapping table, there is no way out of this.
(+1 for nice error log!)
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/debbiemarkslab/EVcouplings/issues/56#issuecomment-321355718, or mute the thread https://github.com/notifications/unsubscribe-auth/AHimXofJLawBfHlL_U91Lm-5bjvTPCTtks5sWgbAgaJpZM4OxP48 .
Okay Anna checked with SIFTS about the None entries, and they are intended like this (if there is no corresponding residue with coordinates, there will be None). This does not affect the Uniprot to SEQRES mapping however, which the code uses for the mapping (the SEQRES to PDB index mapping is based on the residue-level mapping contained in the MMTF files themselves).
So two things here:
1) Short-term solution - code needs to handle cases where the aligned SEQRES sequence does not have 3D coordinates. Only way to figure this out in the current situation is when the 3D structure is loaded and remapped - unfortunately this means the SIFTSResult might indicate there is structural information, which turns out to be false later on. This is what happened with 5o2u.
2) The only proper way out is to have a better SIFTS mapping file that maps between actually observed residue ranges and Uniprot ranges. At this point, these ranges could be checked when the SIFTSResult is generated. Maybe SIFTS would be willing to provide these files for us, or ultimately, we could make them ourselves based on the xml files. The design goal here is that the pipeline does not need access to the xml files.
Multimer distance calculation fails for hits to the protein P12497 (code and error message below). We believe this is due to pdb id 5o2u, as this is the only structure hit with multiple chains. Structure hits and mapping files are attached.
We have tried the following:
verified that sifts and pdb files exist and do not appear to have anything strange about them: http://files.rcsb.org/view/5o2u.pdb http://mmtf.rcsb.org/v1.0/full/5o2u
Tried calculating the distance map using the by_pdb_id method. This does not return any error.
Potentially of note: this error was originally returned on orchestra but I was able to reproduce it locally. Benni was not able to reproduce the error locally but got a different error that also seems to be related to missing coordinates.
Code:
Error:
nucleocapsid_b0.8_mapping1737.txt nucleocapsid_b0.8_mapping1738.txt nucleocapsid_b0.8_structure_hits.txt