debbiemarkslab / EVcouplings

Evolutionary couplings from protein and RNA sequence alignments
http://evcouplings.org
Other
235 stars 76 forks source link

multimer_dists() fails for pdb 5o2u #56

Closed aggreen closed 7 years ago

aggreen commented 7 years ago

Multimer distance calculation fails for hits to the protein P12497 (code and error message below). We believe this is due to pdb id 5o2u, as this is the only structure hit with multiple chains. Structure hits and mapping files are attached.

We have tried the following:

  1. verified that sifts and pdb files exist and do not appear to have anything strange about them: http://files.rcsb.org/view/5o2u.pdb http://mmtf.rcsb.org/v1.0/full/5o2u

  2. Tried calculating the distance map using the by_pdb_id method. This does not return any error.

Potentially of note: this error was originally returned on orchestra but I was able to reproduce it locally. Benni was not able to reproduce the error locally but got a different error that also seems to be related to missing coordinates.

Code:

uid = 'P12497'
s = SIFTS("/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.csv", 
    "/Users/AG/databases/pdb_chain_uniprot_plus_2017_08_03.fa")
selected_structures = s.by_alignment(
    reduce_chains=False, sequence_id=uid, region=(378, 432),
    jackhmmer="/Applications/hmmer-3.1b2-macosx-intel/binaries/jackhmmer",

)
c = multimer_dists(sifts_result=selected_structures,intersect=False)

Error:

IndexError                                Traceback (most recent call last)
<ipython-input-14-2902a00c3a21> in <module>()
----> 1 c = multimer_dists(sifts_result=selected_structures,intersect=False)
      2 c.contacts()

/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in multimer_dists(sifts_result, structures, atom_filter, intersect, output_prefix, model, raise_missing)
    876             # is close in some combination)
    877             distmap_sym = DistanceMap.aggregate(
--> 878                 distmap, distmap.transpose(), intersect=intersect
    879             )
    880             distmap_sym.symmetric = True

/Users/AG/marks_lab_scripts/EVcouplings/evcouplings/compare/distances.py in aggregate(cls, intersect, agg_func, *matrices)
    597             )
    598 
--> 599             new_mat[k][i_agg, j_agg] = m.dist_matrix[i_src, j_src]
    600 
    601         # aggregate

IndexError: arrays used as indices must be of integer (or boolean) type

nucleocapsid_b0.8_mapping1737.txt nucleocapsid_b0.8_mapping1738.txt nucleocapsid_b0.8_structure_hits.txt

thomashopf commented 7 years ago

Grrrrrr this seems to be another round of bizarre SIFTS entries (see the csv file).

pdb_id,pdb_chain,uniprot_ac,resseq_start,resseq_end,coord_start,coord_end,uniprot_start,uniprot_end 5o2u,A,P12493,1,500,None,None,1,500 5o2u,C,P12493,1,500,None,None,1,500

Through mapping (the PDB chain object is queried through seqres numbering), this leads to an empty chain, which in turn creates an empty distance map, leading to the broken indexing situation.

From the look of it, something is broken big-style in the SIFTS-provided pdb_chain_uniprot.csv files that form the basis of the mapping. The previous version from April looks like it was affected too, whereas the original version when I wrote this code doesn't contain a single "None". 65% of entries in the current file contain a None.

Anna, can you please write to the SIFTS people and check what is going on? It looks like a major issue on their side.

I'll also think about how to make the code more robust against issues like this later this week, but ultimately without a proper mapping table, there is no way out of this.

(+1 for nice error log!)

aggreen commented 7 years ago

Interesting. Benni and I noticed this problem but we assumed that because there were Nones in the file from April that it was a normal behavior of the database (especially given that this error doesn't seem to have happened yet). I will email the SIFTS people now and CC you.

Anna

On Wed, Aug 9, 2017 at 3:21 PM, thomashopf notifications@github.com wrote:

Grrrrrr this seems to be another round of bizarre SIFTS entries (see the csv file).

pdb_id,pdb_chain,uniprot_ac,resseq_start,resseqend,coord start,coord_end,uniprot_start,uniprot_end 5o2u,A,P12493,1,500,None,None,1,500 5o2u,C,P12493,1,500,None,None,1,500

Through mapping (the PDB chain object is queried through seqres numbering), this leads to an empty chain, which in turn creates an empty distance map, leading to the broken indexing situation.

From the look of it, something is broken big-style in the SIFTS-provided pdb_chain_uniprot.csv files that form the basis of the mapping. The previous version from April looks like it was affected too, whereas the original version when I wrote this code doesn't contain a single "None". 65% of entries in the current file contain a None.

Anna, can you please write to the SIFTS people and check what is going on? It looks like a major issue on their side.

I'll also think about how to make the code more robust against issues like this later this week, but ultimately without a proper mapping table, there is no way out of this.

(+1 for nice error log!)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/debbiemarkslab/EVcouplings/issues/56#issuecomment-321355718, or mute the thread https://github.com/notifications/unsubscribe-auth/AHimXofJLawBfHlL_U91Lm-5bjvTPCTtks5sWgbAgaJpZM4OxP48 .

thomashopf commented 7 years ago

Okay Anna checked with SIFTS about the None entries, and they are intended like this (if there is no corresponding residue with coordinates, there will be None). This does not affect the Uniprot to SEQRES mapping however, which the code uses for the mapping (the SEQRES to PDB index mapping is based on the residue-level mapping contained in the MMTF files themselves).

So two things here:

1) Short-term solution - code needs to handle cases where the aligned SEQRES sequence does not have 3D coordinates. Only way to figure this out in the current situation is when the 3D structure is loaded and remapped - unfortunately this means the SIFTSResult might indicate there is structural information, which turns out to be false later on. This is what happened with 5o2u.

2) The only proper way out is to have a better SIFTS mapping file that maps between actually observed residue ranges and Uniprot ranges. At this point, these ranges could be checked when the SIFTSResult is generated. Maybe SIFTS would be willing to provide these files for us, or ultimately, we could make them ourselves based on the xml files. The design goal here is that the pipeline does not need access to the xml files.