debbiemarkslab / EVcouplings

Evolutionary couplings from protein and RNA sequence alignments
http://evcouplings.org

Create SIFTS mapping file with segments with structural coverage #64

Closed thomashopf closed 6 years ago

thomashopf commented 7 years ago

We had a response from the SIFTS team that they will provide a segment mapping file that maps consecutive Uniprot sequence segments to segments with structural coverage later this year or early next year. In the meantime, we should try to create such a file ourselves based on the residue-level xml files. We could then distribute these precomputed files using the evcouplings_dbupdate tool and as downloads on the lab website.

Requirements

Implementation

  1. Fetch the xml files for the entire PDB from ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/, ideally using rsync, and write all parsing code so that it only needs to look at entries that were updated between releases
  2. Write an XML parser that extracts segments and stores them in a table; in the process note the following caveats (there will be more...):
    • PDB residue indices can contain insertion codes (i.e. they are strings, not integers)
    • One chain can contain different Uniprot entries (i.e. one chain can lead to multiple rows in the output table, and each one can have a different Uniprot identifier)
    • there is some initial parsing code in the old folding_dev pipeline, but the XML parser used there consumes insane amounts of RAM very quickly, so the parser should be chosen carefully.
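
A minimal sketch of what step 2 could look like, not the actual EVcouplings implementation. It streams the file with `xml.etree.ElementTree.iterparse` to keep memory use low, keeps PDB residue indices as strings, and starts a new output row whenever the chain, the Uniprot accession, or Uniprot index continuity changes. The element and attribute names (`residue`, `crossRefDb`, `dbSource`, `dbResNum`, `dbChainId`, `dbAccessionId`), the `"null"` marker for unobserved residues, the sample data, and the output column names are all assumptions and would need to be checked against the real SIFTS xml files.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace, e.g. '{uri}residue' -> 'residue'."""
    return tag.rsplit("}", 1)[-1]


def iter_residue_pairs(source):
    """Stream (chain, pdb_resnum, uniprot_ac, uniprot_resnum) tuples.

    ASSUMPTION: every <residue> has <crossRefDb> children with dbSource,
    dbAccessionId, dbResNum and (for PDB) dbChainId attributes.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "residue":
            continue
        refs = {r.get("dbSource"): r for r in elem
                if local(r.tag) == "crossRefDb"}
        pdb, uni = refs.get("PDB"), refs.get("UniProt")
        # ASSUMPTION: residues without coordinates carry dbResNum="null"
        if pdb is not None and uni is not None and pdb.get("dbResNum") != "null":
            # PDB index stays a string because of insertion codes ("10A")
            yield (pdb.get("dbChainId"), pdb.get("dbResNum"),
                   uni.get("dbAccessionId"), int(uni.get("dbResNum")))
        elem.clear()  # drop the processed subtree to limit memory use


def merge_segments(pairs):
    """Collapse residue pairs into rows of consecutive Uniprot positions."""
    segments = []
    for chain, pdb_num, ac, uni_num in pairs:
        last = segments[-1] if segments else None
        if (last is not None and last["chain"] == chain
                and last["uniprot_ac"] == ac
                and last["uniprot_end"] + 1 == uni_num):
            last["pdb_end"], last["uniprot_end"] = pdb_num, uni_num
        else:
            segments.append({"chain": chain, "uniprot_ac": ac,
                             "pdb_start": pdb_num, "pdb_end": pdb_num,
                             "uniprot_start": uni_num, "uniprot_end": uni_num})
    return segments


# Made-up sample data (hypothetical namespace and identifiers)
SAMPLE = b"""<entry xmlns="urn:example"><entity><segment><listResidue>
<residue><crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="10" dbChainId="A"/>
<crossRefDb dbSource="UniProt" dbAccessionId="P00001" dbResNum="5"/></residue>
<residue><crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="10A" dbChainId="A"/>
<crossRefDb dbSource="UniProt" dbAccessionId="P00001" dbResNum="6"/></residue>
<residue><crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="null" dbChainId="A"/>
<crossRefDb dbSource="UniProt" dbAccessionId="P00001" dbResNum="7"/></residue>
<residue><crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="12" dbChainId="A"/>
<crossRefDb dbSource="UniProt" dbAccessionId="P00001" dbResNum="8"/></residue>
<residue><crossRefDb dbSource="PDB" dbAccessionId="1abc" dbResNum="13" dbChainId="A"/>
<crossRefDb dbSource="UniProt" dbAccessionId="Q00002" dbResNum="1"/></residue>
</listResidue></segment></entity></entry>"""

segments = merge_segments(iter_residue_pairs(io.BytesIO(SAMPLE)))
for row in segments:
    print(row)
```

Because both functions work on a stream of residues, the same code could be run per entry and restricted to the files that rsync reports as changed between releases.
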
aggreen commented 6 years ago

Any updates on this? Seems like we could just wait for their updated file to be provided at this point.

thomashopf commented 6 years ago

Can you please inquire with the SIFTS people about the status of this table (iirc, I cc'ed you on that email?)

sacdallago commented 6 years ago

Just for the time being @aggreen , don't kill me ahah :)

thomashopf commented 6 years ago

I wrote to the SIFTS people the other day to ask if there is any update on this. In case no such table will be directly available from SIFTS, this is very high priority for correct functioning of the compare stage.

thomashopf commented 6 years ago

SIFTS will provide those tables for us, which is awesome. As soon as the table goes live on their website, I'll update the code to make full use of it.