BioPandas / biopandas

Working with molecular structures in pandas DataFrames
https://BioPandas.github.io/biopandas/
BSD 3-Clause "New" or "Revised" License

Using SIFTS data for renumbering residues to match the Uniprot sequence resids #110

Open mrauha opened 2 years ago

mrauha commented 2 years ago

Hi all,

I stumbled upon this paper describing the mapping of PDB residue IDs to the ones in the sequence deposited in UniProt:

Frustrated by the inconsistencies in numbering, I'm writing some code to output PDB files with residue IDs that match the UniProt sequence, using biopandas for the crunching.

The mmCIF files with the mapped residues can be downloaded from the URL:

https://www.ebi.ac.uk/pdbe/entry-files/download/{pdb_id}_updated.cif
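A minimal sketch of filling in that URL template and fetching the file, using only the standard library. The helper names `sifts_cif_url` and `download_sifts_cif` are hypothetical, and lowercasing the PDB ID is an assumption about what the server expects.

```python
import os
from urllib.request import urlretrieve

# URL template quoted in the post above.
SIFTS_CIF_URL = "https://www.ebi.ac.uk/pdbe/entry-files/download/{pdb_id}_updated.cif"


def sifts_cif_url(pdb_id: str) -> str:
    """Fill the URL template with a PDB ID."""
    # Assumption: IDs are lowercased; stray whitespace is stripped.
    return SIFTS_CIF_URL.format(pdb_id=pdb_id.strip().lower())


def download_sifts_cif(pdb_id: str, dest_dir: str = ".") -> str:
    """Download the updated mmCIF for pdb_id and return the local path."""
    path = os.path.join(dest_dir, f"{pdb_id.strip().lower()}_updated.cif")
    urlretrieve(sifts_cif_url(pdb_id), path)  # needs network access
    return path
```

Calling `download_sifts_cif("1cbs")` would then write `1cbs_updated.cif` to the current directory, provided the entry exists on the server.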

The CIF file is read without problems by the mmCIF parser. The resid matching the one in UniProt is in the column pdbx_sifts_xref_db_num, which is None for entries without a mapping to the sequence, e.g. ligands and UNK residues.
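As a rough illustration of handling that column, the sketch below normalizes one `pdbx_sifts_xref_db_num` value to an `int` or `None`. The helper `parse_sifts_resid` is hypothetical, and the set of "unmapped" markers is an assumption: depending on how the file is parsed, a missing mapping might show up as `None`, the string `"None"`, the mmCIF placeholders `?` or `.`, or NaN.

```python
from typing import Optional

# Assumption: these spellings all mean "no UniProt mapping".
_UNMAPPED = {"", "?", ".", "None", "nan"}


def parse_sifts_resid(value) -> Optional[int]:
    """Return the UniProt residue number as int, or None if unmapped."""
    if value is None:
        return None
    text = str(value).strip()
    if text in _UNMAPPED:
        return None
    try:
        # float() first so that values stored as "27.0" are still accepted.
        return int(float(text))
    except (ValueError, OverflowError):
        return None


values = ["27", "28", "?", None, "None", 30.0]
print([parse_sifts_resid(v) for v in values])
# → [27, 28, None, None, None, 30]
```

With the parsed atom table in a DataFrame `df`, the same helper could be applied with something like `df["pdbx_sifts_xref_db_num"].map(parse_sifts_resid)`.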

This paper/Python code/webserver describes a similar thing using SIFTS:

Residues without a mapping are renumbered using an offset of 5k/50k so that they do not overlap with the new resids of the amino acids.
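One plausible reading of the offset scheme, as a sketch only: mapped residues take their UniProt number, and unmapped ones keep their original number shifted by a fixed offset. The function name `renumber_with_offset` and the default of 5000 are assumptions; the post does not say when 5k versus 50k applies, so the offset is left as a parameter.

```python
from typing import List, Optional


def renumber_with_offset(
    original: List[int],
    mapped: List[Optional[int]],
    offset: int = 5000,  # assumption: pick 5000 or 50000 as needed
) -> List[int]:
    """Use the mapped number where present, else original + offset."""
    if len(original) != len(mapped):
        raise ValueError("original and mapped must have the same length")
    return [
        m if m is not None else o + offset
        for o, m in zip(original, mapped)
    ]


# Two mapped amino acids followed by an unmapped ligand with resid 301.
print(renumber_with_offset([1, 2, 301], [27, 28, None]))
# → [27, 28, 5301]
```

The offset has to exceed the largest mapped resid in the structure for the "no overlap" property to hold, which is presumably why a larger offset is sometimes needed.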

However, occasionally part of a chain consists of UNKs, so I will implement a way to continue the numbering from the UniProt-mapped resids for these.
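A sketch of what such continuous numbering could look like, not the implementation referred to above. It assumes one entry per residue (not per atom), listed in chain order for a single chain; the function name `fill_unmapped_continuous` is hypothetical. Unmapped residues continue counting up from the last mapped residue before them, and unmapped residues at the start of the chain count down from the first mapped one.

```python
from typing import List, Optional


def fill_unmapped_continuous(mapped: List[Optional[int]]) -> List[Optional[int]]:
    """Fill None entries so numbering runs on from neighboring mapped resids."""
    filled = list(mapped)

    # Forward pass: continue counting after the last mapped residue.
    last = None
    for i, value in enumerate(filled):
        if value is not None:
            last = value
        elif last is not None:
            last += 1
            filled[i] = last

    # Backward pass: only leading gaps remain; count down from the first resid.
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is not None:
            nxt = filled[i]
        elif nxt is not None:
            nxt -= 1
            filled[i] = nxt

    return filled


print(fill_unmapped_continuous([None, 10, 11, None, None, 20]))
# → [9, 10, 11, 12, 13, 20]
```

This simple version does not check for collisions: if an UNK stretch is longer than the gap in the UniProt numbering around it, the filled numbers would run into the next mapped resid, so a real implementation would need to detect that case.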

Work in progress - if there's an already existing way to do this, let me know :)

Ruibin-Liu commented 1 year ago

The missing residues are not matched, which is a caveat for some uses.