Hi all,
I stumbled upon this paper describing the mapping of PDB residue IDs to the ones in the sequence deposited in UniProt:
Choudhary, P.; Anyango, S.; Berrisford, J.; Varadi, M.; Tolchard, J.; Velankar, S. Unified Access to up-to-Date Residue-Level Annotations from UniProt and Other Biological Databases for PDB Data via PDBx/mmCIF Files. bioRxiv, 2022, 2022.08.10.503473. https://doi.org/10.1101/2022.08.10.503473.
Frustrated by the inconsistencies in numbering, I'm writing some code to output PDB files with residue IDs that match the UniProt sequence, using biopandas for the crunching.
The mmCIFs with the mapped residues can be downloaded from the URL:
https://www.ebi.ac.uk/pdbe/entry-files/download/{pdb_id}_updated.cif
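A minimal sketch of fetching one of these files, using only the URL template quoted in this post. The function names are made up for illustration, and lower-casing the PDB ID is my assumption about what the server expects.

```python
from urllib.request import urlretrieve

# URL template as given in the post; {pdb_id} is the 4-character PDB code.
PDBE_UPDATED_CIF = (
    "https://www.ebi.ac.uk/pdbe/entry-files/download/{pdb_id}_updated.cif"
)


def updated_cif_url(pdb_id: str) -> str:
    """Fill the PDB ID into the template (lower-casing is an assumption)."""
    return PDBE_UPDATED_CIF.format(pdb_id=pdb_id.strip().lower())


def download_updated_cif(pdb_id: str, out_path: str) -> str:
    """Download the updated mmCIF to out_path and return that path."""
    urlretrieve(updated_cif_url(pdb_id), out_path)
    return out_path
```

For example, `download_updated_cif("1cbs", "1cbs_updated.cif")` would save the file next to the script, provided the entry exists on the server.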
The CIF file reads nicely with the mmCIF parser. The resid matching the one in UniProt is in the column pdbx_sifts_xref_db_num, which gives None for residues without a mapping to the sequence, e.g. ligands and UNKs.
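Roughly how the reading step could look, assuming the parser meant here is biopandas' `PandasMmcif`. The helper names and the `uniprot_resid` column are hypothetical, and the set of "unmapped" placeholders is a guess that covers None plus the usual mmCIF `?` and `.` markers.

```python
def parse_xref_num(value):
    """Return the UniProt residue number as an int, or None if unmapped."""
    if value is None:
        return None
    text = str(value).strip()
    # Placeholders treated as "no mapping" (assumed list, extend as needed).
    if text in {"", "?", ".", "None", "nan"}:
        return None
    return int(float(text))


def read_mapped_atoms(cif_path):
    """Read ATOM records and add a parsed 'uniprot_resid' column."""
    # Imported lazily so parse_xref_num can be used without biopandas.
    from biopandas.mmcif import PandasMmcif

    atoms = PandasMmcif().read_mmcif(cif_path).df["ATOM"].copy()
    # Unmapped rows end up as missing values in the new column.
    atoms["uniprot_resid"] = atoms["pdbx_sifts_xref_db_num"].map(parse_xref_num)
    return atoms
```

Ligands usually sit in the `HETATM` frame rather than `ATOM`, so the same parsing would have to be applied there as well.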
This paper (with Python code and a webserver) describes a similar approach using SIFTS:
Faezov, B.; Dunbrack, R. L., Jr. PDBrenum: A Webserver and Program Providing Protein Data Bank Files Renumbered according to Their UniProt Sequences. PLoS One 2021, 16 (7), e0253411. https://doi.org/10.1371/journal.pone.0253411.
Residues without a mapping are renumbered using an offset of 5k/50k so that they don't overlap with the new resids of the amino acids.
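A small sketch of the offset idea in plain Python. The rule for choosing between 5k and 50k is my assumption (use the larger one when a mapped number already reaches 5000), and the function name is made up.

```python
def renumber_with_offset(residues, small_offset=5000, large_offset=50000):
    """Renumber one chain.

    residues: list of (original_resid, uniprot_resid_or_None) tuples.
    Mapped residues take their UniProt number; unmapped ones keep their
    original number shifted by an offset.
    """
    mapped = [uni for _, uni in residues if uni is not None]
    # Assumed rule: fall back to the large offset if mapped numbers
    # already reach into the range the small offset would use.
    if max(mapped, default=0) < small_offset:
        offset = small_offset
    else:
        offset = large_offset
    return [uni if uni is not None else orig + offset for orig, uni in residues]
```

With `[(1, 27), (2, 28), (301, None)]` the two mapped residues become 27 and 28, and the unmapped one becomes 5301.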
However, occasionally part of the chain consists of UNKs, so I will implement a way to use continuous numbering with respect to the UniProt-mapped resids for these.
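One possible reading of "continuous numbering", as a sketch only: unmapped residues inside the chain continue counting from the previous mapped residue, and unmapped residues at the start count backwards from the first mapped one. The function name is hypothetical, and collisions are not handled (a run of UNKs longer than the gap in the UniProt numbering would clash with the mapped residues that follow).

```python
def renumber_continuous(uniprot_resids):
    """Fill unmapped positions so numbering runs on from mapped neighbours.

    uniprot_resids: UniProt numbers in chain order, None where unmapped.
    Returns a new list; a chain with no mapped residue is returned as-is.
    """
    new = list(uniprot_resids)

    # Forward pass: continue counting after the last number assigned.
    last = None
    for i, num in enumerate(new):
        if num is not None:
            last = num
        elif last is not None:
            last += 1
            new[i] = last

    # Leading unmapped residues: count backwards from the first mapped one.
    first = next((i for i, num in enumerate(new) if num is not None), None)
    if first is not None:
        for i in range(first - 1, -1, -1):
            new[i] = new[i + 1] - 1
    return new
```

For `[None, 10, 11, None, None, 14, None]` this gives `[9, 10, 11, 12, 13, 14, 15]`.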
Work in progress - if there's an already existing way to do this, let me know :)