Get Domain annotations for a Uniprot ID

Fetch domain annotations for a protein: To map domain boundaries onto a structure, we need the reference sequence in Uniprot and the set of domains that gives the most comprehensive domain boundaries. By comprehensive domains, we mean those from the key resource we plan to use for ProteomeScout (e.g., Interpro).

Solution: Given a Uniprot ID, we want this to return:

Protein Sequence
Tuples (as in ProteomeScout) of the form [name, domain_ID, start, stop], where start and stop refer to positions in the protein sequence (i.e., one-based counting).
Domain names should be consistent across repeats and not numbered. For example, if there are two SH2 domains, the name should be SH2 for both (rather than SH2_1 and SH2_2).
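
The return values described above could be sketched roughly as follows. This is only an illustration, not a decided design: the two URLs, the function names, and the assumed shape of the Interpro JSON (`results[].metadata` with `accession`, `name`, `type`, and `proteins[].entry_protein_locations[].fragments[]` with `start`/`end`) are assumptions that should be checked against the current Interpro and Uniprot API documentation. Pagination of results is not handled.

```python
import json
import re
import urllib.request

# Assumed endpoints (not stated in this issue); verify before relying on them.
INTERPRO_URL = "https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/{acc}"
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{acc}.fasta"


def parse_fasta_sequence(fasta_text):
    """Return the sequence from single-record FASTA text, dropping the header line."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def strip_numbering(name):
    """Drop a trailing _<number> so repeated domains share one name (SH2_1 -> SH2)."""
    return re.sub(r"_\d+$", "", name)


def domains_from_interpro(payload):
    """Build [name, domain_ID, start, stop] lists, sorted by position.

    Assumes the payload shape described in the lead-in; start and stop are
    taken as one-based, inclusive positions in the protein sequence.
    """
    domains = []
    for result in payload.get("results", []):
        meta = result["metadata"]
        if meta.get("type") != "domain":
            continue  # skip families, sites, and other entry types
        name = strip_numbering(meta["name"])
        for protein in result.get("proteins", []):
            for location in protein.get("entry_protein_locations", []):
                for frag in location.get("fragments", []):
                    domains.append([name, meta["accession"], frag["start"], frag["end"]])
    return sorted(domains, key=lambda d: (d[2], d[3]))


def fetch_text(url):
    """Download a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def get_domains(uniprot_id):
    """Return (sequence, domains) for one Uniprot ID (hypothetical entry point)."""
    sequence = parse_fasta_sequence(fetch_text(UNIPROT_FASTA_URL.format(acc=uniprot_id)))
    body = fetch_text(INTERPRO_URL.format(acc=uniprot_id))
    payload = json.loads(body) if body.strip() else {}  # empty body: no entries
    return sequence, domains_from_interpro(payload)
```

Keeping the parsing functions separate from the network calls means they can be tested against saved responses without contacting either database.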

Describe alternatives you've considered: We previously mapped domains from ProteomeScout, which harmonized Uniprot and Pfam. We want to move away from ProteomeScout while also minimizing the total number of databases we have to access.

Tasks

Include specific tasks in the order they need to be done in. Include links to specific lines of code where the task should happen at.

[ ] Compare the available access methods and types of domain boundaries, and decide which are useful for human proteins.
[ ] Add that access code to this repository and include testing conditions.
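
As one possible starting point for the testing conditions, the checks below follow directly from the requirements in this issue (one-based boundaries inside the sequence, no numbered domain names). The function names are hypothetical and the checks are a minimal sketch, not an agreed test plan.

```python
import re


def check_domains(sequence, domains):
    """Return a list of problems found in [name, domain_ID, start, stop] entries."""
    problems = []
    for name, domain_id, start, stop in domains:
        # One-based, inclusive coordinates must fall inside the sequence.
        if not (1 <= start <= stop <= len(sequence)):
            problems.append(
                f"{name} ({domain_id}): {start}-{stop} outside 1..{len(sequence)}"
            )
        # Repeated domains should share a name rather than carry _1, _2 suffixes.
        if re.search(r"_\d+$", name):
            problems.append(f"{name} ({domain_id}): name is numbered")
    return problems


def domain_sequence(sequence, start, stop):
    """Slice a domain out of the sequence using one-based, inclusive positions."""
    return sequence[start - 1:stop]
```

An empty list from `check_domains` means every entry passed; `domain_sequence` makes the one-based convention explicit when converting to Python's zero-based slicing.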


NaegleLab / CoDIAC

Get Domain annotations for a Uniprot ID #1
