NaegleLab / CoDIAC

Get Domain annotations for a Uniprot ID #1

Closed knaegle closed 1 year ago

knaegle commented 1 year ago

Fetch domain annotations for a protein

To map domain boundaries onto a structure, we need the UniProt reference sequence and a set of domains with the most comprehensive boundaries available. "Comprehensive" here means the domains come from the key resource we plan to use for ProteomeScout (e.g., InterPro).

Solution

Given a UniProt ID, we want this to return:

  1. Protein sequence
  2. Domain tuples (as in ProteomeScout) of the form [name, domain_ID, start, stop], where start and stop are positions in the protein sequence (i.e., one-based counting).
  3. Domain names should be unique to the domain type and not numbered. For example, if there are two SH2 domains, the name should be SH2 for both (rather than SH2_1 and SH2_2).
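
The return format above could be sketched roughly as follows. This is only an illustration of the tuple shape and the naming rule, not CoDIAC's implementation: the names `Domain`, `strip_numbering`, `build_domain_tuples`, and `domain_sequence` are hypothetical, the raw annotations are assumed to have already been fetched from whichever resource is chosen, and the assumption that numbering always appears as a trailing `_N`, `-N`, or ` N` suffix may not hold for every database.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Domain(NamedTuple):
    """One domain annotation: [name, domain_ID, start, stop]."""
    name: str
    domain_id: str
    start: int  # one-based, inclusive
    stop: int   # one-based, inclusive


# Assumption: numbering is a trailing separator plus digits ("SH2_1", "SH2 2").
# A bare name such as "SH2" has no separator before the digit, so it is kept.
_NUMBER_SUFFIX = re.compile(r"[ _-]\d+$")


def strip_numbering(name: str) -> str:
    """Return the domain name without a trailing instance number."""
    return _NUMBER_SUFFIX.sub("", name.strip())


def build_domain_tuples(
    sequence: str,
    raw_domains: Iterable[Tuple[str, str, int, int]],
) -> List[Domain]:
    """Normalize raw (name, domain_ID, start, stop) records.

    Coordinates are assumed to be one-based and inclusive already; records
    that fall outside the sequence are rejected rather than silently clipped.
    """
    domains = []
    for name, domain_id, start, stop in raw_domains:
        if not (1 <= start <= stop <= len(sequence)):
            raise ValueError(
                f"{name}: range {start}-{stop} is outside a sequence "
                f"of length {len(sequence)}"
            )
        domains.append(Domain(strip_numbering(name), domain_id, start, stop))
    # Sort by position so repeated domains stay distinguishable by order.
    return sorted(domains, key=lambda d: d.start)


def domain_sequence(sequence: str, domain: Domain) -> str:
    """Slice a one-based, inclusive domain range out of the sequence."""
    return sequence[domain.start - 1 : domain.stop]
```

Calling `build_domain_tuples` with two records named `SH2_1` and `SH2_2` would yield two `Domain` entries both named `SH2`, told apart only by their start and stop positions, which matches the naming rule in item 3.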

Describe alternatives you've considered

We previously mapped domains from ProteomeScout, which harmonized UniProt and Pfam. We want to move away from ProteomeScout while also minimizing the total number of databases we have to access.
