boscoh / uniprot

retrieve protein sequence identifiers and metadata from http://uniprot.org

added support for TrEMBL format uniprot accessions #6

Closed oxpeter closed 2 years ago

oxpeter commented 7 years ago

The original regex only recognised SwissProt-format accessions, and using string truncation to identify variants could lead to unwanted behavior: TrEMBL accessions were converted into legitimate (but different) SwissProt accessions.

This commit therefore both updates the regex to recognise TrEMBL strings and adds a new function that correctly parses the isoform variant number from any uniprot accession, replacing the old truncation approach (i.e., no more seqid[:6]).
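For anyone reading along, the truncation problem can be reproduced with a small sketch. The regex below is the accession pattern published in the UniProt help pages, which is an assumption here (the regex in this commit may differ), and `A0A022YWF9-2` is just an illustrative seqid in the 10-character format.

```python
import re

# Accession pattern as published in the UniProt help pages (assumed here;
# the regex used in the actual commit may differ).
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

seqid = "A0A022YWF9-2"  # 10-character accession with an isoform suffix

# Old approach: keep the first six characters.
truncated = seqid[:6]
# The result is still a well-formed 6-character accession,
# but it is not the accession we started with.
truncated_is_well_formed = ACCESSION.fullmatch(truncated) is not None

# Splitting on the isoform separator keeps the full accession instead.
base, _, isoform = seqid.partition("-")
```

Here `truncated` is `A0A022`, which passes the format check even though the intended accession is `A0A022YWF9`, so the error goes unnoticed downstream.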

Added the function clean_uniprot() to this end.

Also added clean_uniprot_list(), which takes a list of seqids and returns the corresponding list of accessions.
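As a rough illustration of the idea (not the code in this PR), the two functions could look something like the sketch below. The signatures, the return values, and the handling of unparseable seqids are all assumptions, and the regex is the accession pattern from the UniProt help pages rather than the one in the commit.

```python
import re

# Assumption: accession pattern from the UniProt help pages, with an
# optional "-N" isoform suffix captured separately.
_ACCESSION_RE = re.compile(
    r"(?P<acc>[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?:-(?P<isoform>[0-9]+))?"
)


def clean_uniprot(seqid):
    """Return (accession, isoform) for a seqid, or (None, None) if invalid.

    Hypothetical signature: isoform is an int, or None when no suffix is given.
    """
    match = _ACCESSION_RE.fullmatch(seqid.strip())
    if match is None:
        return None, None
    isoform = match.group("isoform")
    return match.group("acc"), (int(isoform) if isoform else None)


def clean_uniprot_list(seqids):
    """Return the base accession for every seqid that parses; skip the rest."""
    accessions = []
    for seqid in seqids:
        accession, _ = clean_uniprot(seqid)
        if accession is not None:
            accessions.append(accession)
    return accessions
```

With this sketch, `clean_uniprot("A0A022YWF9-2")` gives `("A0A022YWF9", 2)`, and `clean_uniprot_list(["P12345-1", "A0A022YWF9", "xyz"])` gives `["P12345", "A0A022YWF9"]`. Silently skipping invalid seqids is one possible choice; raising an error would be equally reasonable.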

boscoh commented 2 years ago

Totally lost this over the years. But looks good to me. Great work!