SomaLogic / SomaScan.db

An R package providing extended biological annotations for the SomaScan Assay, a proteomics platform developed by SomaLogic Operating Co., Inc.
https://somalogic.github.io/SomaScan.db/

Users can't select SwissProt IDs when using UNIPROT key #12

Open amanda-hi opened 9 months ago

amanda-hi commented 9 months ago

The SomaScan menu typically uses reviewed, manually curated UniProt IDs (aka "SwissProt" IDs) in its protein annotations. However, SomaScan.db currently returns all UniProt entries for a given protein, including unreviewed, computationally annotated entries (aka "TrEMBL" IDs). The UniProt website marks which section of the knowledgebase each entry belongs to, but SomaScan.db does not carry that information. As a result, users receive many more UniProt entries than expected from a given query, likely because TrEMBL IDs are returned alongside SwissProt IDs. Ideally, a query should return only SwissProt IDs, since those are the annotations presented in the SomaScan menu.

Example case where this is problematic: A SomaScan user found that mapping his seqIDs to UniProt with the SomaScan.db package returned 13,000 mappings. This is because the UniProt data behind SomaScan.db does in fact contain multiple IDs for the same protein (TrEMBL IDs as well as SwissProt IDs). The SBI/GSE reporter of this issue foresees difficulty in helping customers ‘deduplicate’ this list and reproduce the UniProt IDs in our menu.

Suggestion for solution: Is there metadata carried with the UniProt database that would allow UniProt IDs to be separated into TrEMBL vs. SwissProt? Keys that let you map ‘UniProt_all’, ‘UniProt_Trembl’, and ‘UniProt_SwissProt’ would be really helpful. At a minimum, if it is not already in the git documentation, a line somewhere should alert people to the multiple protein ID issue.
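
To make the requested behavior concrete, here is a minimal sketch of how the three proposed keys could partition a mapping table. It is written in Python purely for illustration and is not SomaScan.db's R API. The function name `split_uniprot_mappings`, the input shapes, and all identifiers in the example are hypothetical. It also assumes that a set of reviewed (SwissProt) accessions has been obtained separately from UniProt, since SomaScan.db does not currently carry that status.

```python
from collections import defaultdict


def split_uniprot_mappings(mappings, reviewed_accessions):
    """Partition (seq_id, accession) pairs into the three proposed views.

    mappings            -- iterable of (seq_id, uniprot_accession) pairs,
                           e.g. the rows returned by an annotation query
    reviewed_accessions -- set of accessions known to be reviewed
                           (SwissProt); assumed to come from UniProt itself
    """
    views = {
        "UniProt_all": defaultdict(list),
        "UniProt_SwissProt": defaultdict(list),
        "UniProt_Trembl": defaultdict(list),
    }
    for seq_id, accession in mappings:
        # Every accession appears in the unfiltered view.
        views["UniProt_all"][seq_id].append(accession)
        # Anything not in the reviewed set is treated as TrEMBL.
        if accession in reviewed_accessions:
            views["UniProt_SwissProt"][seq_id].append(accession)
        else:
            views["UniProt_Trembl"][seq_id].append(accession)
    # Convert to plain dicts so missing seq_ids raise KeyError as usual.
    return {name: dict(table) for name, table in views.items()}


# Hypothetical data: placeholder seqIDs and accessions, not real entries.
example_mappings = [
    ("seq-0001", "ACC_REVIEWED_1"),
    ("seq-0001", "ACC_UNREVIEWED_1"),
    ("seq-0001", "ACC_UNREVIEWED_2"),
    ("seq-0002", "ACC_REVIEWED_2"),
]
example_reviewed = {"ACC_REVIEWED_1", "ACC_REVIEWED_2"}

example_views = split_uniprot_mappings(example_mappings, example_reviewed)
```

In this sketch the ‘UniProt_SwissProt’ view holds one accession per seqID, which is the shape users expect from the SomaScan menu, while ‘UniProt_all’ reproduces the current, larger result.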

Timeline or deadline: A fix for this issue should be included in the 3.19 release of Bioconductor (in April/May 2024).