BioComputingUP / IDP-KG

Scripts and notebooks for generating and analysing the IDP-KG.
https://biocomputingup.github.io/IDP-KG/
Apache License 2.0

Query to find IDPs missing manual curation in DisProt #24

Open AlasdairGray opened 2 years ago

AlasdairGray commented 2 years ago

Which proteins have predicted disordered regions in MobiDB but are not yet manually annotated in DisProt? The same question applies to PED.

Basically, this means computing intersections and unions across the resources in order to pinpoint which proteins are missing annotation in DisProt.
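The comparison described above can be sketched with plain set operations: the proteins of interest are those in the union of MobiDB and PED, minus those already in DisProt. This is only a minimal illustration, assuming each resource has already been reduced to a set of UniProt accessions. The function name and the `ACC…` values are hypothetical placeholders, not taken from the IDP-KG code or data.

```python
def missing_from_disprot(disprot, mobidb, ped):
    """Return accessions present in MobiDB or PED but absent from DisProt."""
    # Union of the two candidate resources, then set difference with DisProt.
    return (mobidb | ped) - disprot


# Placeholder accession sets (assumption: real sets would come from the IDP-KG).
disprot = {"ACC1", "ACC2"}
mobidb = {"ACC1", "ACC3", "ACC4"}
ped = {"ACC2", "ACC4", "ACC5"}

missing = missing_from_disprot(disprot, mobidb, ped)
print(sorted(missing))  # ['ACC3', 'ACC4', 'ACC5']
```

The intersection `mobidb & ped & disprot` would give the proteins covered by all three resources, which may be a useful sanity check alongside the difference.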

AlasdairGray commented 2 years ago

I've got a first draft of this query working. It identifies 663 proteins in the IDP-KG that are in MobiDB or PED but not in DisProt.

@ivanmicetic what information would be useful to return? Below are some results from the query. If you look at rows 9 and 66, are the UniProt IDs from PED useful here (some entries will have more than one) or would it be more useful to return the PED ID?

| | protein | pName | sourceIDs |
|---|---|---|---|
| 1 | https://idpcentral.org/id/A0A0H2W778 | | "https://identifiers.org/mobidb:A0A0H2W778" |
| 2 | https://idpcentral.org/id/A1L0Z0 | "Mediator of RNA polymerase II transcription subunit 1" | "https://identifiers.org/mobidb:A1L0Z0" |
| 3 | https://idpcentral.org/id/A1Z9S6 | | "https://identifiers.org/mobidb:A1Z9S6" |
| 4 | https://idpcentral.org/id/A8MT69 | "Centromere protein X" | "https://identifiers.org/mobidb:A8MT69" |
| 5 | https://idpcentral.org/id/H0USY8 | | "https://identifiers.org/mobidb:H0USY8" |
| 6 | https://idpcentral.org/id/O00255 | "Menin" | "https://identifiers.org/mobidb:O00255" |
| 7 | https://idpcentral.org/id/O00267 | "Transcription elongation factor SPT5" | "https://identifiers.org/mobidb:O00267" |
| 8 | https://idpcentral.org/id/O00327 | "Aryl hydrocarbon receptor nuclear translocator-like protein 1" | "https://identifiers.org/mobidb:O00327" |
| 9 | https://idpcentral.org/id/O00401 | "Neural Wiskott-Aldrich syndrome protein" | "https://identifiers.org/mobidb:O00401,https://identifiers.org/uniprot:O00401" |
| 10 | https://idpcentral.org/id/O00482 | "Nuclear receptor subfamily 5 group A member 2" | "https://identifiers.org/mobidb:O00482" |
| 66 | https://idpcentral.org/id/O75475 | "PC4 and SFRS1-interacting protein" | "https://identifiers.org/mobidb:O75475,https://identifiers.org/uniprot:Q7Z3K3,https://identifiers.org/uniprot:Q7Z3K3,https://identifiers.org/uniprot:O75475,https://identifiers.org/uniprot:O75475,https://identifiers.org/uniprot:Q7Z3K3,https://identifiers.org/uniprot:Q7Z3K3,https://identifiers.org/uniprot:O75475,https://identifiers.org/uniprot:O75475" |

ivanmicetic commented 2 years ago

Well, we are interested in which proteins are present in, or missing from, each dataset. Therefore, I would prefer UniProt IDs over internal resource IDs (DisProt/PED/MobiDB; the MobiDB IDs are the same as UniProt IDs anyway).
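If UniProt IDs are what gets returned, the repeated entries in a `sourceIDs` value such as the one in row 66 would need collapsing. A minimal post-processing sketch follows, assuming `sourceIDs` is a comma-separated string of identifiers.org IRIs as shown in the results above. The function name `uniprot_accessions` is hypothetical and not part of the IDP-KG code.

```python
UNIPROT_PREFIX = "https://identifiers.org/uniprot:"


def uniprot_accessions(source_ids):
    """Return the distinct UniProt accessions in a comma-separated IRI string.

    Order of first appearance is preserved; IRIs without the UniProt prefix
    (for example the mobidb: ones) are skipped.
    """
    seen = {}  # dict keys keep insertion order and drop duplicates
    for iri in source_ids.split(","):
        iri = iri.strip()
        if iri.startswith(UNIPROT_PREFIX):
            seen.setdefault(iri[len(UNIPROT_PREFIX):], None)
    return list(seen)


# Shortened form of the sourceIDs value from row 66 of the results.
row_66 = (
    "https://identifiers.org/mobidb:O75475,"
    "https://identifiers.org/uniprot:Q7Z3K3,"
    "https://identifiers.org/uniprot:Q7Z3K3,"
    "https://identifiers.org/uniprot:O75475,"
    "https://identifiers.org/uniprot:O75475"
)
print(uniprot_accessions(row_66))  # ['Q7Z3K3', 'O75475']
```

If the duplicates come from a `GROUP_CONCAT` aggregate in the query, using `GROUP_CONCAT(DISTINCT ...)` in SPARQL 1.1 may remove them at query time instead, though that depends on how the draft query is written.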