MOZI-AI / knowledge-import

Import scripts for the Bio-Atomspace

separate the import of or semantically distinguish computed vs curated UniProt proteins mapped to GeneNodes #24

Open mjsduncan opened 4 years ago

mjsduncan commented 4 years ago

as an example, entrez_to_protein_2020-04-01.scm contains

(EvaluationLink
    (PredicateNode "expresses")
    (ListLink
        (GeneNode "SAMD11")
        (MoleculeNode "Uniprot:A0A087WX24")))

while codingRNA_2020-04-01.scm contains

(EvaluationLink
    (PredicateNode "transcribed_to")
    (ListLink
        (GeneNode "SAMD11")
        (MoleculeNode "ENST00000420190")))

(EvaluationLink
    (PredicateNode "translated_to")
    (ListLink
        (MoleculeNode "ENST00000420190")
        (MoleculeNode "Uniprot:A6PWC8")))

if you look at this search of A6PWC8, you see that Q96NU1 and A0A087WX24 are different protein isoforms, that is, they have different amino acid sequences, but Q96NU1 has been verified by human curation while A0A087WX24 is an automated computational association.

depending on the analysis, either only the curated version should be imported, or curated and computationally derived associations should be semantically distinguishable.
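
one possible way to make the two kinds of association distinguishable is to emit a second link that tags each protein node with its review status. this is only a sketch: the "has_review_status" predicate and the "curated" / "computed" ConceptNode labels are hypothetical names, not something the current import scripts or the Bio-Atomspace schema define.

```python
def expresses_link(gene, accession):
    """Render a gene -> protein link in the same shape as the existing .scm files."""
    return (
        "(EvaluationLink\n"
        '    (PredicateNode "expresses")\n'
        "    (ListLink\n"
        f'        (GeneNode "{gene}")\n'
        f'        (MoleculeNode "Uniprot:{accession}")))'
    )


def review_status_link(accession, reviewed):
    """Tag a protein node as curated or computed.

    ASSUMPTION: "has_review_status", "curated" and "computed" are
    hypothetical names chosen for illustration only.
    """
    status = "curated" if reviewed else "computed"
    return (
        "(EvaluationLink\n"
        '    (PredicateNode "has_review_status")\n'
        "    (ListLink\n"
        f'        (MoleculeNode "Uniprot:{accession}")\n'
        f'        (ConceptNode "{status}")))'
    )


if __name__ == "__main__":
    # The two SAMD11 accessions discussed in this issue.
    for acc, reviewed in [("Q96NU1", True), ("A0A087WX24", False)]:
        print(expresses_link("SAMD11", acc))
        print(review_status_link(acc, reviewed))
```

with a tag like this an analysis can keep every association in the atomspace and filter on the status node, instead of needing a separate curated-only import.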

mjsduncan commented 4 years ago

entrez_to_protein_2020-04-01.scm is derived from entrez2uniprot.csv via gene2proteinMapping.py, and that csv is in turn the output of an R script. this needs to be replaced with a pipeline that reads directly from a current UniProt source.
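
a rough sketch of what the first step of such a pipeline could look like, assuming a tab-separated UniProt export with "Entry", "Reviewed" and "Gene Names (primary)" columns, where "Reviewed" holds "reviewed" or "unreviewed". those column names and values are an assumption about the export format and should be checked against the actual download. the two sample rows are illustrative, built from the accessions in this issue rather than copied from a real file, and the function names are hypothetical.

```python
import csv
import io

# Illustrative stand-in for a downloaded UniProt TSV (ASSUMED column names).
SAMPLE_TSV = (
    "Entry\tReviewed\tGene Names (primary)\n"
    "Q96NU1\treviewed\tSAMD11\n"
    "A0A087WX24\tunreviewed\tSAMD11\n"
)


def read_gene_protein_rows(handle):
    """Yield (gene, accession, is_reviewed) for each row that names a gene."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        gene = row["Gene Names (primary)"].strip()
        if not gene:
            continue  # skip entries with no primary gene name
        is_reviewed = row["Reviewed"].strip().lower() == "reviewed"
        yield gene, row["Entry"].strip(), is_reviewed


def split_by_review(rows):
    """Separate curated (reviewed) from computed (unreviewed) mappings."""
    curated, computed = [], []
    for gene, accession, is_reviewed in rows:
        (curated if is_reviewed else computed).append((gene, accession))
    return curated, computed


if __name__ == "__main__":
    curated, computed = split_by_review(
        read_gene_protein_rows(io.StringIO(SAMPLE_TSV))
    )
    print("curated:", curated)
    print("computed:", computed)
```

keeping the review flag next to each mapping from the start means the importer can either write only the curated list or write both lists with distinguishable links.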