Open mjsduncan opened 4 years ago
entrez_to_protein_2020-04-01.scm
is derived from entrez2uniprot.csv via gene2proteinMapping.py, and entrez2uniprot.csv is in turn the output of an R script. This chain needs to be replaced with a pipeline that reads directly from a current UniProt source.
as an example,
entrez_to_protein_2020-04-01.scm
contains
while
codingRNA_2020-04-01.scm
contains
If you look at this search of A6PWC8, you see that Q96NU1 and A0A087WX24 are different protein isoforms, that is, they have different amino acid sequences. Q96NU1 has been verified by human curation, while A0A087WX24 is an automated computational association.
Depending on the analysis, either only the curated version should be imported, or curated and computationally derived associations should be semantically distinguishable.
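
A rough sketch of how a replacement pipeline could keep the two kinds of association apart. It is not the existing gene2proteinMapping.py logic. It assumes a tab-separated UniProt export with `Entry`, `Reviewed` and `GeneID` columns, where `Reviewed` is `reviewed` for curated entries and `GeneID` holds semicolon-separated Entrez IDs; check these against the actual download. The function name `load_entrez_to_uniprot` and the IDs in `SAMPLE` are placeholders, not real data.

```python
import csv
import io
from collections import defaultdict


def load_entrez_to_uniprot(tsv_file):
    """Group UniProt accessions by Entrez GeneID and curation status.

    Assumed columns (verify against the real export): Entry, Reviewed, GeneID.
    """
    mapping = defaultdict(lambda: {"curated": [], "automated": []})
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        # Assumption: curated entries carry the value "reviewed".
        is_curated = row["Reviewed"].strip().lower() == "reviewed"
        status = "curated" if is_curated else "automated"
        # Assumption: several Entrez IDs may be listed, separated by ";".
        for gene_id in row["GeneID"].split(";"):
            gene_id = gene_id.strip()
            if gene_id:
                mapping[gene_id][status].append(row["Entry"])
    return dict(mapping)


# Placeholder rows only; these are not real accessions or gene IDs.
SAMPLE = (
    "Entry\tReviewed\tGeneID\n"
    "CURATED_ACC\treviewed\tGENE_ID_1;\n"
    "AUTOMATED_ACC\tunreviewed\tGENE_ID_1;\n"
)

if __name__ == "__main__":
    result = load_entrez_to_uniprot(io.StringIO(SAMPLE))
    for gene_id, proteins in result.items():
        print(gene_id, proteins)
```

With the status kept per accession, the .scm writer could either skip the `automated` list entirely or emit it with a distinguishable link type, depending on the analysis.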