MaastrichtU-IDS / d2s-project-template

📋 Template to build CWL workflows to convert data to an RDF Knowledge Graph and deploy services.
https://d2s.semanticscience.org/
MIT License

Fix DATE entities with multiple gene symbols #2

Closed vemonet closed 4 years ago

vemonet commented 4 years ago

Only a handful of entries are affected, but they create invalid URIs.

e.g. c(\"CALM1\", \"CALM2\", \"CALM3\")

This is due to the following rows in the source data:

Dataset Drug_name   Drug_ID(Stitch) Tissue  Cell_line_ID    Target(uniprot) Target(symbol)  Target_class    Pathway Pathway_size
U133A   leuprolide  CID000003911    Pituitary   NA  P30968  GNRHR   gpcr     Eukaryotic Translation Elongation  89
U133A   leuprolide  CID000003911    Pituitary   NA  P30968  GNRHR   gpcr     Growth hormone receptor signaling  41
U133A   leuprolide  CID000003911    Pituitary   NA  P30968  GNRHR   gpcr     Translation    163
U133A   leuprolide  CID000003911    Pituitary   NA  P30968  GNRHR   gpcr     Fatty acid, triacylglycerol, and ketone body metabolism    137

U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Transmission across Chemical Synapses  199
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Neuronal System    300
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Neurotransmitter Receptor Binding And Downstream Transmission In The  Postsynaptic Cell    144
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Activation of NMDA receptor upon glutamate binding and postsynaptic events 41
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Ras activation uopn Ca2+ infux through NMDA receptor   17
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   CREB phosphorylation through the activation of CaMKII  15
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Opioid Signalling  101
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Post NMDA receptor activation events   37
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   CREB phosphorylation through the activation of Ras 27
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   DARPP-32 events    32
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Phospholipase C-mediated cascade   63
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   G-protein mediated events  58
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Activation of Kainate Receptors upon glutamate binding 30
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   DAG and IP3 signaling  40
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   Calmodulin induced events  30
U133A   dibucaine   CID000003025    Caudatenucleus  NA  P62158  c("CALM1", "CALM2", "CALM3")    enzyme   EGFR interacts with phospholipase C-gamma  42
vemonet commented 4 years ago

We could remove the link between UniProt and gene symbol from the mapping entirely.

In that case, which should we trust: the UniProt ID or the gene symbol?

Or we could keep the UniProt - gene symbol link (it can serve as another source for the protein - gene association). We would then need to remove `c(` and `)` from the value, then split it.
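A minimal sketch of that cleanup in Python, assuming the value is either a plain symbol or an R-style `c("A", "B")` vector as in the rows above. The function name `split_gene_symbols` is hypothetical and not part of the existing workflow; the actual mapping might do this in another language or step.

```python
import re
from typing import List


def split_gene_symbols(value: str) -> List[str]:
    """Return the gene symbols found in a Target(symbol) cell.

    Handles a plain symbol such as GNRHR and an R-style vector such as
    c("CALM1", "CALM2", "CALM3"), with or without backslash-escaped quotes.
    """
    value = value.strip()
    match = re.fullmatch(r"c\((.*)\)", value)
    if match is None:
        # Not wrapped in c( ... ): treat the cell as a single symbol.
        return [value] if value else []
    symbols = []
    for part in match.group(1).split(","):
        # Drop whitespace, then any surrounding quotes or escaping backslashes.
        cleaned = part.strip().strip('"\\')
        if cleaned:
            symbols.append(cleaned)
    return symbols


if __name__ == "__main__":
    print(split_gene_symbols('c("CALM1", "CALM2", "CALM3")'))
    print(split_gene_symbols("GNRHR"))
```

Each returned symbol could then be turned into its own URI, so a row with several symbols yields several protein - gene links instead of one invalid URI.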

micheldumontier commented 4 years ago

HGNC should have all this, no?

vemonet commented 4 years ago

@micheldumontier Yes, it does. The relation was just easy to extract at the time, so I thought an additional source would be beneficial, but I didn't know about those few cases. In this case, we will just rely on HGNC for the has_gene_product relation.

I am still interested in checking why those gene symbols end up in that form, though.