MOZI-AI / annotation-scheme

Human Gene annotation service backend
GNU General Public License v3.0

Implement more link types. #117

Open linas opened 4 years ago

linas commented 4 years ago

Search performance could be improved by creating and using more bio-specific link and node types. For example:

(PathwayNode "R-HSA-166663") ;; holds tagged pathways

This would allow quicker discovery of all pathways.

(LocationLink (Molecule "Uniprot:A0A075B6P5") (Concept "extracellular region"))

The above is instead of

(Evaluation (Predicate "has_location")
 (List (Molecule "Uniprot:A0A075B6P5") (Concept "extracellular region")))

Maybe even

(LocationNode "extracellular region")

for specific named spatial locations.

Even simple nodes for different tags would help:

(SMPNode "SMP0000055") ;; instead of (ConceptNode "SMP0000055")
(UniNode "Uniprot:P80404") ;; instead of (MoleculeNode "Uniprot:P80404")
(HSANode "R-HSA-114608") ;; instead of (ConceptNode "R-HSA-114608")
(BioGridNode "Bio:121885") ;; instead of (ConceptNode "Bio:121885")

The above would remove the need for the regex searches that are currently used to find these things. One could also have a RegexNode, as described in opencog/atomspace#2474
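To illustrate why dedicated node types make the name scan unnecessary, here is a minimal Python sketch of a toy store indexed by node type. It is not the AtomSpace API: `ToyAtomStore`, `add`, `by_type`, and `by_name_prefix` are hypothetical names invented for this example, and a prefix test stands in for the regex.

```python
from collections import defaultdict


class ToyAtomStore:
    """Toy stand-in for an atom store indexed by node type (not the AtomSpace API)."""

    def __init__(self):
        self._by_type = defaultdict(set)

    def add(self, node_type, name):
        self._by_type[node_type].add(name)

    def by_type(self, node_type):
        # Direct index lookup: no node names are inspected.
        return sorted(self._by_type[node_type])

    def by_name_prefix(self, node_type, prefix):
        # Scan every node of the given type and test each name.
        return sorted(n for n in self._by_type[node_type] if n.startswith(prefix))


# Generic typing: every tag is a ConceptNode, so finding pathways needs a name scan.
generic = ToyAtomStore()
for name in ["SMP0000055", "R-HSA-114608", "Bio:121885", "extracellular region"]:
    generic.add("ConceptNode", name)
reactome_by_scan = generic.by_name_prefix("ConceptNode", "R-HSA-")

# Specific typing: the node type itself identifies the data source.
specific = ToyAtomStore()
specific.add("SMPNode", "SMP0000055")
specific.add("HSANode", "R-HSA-114608")
specific.add("BioGridNode", "Bio:121885")
specific.add("ConceptNode", "extracellular region")
reactome_by_type = specific.by_type("HSANode")

print(reactome_by_scan == reactome_by_type)
```

Both queries return the same node, but the scan touches every ConceptNode name while the typed lookup only reads the index entry for its type.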

As a general rule, any time one has a frequently-used EvaluationLink of the form

(EvaluationLink (Predicate "foo") (ListLink ...stuff...))

it would probably be an overall win to define a custom (FooLink ...stuff...) instead. Mostly this makes the AtomSpace a little smaller (fewer atoms) and pattern searches a little faster (less to explore). Whether or not this is worth it, I can't say. Maybe it would add extra complexity to other processing stages...
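The "fewer atoms" point can be checked with a small count. The sketch below models atoms as nested Python tuples and counts the distinct atoms in each encoding of the location example; `collect_atoms` is a hypothetical helper, and counting identical sub-expressions once is a simplifying assumption about how atoms are deduplicated.

```python
def collect_atoms(expr, seen=None):
    """Return the set of distinct atoms in a nested-tuple expression.

    Identical sub-expressions are counted once (a simplifying assumption).
    """
    if seen is None:
        seen = set()
    seen.add(expr)
    # A link is (type, child, child, ...) with tuple children;
    # a node is (type, "name") with a string in second position.
    for child in expr[1:]:
        if isinstance(child, tuple):
            collect_atoms(child, seen)
    return seen


molecule = ("Molecule", "Uniprot:A0A075B6P5")
location = ("Concept", "extracellular region")

evaluation_form = (
    "Evaluation",
    ("Predicate", "has_location"),
    ("List", molecule, location),
)
custom_link_form = ("LocationLink", molecule, location)

evaluation_count = len(collect_atoms(evaluation_form))
custom_count = len(collect_atoms(custom_link_form))
print(evaluation_count, custom_count)  # prints: 5 3
```

The Predicate node is shared by every relation of the same kind, so across many relations the saving is roughly one link atom per relation (Evaluation plus List versus a single LocationLink).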

mjsduncan commented 4 years ago

I like the idea of data-source-specific nodes: the node names could then be the exact reference ID, which could be appended to the URL for that data source to programmatically access the associated info.

I propose starting with:

ChEBINode and UniProtNode that inherit from MoleculeNode

and

ReactomeNode and SMPNode that inherit from a new PathwayNode
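As a rough sketch of what this inheritance buys, the Python below models the proposed hierarchy as a parent map and selects nodes by ancestor type. The `PARENT` map, the `is_a` helper, and the root type `"Node"` are assumptions made for illustration; this is not how the AtomSpace declares or resolves types.

```python
# Hypothetical parent map mirroring the proposal; "Node" is assumed as the root.
PARENT = {
    "MoleculeNode": "Node",
    "PathwayNode": "Node",
    "ChEBINode": "MoleculeNode",
    "UniProtNode": "MoleculeNode",
    "ReactomeNode": "PathwayNode",
    "SMPNode": "PathwayNode",
}


def is_a(node_type, ancestor):
    """Return True if node_type equals ancestor or inherits from it."""
    current = node_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


# Identifiers taken from the examples earlier in this thread.
nodes = [
    ("UniProtNode", "Uniprot:A0A075B6P5"),
    ("UniProtNode", "Uniprot:P80404"),
    ("ReactomeNode", "R-HSA-114608"),
    ("SMPNode", "SMP0000055"),
]

# A query on the parent type finds nodes from every pathway source.
pathways = [name for node_type, name in nodes if is_a(node_type, "PathwayNode")]
molecules = [name for node_type, name in nodes if is_a(node_type, "MoleculeNode")]
print(pathways)
print(molecules)
```

A search that only cares about "any pathway" can match on PathwayNode, while a source-specific search can still match on ReactomeNode or SMPNode directly.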

tanksha commented 4 years ago

@mjsduncan the same would apply to the others:

GoNode for GO terms, RNANode for RNA transcripts, CLNode for cell types?

Habush commented 4 years ago

@tanksha why not separate node types for each cell type instead of a single CLNode type? Won't cells be matched by their type in the search functions?

tanksha commented 4 years ago

@Habush CLNode is different; it's for cell ontologies, which are not in the bioAtomspace yet. It's like the GO ontologies (IDs start with GO:XXXXX): the cell ontology, or cell type, IDs start with CL:XXXXX.

mjsduncan commented 4 years ago

@tanksha for the other types I suggest:

for RNAs, a refseqNode and an ensemblNode that inherit from moleculeNode

for ontology concepts, an ontologyNode with names "GO:xxxx" and "CL:yyyy"

This way the node names can complete the reference URL for their respective databases.
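A minimal sketch of the "node name completes the URL" idea, in Python. The URL templates below are placeholders on example.org, not real database endpoints, and `URL_TEMPLATES` and `reference_url` are hypothetical names; the node names are the placeholder IDs from the comment above.

```python
# Placeholder URL templates: the example.org hosts are NOT real database endpoints.
URL_TEMPLATES = {
    "refseqNode": "https://example.org/refseq/{id}",
    "ensemblNode": "https://example.org/ensembl/{id}",
    "ontologyNode": "https://example.org/ontology/{id}",
}


def reference_url(node_type, name):
    """Build a lookup URL by appending the node name to its source's base URL."""
    return URL_TEMPLATES[node_type].format(id=name)


go_url = reference_url("ontologyNode", "GO:xxxx")
cl_url = reference_url("ontologyNode", "CL:yyyy")
print(go_url)
print(cl_url)
```

Because the node type selects the template and the name is used verbatim, no parsing of the identifier is needed to build the link.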

linas commented 4 years ago

@mjsduncan I'm suggesting distinct names, e.g. GoOntologyNode and ClOntologyNode, so that a pointless string search can be avoided during pattern search. The current search code checks whether the first three bytes of the string are GO:, which is massively inefficient during searches. There is a proposal to implement a RegexNode in the AtomSpace that would mostly fix this inefficiency; see opencog/atomspace#2474