semantics for bio-atomspace 2.0

mjsduncan commented 4 years ago

this issue will develop the custom nodes and semantics for optimally refactoring the bio-atomspace for atomspace 1.x. to the extent this refactoring is optimal, its weaknesses should serve as a guide to what atomspace 2.0 needs to improve on relative to atomspace 1.x

linas commented 4 years ago

You guys keep providing very narrow answers, without actually answering the questions I'm asking. I'm not sure if I am being unclear, or if you just don't know what the answer is. If you don't know the answer: just say "I don't know".

In particular: Mike -- if the answer is that you don't know, provide me with the name of the person who does know -- is it Ben? Kent? Some biologist somewhere? Who is driving these design decisions? They aren't coming out of thin air ... someone made the technical decision that biogrid interactors are an interesting thing to query. Who is that person? Can I talk to them?

Again -- I want to know why biogrid interactors are interesting, but tetragons are not... !? There are literally thousands of different queries I can think of, and out of these thousands, you guys decided to write code for 2 or 3 of them. Why only those? What is wrong with the other 998 interesting queries that one could make? Who is the person that made the decision that only these 2 or 3 should be coded, and not the others? How did you actually arrive at this code base, instead of a different code base?

mjsduncan commented 4 years ago

@linas you have been told the exact answer by hedra and by me, you just aren't parsing it. for the third time, these are to answer specific annotation queries; that is the whole point of the system. that's why i keep suggesting you actually try the code as it was designed to be used, through its browser interface, which you obviously haven't done. this is code for basic graph database functionality, not a research program.

linas commented 4 years ago

@mjsduncan I'm sorry I'm being misunderstood. I'm trying to phrase my questions in such a way that I can get them answered. You've provided replies, and thank you for that! None of the replies address the questions that I am asking, and I'm sorry, I guess I was being unclear. Let me try again, with a different approach: what, exactly, is the "whole point of the system", and who decided what that point is?

mjsduncan commented 4 years ago

the original point of the system was to demonstrate using opencog on singularitynet.io as a biomedical knowledge base, because singularitynet was willing to pay mozi to do that. the specific task was to show what information some common data sources (GO - a representative ontology, reactome - a representative pathway database, biogrid - a representative gene interaction database) contained that related to an input list of genes. now we're trying to expand that in various ways to make more general queries of more types of data possible - a one-knowledge-graph-to-rule-them-all. one use case is as a kind of inference control: make inference processes more efficient by producing a smaller knowledge graph that will still contain the info necessary to reach useful conclusions, based on user knowledge and hypergraph connections.

mjsduncan commented 4 years ago

@linas the point of the triangle and tetragon code, respectively, is to find the links between genes that link to an input gene, and to find the links between proteins that are expressed by input genes, since a gene can express more than one protein. the biogrid database has interaction info abstracted to the gene level only; that's one reason why STRING is superior as an interaction database: it uses the protein-level info directly.
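
for illustration only, here is a minimal sketch of the triangle idea in plain python rather than the atomspace pattern matcher. the gene names and the edge list are made-up placeholders, not biogrid records, and `build_neighbors` and `triangles_for` are hypothetical helper names.

```python
from itertools import combinations

# Toy, made-up interaction pairs; gene names are placeholders, not biogrid data.
interactions = [
    ("G1", "G2"), ("G1", "G3"), ("G2", "G3"),
    ("G1", "G4"), ("G4", "G5"),
]

def build_neighbors(edges):
    """Undirected adjacency map: gene -> set of genes it interacts with."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors

def triangles_for(gene, neighbors):
    """Pairs of interactors of `gene` that also interact with each other."""
    partners = sorted(neighbors.get(gene, ()))
    return [(a, b) for a, b in combinations(partners, 2) if b in neighbors[a]]

neighbors = build_neighbors(interactions)
print(triangles_for("G1", neighbors))
```

in this toy data G1 interacts with G2, G3 and G4, but only G2 and G3 interact with each other, so that pair is the only one that closes a triangle with G1.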

linas commented 4 years ago

@mjsduncan OK .. so .. the original version, I guess we should call it 1.0, hard-coded a handful of queries - for triangles, for pentagons, a few other things. For version 2.0 there seem to be two or three options:

(a) Hard-code more queries for yet other kinds of relationships between assorted things.
(b) Remove the hard-coding, and instead, assemble search queries from some GUI - from some diagrams that the user specifies in some interactive system.
(c) Some combination of the two above.

My gut instinct says that (b) is better, although it might be hard, because it will take some script-fu to construct a valid pattern search based on things the user selected from menus and push-buttons (or even hand-drawn diagrams, if your GUI supports that).
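
To make (b) a bit more concrete, here is a rough sketch in plain Python, not the AtomSpace pattern matcher. It assumes the GUI hands back a list of selections, each naming a relation and two endpoints, where a name starting with `?` is a variable. The facts, the selection format, and the helpers `build_pattern` and `match` are all hypothetical, and relations are treated as directed for simplicity.

```python
# Toy facts as (relation, subject, object) triples; all names are placeholders.
facts = {
    ("interacts", "G1", "G2"),
    ("interacts", "G2", "G3"),
    ("expresses", "G2", "P2a"),
    ("expresses", "G2", "P2b"),
    ("expresses", "G3", "P3"),
}

def is_var(term):
    """Terms that start with '?' are variables; everything else is a constant."""
    return term.startswith("?")

def build_pattern(selections):
    """Turn GUI-style selections (dicts) into a list of clause triples."""
    return [(s["relation"], s["from"], s["to"]) for s in selections]

def match(pattern, facts, binding=None):
    """Yield every variable binding that satisfies all clauses (naive backtracking)."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    (rel, subj, obj), rest = pattern[0], pattern[1:]
    for f_rel, f_subj, f_obj in sorted(facts):
        if f_rel != rel:
            continue
        trial = dict(binding)
        ok = True
        for term, value in ((subj, f_subj), (obj, f_obj)):
            if is_var(term):
                # Bind the variable, or reject if it is already bound elsewhere.
                if trial.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, facts, trial)

# What a user might have picked from menus: "genes G1 interacts with,
# and the proteins those genes express".
selections = [
    {"relation": "interacts", "from": "G1", "to": "?gene"},
    {"relation": "expresses", "from": "?gene", "to": "?protein"},
]
results = list(match(build_pattern(selections), facts))
print(results)
```

The matcher never changes when the user picks a different diagram; only the `selections` list does. The hard part hinted at above is validating those selections and ordering the clauses sensibly, which this sketch does not attempt.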

But if you go for (a), then I'm back to the question in https://github.com/MOZI-AI/knowledge-import/issues/10#issuecomment-635096512 - what are the worth-while queries to code up, and how do you know what they are?

If you go for (b) then there are other interesting things that happen. One of the harder things will be "query optimization". For example, I can search for all tetragons by brute force, or I can precompute (find in advance) all triangles, and then just add one more point to get the tetragon. I discovered, of course, that the brute-force approach is 10x slower .. the difference between hours and days, or days and weeks.
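
Here is a small sketch of those two strategies in plain Python, not AtomSpace code. The exact shape of a "tetragon" is an assumption made for the example: a triangle plus a fourth node linked to at least two of the triangle's corners. The graph and all function names are hypothetical, and the sketch only checks that both strategies agree; it says nothing about real timings.

```python
from itertools import combinations

# Toy undirected graph; node names are placeholders.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
         ("C", "D"), ("D", "E"), ("E", "F")]

def build_neighbors(edges):
    """Undirected adjacency map: node -> set of adjacent nodes."""
    nb = {}
    for a, b in edges:
        nb.setdefault(a, set()).add(b)
        nb.setdefault(b, set()).add(a)
    return nb

def is_triangle(a, b, c, nb):
    return b in nb[a] and c in nb[a] and c in nb[b]

def is_tetragon(quad, nb):
    """Assumed shape: some triangle in `quad` whose leftover node touches 2+ corners."""
    for tri in combinations(quad, 3):
        if is_triangle(*tri, nb):
            (d,) = set(quad) - set(tri)
            if len(nb[d] & set(tri)) >= 2:
                return True
    return False

def tetragons_brute_force(nb):
    """Test every 4-node subset of the graph."""
    return {frozenset(q) for q in combinations(sorted(nb), 4) if is_tetragon(q, nb)}

def find_triangles(nb):
    """The precomputation step: list all triangles once."""
    return [t for t in combinations(sorted(nb), 3) if is_triangle(*t, nb)]

def tetragons_from_triangles(triangles, nb):
    """Extend each known triangle by one neighboring node."""
    found = set()
    for tri in triangles:
        tri_set = set(tri)
        candidates = set().union(*(nb[v] for v in tri)) - tri_set
        for d in candidates:
            if len(nb[d] & tri_set) >= 2:
                found.add(frozenset(tri_set | {d}))
    return found

nb = build_neighbors(edges)
triangles = find_triangles(nb)
print(tetragons_brute_force(nb) == tetragons_from_triangles(triangles, nb))
```

The second strategy only looks at nodes adjacent to a triangle that is already known, instead of every 4-node subset, which is where the saving would come from on a large sparse graph. Deciding automatically when such a rewrite pays off is the "query optimization" problem.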

If you search for "query optimization" you will find thousands of blog entries talking about arcane tricks and overblown marketing hype on solving assorted big-data search problems, ... so this is not new territory, it's endemic to big-data. However, it is new for the atomspace. And it's also technically challenging (if it were easy, there wouldn't be blog entries and marketing hype) ... and so I'm interested in that...

mjsduncan commented 4 years ago

a specific query we're working on is "what drugs affect gene expression in a way that causally opposes the action of genes whose expression is changed by viral infection?" based on drug interaction databases, whose first-draft import scripts are here and here. there is also a biogrid scheme file with sars2 gene & inferred protein interactions here.
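
a toy sketch of the shape of that query, in plain python with entirely made-up data. it assumes "causally opposes" can be reduced to a sign flip on expression direction (+1 up, -1 down), which is a simplification for illustration; the gene and drug names and the `opposing_drugs` helper are hypothetical.

```python
# Hypothetical toy data: +1 = expression goes up, -1 = expression goes down.
infection_effect = {"GENE_A": +1, "GENE_B": -1, "GENE_C": +1}
drug_effect = {
    "drug_x": {"GENE_A": -1, "GENE_B": +1},
    "drug_y": {"GENE_A": +1},
    "drug_z": {"GENE_C": -1, "GENE_D": -1},
}

def opposing_drugs(infection_effect, drug_effect):
    """Map each drug to the sorted genes it pushes against the infection's direction."""
    hits = {}
    for drug, effects in drug_effect.items():
        opposed = sorted(
            gene for gene, sign in effects.items()
            # Genes untouched by the infection default to 0 and never match.
            if infection_effect.get(gene, 0) == -sign
        )
        if opposed:
            hits[drug] = opposed
    return hits

print(opposing_drugs(infection_effect, drug_effect))
```

in the real knowledge graph the two dictionaries would be replaced by links imported from the drug interaction and infection expression data, and the sign comparison by whatever causal semantics the atomspace representation ends up using.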
