MOZI-AI / knowledge-import

Import scripts for the Bio-Atomspace

semantics for bio-atomspace 2.0 #10

Closed mjsduncan closed 4 years ago

mjsduncan commented 4 years ago

this issue will develop the custom nodes and semantics for optimally refactoring the bio-atomspace for atomspace 1.x. to the extent that this refactoring is optimal, its weaknesses should serve as a guide for what atomspace 2.0 needs to improve on in atomspace 1.x.

leungmanhin commented 4 years ago

Semi-related to this, in the TCMID that I'm working with, there's data denoting which ingredients / chemicals interact with a Uniprot molecule or a gene, e.g. "abietic acid" interacts_with "Uniprot:Q9Y6L6". The usual practice at the moment is to replace the ingredient name with its corresponding ChEBI ID, so for example instead of generating:

EvaluationLink
  PredicateNode "interacts_with"
  ListLink
    MoleculeNode "abietic acid"
    MoleculeNode "Uniprot:Q9Y6L6"

it's preferred to have:

EvaluationLink
  PredicateNode "interacts_with"
  ListLink
    MoleculeNode "ChEBI:28987" ;; ChEBI ID for abietic acid
    MoleculeNode "Uniprot:Q9Y6L6"

However, it's not always possible to find the ChEBI ID for a given ingredient in the dataset. Sometimes it really doesn't exist there, and sometimes a slight variation in its name means that no exact match is found right away. In this case, we would try to find it in PubChem and put its PubChem ID in place of the ChEBI ID, e.g. "glycerin" doesn't have a ChEBI ID but does have a PubChem ID, so we'll then have:

EvaluationLink
  PredicateNode "interacts_with"
  ListLink
    MoleculeNode "PubChem:753" ;; PubChem ID for glycerin
    MoleculeNode "Uniprot:Q9HAZ2"

As a result, we may have some EvaluationLinks with ChEBI ID, some with PubChem ID, and some with other IDs if we import more databases in the future. There are also ingredients that don't exist in any of the databases that we are working with, resulting in a MoleculeNode with the name of the ingredient (or not generating it at all), making it look a bit inconsistent. There may also be concern about which database we should use as the "main" one (e.g. ChEBI is preferred over PubChem?). So I'm wondering if we should just maintain our own IDs, and associate the external names/IDs/properties with it directly? For example:

EvaluationLink
  PredicateNode "interacts_with"
  ListLink
    MoleculeNode "AID:4321" ;; Our own ID for abietic acid
    MoleculeNode "AID:7654"

EvaluationLink
  PredicateNode "has_name"
  ListLink
    MoleculeNode "AID:4321"
    MoleculeNode "abietic acid" ;; or some other node type

EvaluationLink
  PredicateNode "has_uniprot_id"
  ListLink
    MoleculeNode "AID:7654"
    MoleculeNode "Uniprot:Q9Y6L6"

EvaluationLink
  PredicateNode "has_chebi_id"
  ListLink
    MoleculeNode "AID:4321"
    MoleculeNode "ChEBI:28987"

EvaluationLink
  PredicateNode "has_pubchem_id"
  ListLink
    MoleculeNode "AID:4321"
    MoleculeNode "PubChem:10569"

; ... etc

The downside is that more atoms will be generated this way but it seems cleaner and we just need to link more things to the same entity if available. What do you think?
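
For illustration only, the indirection proposed above can be sketched outside the AtomSpace as a small registry in plain Python. The `MoleculeRegistry` class, its method names and the `AID:` counter are hypothetical and not part of knowledge-import; the sketch also assumes that any shared external value (including the name) is enough to decide that two records describe the same entity, which is exactly the part that needs care in practice.

```python
from dataclasses import dataclass, field


@dataclass
class MoleculeRegistry:
    """Toy registry: one internal ID per entity, many external IDs per internal ID."""
    next_id: int = 1
    by_external: dict = field(default_factory=dict)  # "ChEBI:28987" -> "AID:1"
    externals: dict = field(default_factory=dict)    # "AID:1" -> {"chebi": "ChEBI:28987"}

    def register(self, **external_ids):
        """Return the internal ID for an entity, creating one if no given ID is known."""
        known = [self.by_external[v] for v in external_ids.values()
                 if v in self.by_external]
        if known:
            aid = known[0]
        else:
            aid = f"AID:{self.next_id}"
            self.next_id += 1
        record = self.externals.setdefault(aid, {})
        for source, value in external_ids.items():
            record[source] = value
            self.by_external[value] = aid
        return aid


registry = MoleculeRegistry()
abietic = registry.register(name="abietic acid", chebi="ChEBI:28987")
# A later import that only knows the name and the PubChem ID lands on the same entity.
same = registry.register(name="abietic acid", pubchem="PubChem:10569")
```

Each entry in `externals` corresponds to one of the has_name / has_chebi_id / has_pubchem_id links above, so the extra atoms are the price of having a single place to attach every external identifier.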

noskill commented 4 years ago

@ngeiswei

mjsduncan commented 4 years ago

here are atomese translations of link types from an old version of the sif format used by cytoscape. pathway commons had downloadable sif versions of different pathway dbs, including reactome. This includes indirect links between proteins, for instance CATALYSIS-PRECEDES, which links two proteins that successively alter molecules in a metabolic pathway.

REACTS-WITH
(EvaluationLink
  (PredicateNode "reacts with")
  (ListLink
    (ConceptNode "$P1")
    (ConceptNode "$P2")))

INTERACTS-WITH
(EvaluationLink
  (PredicateNode "interacts with")
  (ListLink
    (ProteinNode "$P1")
    (ProteinNode "$P2")))

CONTROLS-STATE-CHANGE-OF
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "controls state change of")
    (ListLink
      (ProteinNode "$P1")
      (ProteinNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "interacts with")
      (ListLink
        (ProteinNode "$P1")
        (ProteinNode "$P2")))
    (EvaluationLink
      (PredicateNode "is modified")
      (ProteinNode "$P2"))))

CONTROLS-PHOSPHORYLATION-OF
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "controls phosphorylation of")
    (ListLink
      (ProteinNode "$P1")
      (ProteinNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "interacts with")
      (ListLink
        (ProteinNode "$P1")
        (ProteinNode "$P2")))
    (EvaluationLink
      (PredicateNode "is phosphorylated")
      (ProteinNode "$P2"))))

USED-TO-PRODUCE
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "used to produce")
    (ListLink
      (ConceptNode "$P1")
      (ConceptNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "reacts with")
      (ListLink
        (ConceptNode "$P1")
        (ConceptNode "$P2")))
    (EvaluationLink
      (PredicateNode "quantity increased")
      (ConceptNode "$P2"))))

CONTROLS-PRODUCTION-OF
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "controls production of")
    (ListLink
      (ProteinNode "$P1")
      (ConceptNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "interacts with")
      (ListLink
        (ProteinNode "$P1")
        (ConceptNode "$P2")))
    (EvaluationLink
      (PredicateNode "quantity increased")
      (ConceptNode "$P2"))))

CONSUMPTION-CONTROLLED-BY
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "consumption controlled by")
    (ListLink
      (ConceptNode "$P1")
      (ProteinNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "interacts with")
      (ListLink
        (ConceptNode "$P1")
        (ProteinNode "$P2")))
    (EvaluationLink
      (PredicateNode "quantity decreased")
      (ConceptNode "$P1"))))

CATALYSIS-PRECEDES
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "catalysis precedes")
    (ListLink
      (ProteinNode "$P1")
      (ProteinNode "$P2")))
  (ExistsLink
    (VariableList
      (VariableNode "S1")
      (VariableNode "S2"))
    (AndLink
      (EvaluationLink
        (PredicateNode "controls production of")
        (ListLink
          (ProteinNode "$P1")
          (VariableNode "S1")))
      (EvaluationLink
        (PredicateNode "used to produce")
        (ListLink
          (VariableNode "S1")
          (VariableNode "S2")))
      (EvaluationLink
        (PredicateNode "consumption controlled by")
        (ListLink
          (VariableNode "S1")
          (ProteinNode "$P2")))
      (EvaluationLink
        (PredicateNode "controls production of")
        (ListLink
          (ProteinNode "$P2")
          (VariableNode "S2"))))))

CHEMICAL-AFFECTS
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "chemical affects")
    (ListLink
      (ConceptNode "$P1")
      (ProteinNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "interacts with")
      (ListLink
        (ConceptNode "$P1")
        (ProteinNode "$P2")))
    (EvaluationLink
      (PredicateNode "is modified")
      (ProteinNode "$P2"))))

CONTROLS-EXPRESSION-OF
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "controls expression of")
    (ListLink
      (ProteinNode "$P1")
      (ProteinNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "changes quantity")
      (ProteinNode "$P1"))
    (EvaluationLink
      (PredicateNode "changes quantity")
      (ProteinNode "$P2"))))

CONTROLS-TRANSPORT-OF
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "controls transport of")
    (ListLink
      (ProteinNode "$P1")
      (ProteinNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "interacts with")
      (ListLink
        (ProteinNode "$P1")
        (ProteinNode "$P2")))
    (EvaluationLink
      (PredicateNode "changes location")
      (ProteinNode "$P2"))))

CONTROLS-TRANSPORT-OF-CHEMICAL
(EquivalenceLink
  (EvaluationLink
    (PredicateNode "controls transport of chemical")
    (ListLink
      (ProteinNode "$P1")
      (ConceptNode "$P2")))
  (PredictiveImplicationLink
    (EvaluationLink
      (PredicateNode "interacts with")
      (ListLink
        (ProteinNode "$P1")
        (ConceptNode "$P2")))
    (EvaluationLink
      (PredicateNode "changes location")
      (ConceptNode "$P2"))))

leungmanhin commented 4 years ago

@mjsduncan You mentioned in the call yesterday that you want to have a standardized human-readable name for each molecule (in place of the "AID" in the comment I made above), so I'm wondering which DB we should use for the "standard" names?

Habush commented 4 years ago

This issue by @linas is relevant here

mjsduncan commented 4 years ago

i think we should use ChEBI as the standard since we are already importing some of the ChEBI ontology via GOplus. ChEBI is integrated with the other bio-databases funded by the european union: (https://elixir-europe.org/platforms/data/core-data-resources)

mjsduncan commented 4 years ago

@linas suggestions: https://github.com/MOZI-AI/annotation-scheme/issues/117

noskill commented 4 years ago

EvaluationLink
  PredicateNode "has_chebi_id"
  ListLink
    MoleculeNode "AID:4321"
    MoleculeNode "ChEBI:28987"

There is an issue with this approach: while this means that MoleculeNode "AID:4321" and MoleculeNode "ChEBI:28987" refer to the same entity, no code in the atomspace will use that fact. The same is true for the pattern miner.

I think we can update DefineLink so that it supports arbitrary node types. DefineLink allows assigning new names to existing DefinedSchema and some other types, so when the pattern matcher encounters an alternative name, it replaces it with the node it points to.

linas commented 4 years ago

When this data is stored in a file, it would be more efficient to write

(define qi (PredicateNode "quantity increased"))
(EvaluationLink qi
  (ListLink
    (MoleculeNode "ChEBI:28987") (MoleculeNode "ChEBI:1111")))

This does two things:

By contrast, with the (define qi ...) form, the atomspace interaction is done only once; guile remembers the result, and every time qi is used, the atom is already there, ready to go; no new atomspace interactions are required.

There is also a tiny amount of savings by writing this:

(define ev EvaluationLink)
(define ll ListLink)
(define m MoleculeNode)
(ev qi (ll (m "ChEBI:28987") (m "ChEBI:1111")))

which makes the whole file much smaller .... I believe that simply having fewer bytes in the file makes it faster to read in. (and of course, smaller to download) I have not measured the performance of this last suggestion, so I dunno.
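
The file-size half of this claim is easy to check without an AtomSpace. The following is a rough sketch in plain Python that only counts bytes; the 1000 molecule pairs are synthetic, the helper names are made up for the illustration, and nothing here measures load time.

```python
def verbose_form(a, b):
    """One link written out in full, as an importer would typically emit it."""
    return ('(EvaluationLink (PredicateNode "quantity increased") '
            f'(ListLink (MoleculeNode "{a}") (MoleculeNode "{b}")))')


def short_form(a, b):
    """The same link using the abbreviations defined once in the file header."""
    return f'(ev qi (ll (m "{a}") (m "{b}")))'


HEADER = "\n".join([
    '(define qi (PredicateNode "quantity increased"))',
    "(define ev EvaluationLink)",
    "(define ll ListLink)",
    "(define m MoleculeNode)",
])

# Synthetic IDs, one link per line (+1 for each newline).
pairs = [(f"ChEBI:{i}", f"ChEBI:{i + 1}") for i in range(1000)]
verbose_bytes = sum(len(verbose_form(a, b)) + 1 for a, b in pairs)
short_bytes = len(HEADER) + 1 + sum(len(short_form(a, b)) + 1 for a, b in pairs)
```

Comparing `short_bytes` with `verbose_bytes` shows how much of each line is link-type boilerplate; whether the smaller file also loads faster is the part that remains unmeasured.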

noskill commented 4 years ago

@linas Could you please help: is there an easy way to replace one node with another in atomese, or to create new links with the replacement in a child atomspace?

linas commented 4 years ago

is there easy way to replace one node with another in atomese, or to create new links with replacement in child atomspace?

I don't understand the question. You proposed using DefineLink, up above. That should work. There's no unit test for it working in a child atomspace, but I think it should work there. You would have to remove the original Define first, before making a new Define ... and I bet that removing atoms in a child atomspace is broken ... that's a bug.

The StateLink is a lot like DefineLink, except that you can change it at any time. But, unlike the DefineLink, it is not automatically expanded in the pattern matcher. I'll bet that StateLink acts weird in child atomspaces, too. That's another bug.

Oh, but if your question was specifically "does DefineLink work with child atomspaces?", the answer is "no, it's almost surely buggy" and "someone should fix the bugs". The fixes would not be easy: some trickery would be needed in the code.

noskill commented 4 years ago

Those are different use cases/approaches: 1) keep different names, but connect them with DefineLinks; 2) do preprocessing and replace inconvenient nodes.

I wish both use cases were easy for an atomspace user. Michael thinks that in the majority of use cases we had better use approach 2. So do you know how to replace one node with another in a hypergraph using atomese?

linas commented 4 years ago

Those are different usecases/approaches

I still don't understand. What are you trying to do? What's inconvenient? What's the problem that needs to be solved?

how to replace one node with another

Sure, easy. for-each(incoming set of old node) { create new link w/new node } and then delete-recursive old node. Approx dozen lines of code.

using atomese

No, at least, not the first part. The second part, yes, since DeleteLink is recursive:

(cog-execute! (Put (Delete (Variable "x")) (OldAtom)))

The Put installs OldAtom into Delete, which goes "I cannot exist like this" and self-destructs.

Do you really have to have the old->new replacement in atomese? can't you just do it in scheme or python?
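
To make the pseudocode above concrete, here is a minimal sketch in plain Python over a toy hypergraph. It does not use the opencog bindings: atoms are modelled as nested tuples, the "atomspace" is just a set of top-level atoms, and the function names are invented for the example.

```python
def is_node(atom):
    """Nodes are (type, name); links are (type, child, child, ...)."""
    return len(atom) == 2 and isinstance(atom[1], str)


def substitute(atom, old, new):
    """Rebuild `atom` with every occurrence of `old` swapped for `new`."""
    if atom == old:
        return new
    if is_node(atom):
        return atom
    return (atom[0],) + tuple(substitute(child, old, new) for child in atom[1:])


def replace_node(atoms, old, new):
    """Create a new link for each link that holds `old`, then drop `old` itself."""
    return {substitute(atom, old, new) for atom in atoms if atom != old}


old = ("MoleculeNode", "abietic acid")
new = ("MoleculeNode", "ChEBI:28987")
atoms = {
    old,
    ("EvaluationLink",
     ("PredicateNode", "interacts_with"),
     ("ListLink", old, ("MoleculeNode", "Uniprot:Q9Y6L6"))),
}
rewritten = replace_node(atoms, old, new)
```

Because links are immutable here, "replace" really means "build the new link and forget the old one", which is the same shape as the for-each / delete-recursive recipe.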

noskill commented 4 years ago

Do you really have to have the old->new replacement in atomese? can't you just do it in scheme or python?

I don't have to, sure, but using other languages for tasks that are best suited to atomese kind of destroys my motivation for using it at all. The wiki page written by you, https://wiki.opencog.org/w/Atomese, says the atomspace is a graph rewriting system. So I want to load some ontology into the atomspace as it is and then rewrite it.

linas commented 4 years ago

and then rewrite it.

It would be useful if you explained what the problem actually is, that needs to be solved. I do not understand what it is that you are trying to do, and why the existing solutions do not work for you. As a result, I cannot make any further suggestions. You seem to be unhappy about having to use the atomspace ... but you don't articulate why. I cannot help if I don't understand what the problem actually is.

mjsduncan commented 4 years ago

@linas one of the use cases for what @anatoly is asking about is that there are multiple external dbs being imported with the same entities but that include different contextual information or even contradictory information. how can we indicate that one is replaceable by the other in an inference process? my understanding is that there are 2 options:

  1. DefineLink NodeX NodeY, DefineLink NodeZ NodeY to set NodeY as the bio-atomspace reference for that particular entity, or
  2. SimilarityLink NodeX NodeY, SimilarityLink NodeY NodeZ to make X, Y, Z equivalent without preferring a particular representation.

On Tue, Apr 28, 2020 at 5:14 PM Linas Vepštas notifications@github.com wrote:

and then rewrite it.

It would be useful if you explained what the problem actually is, that needs to be solved. I do not understand what it is that you are trying to do, and why the existing solutions do not work for you. As a result, I cannot make any further suggestions. You seem to be unhappy about having to use the atomspace ... but you don't articulate why. I cannot help if I don't understand what the problem actually is.


linas commented 4 years ago

@noskill suggested using DefineLink himself, in https://github.com/MOZI-AI/knowledge-import/issues/10#issuecomment-620041520 and I gave it a thumbs up. He then asked about child atomspaces and I replied explicitly in https://github.com/MOZI-AI/knowledge-import/issues/10#issuecomment-620733102 that that's OK and should work too (clarifying that conflicting defines wouldn't work. By "conflicting" I mean having (Define A something) in parent atomspace, and also (Define A other-thing) in child atomspace, because now you don't know which one A is supposed to be. This is technically a bug and would need to be fixed. But, as far as I can tell, no one wants to use defines like this.)

So all the proposed solutions seemed OK to me, but there was continued unhappiness, and I don't understand what there is to be unhappy about.

To repeat:

  1. DefineLink NodeX NodeY, DefineLink NodeZ NodeY to set NodeY as the bio-atomspace reference for that particular entity,

Should work just fine. Pattern matcher handles this automatically.

  2. SimilarityLink NodeX NodeY, SimilarityLink NodeY NodeZ to make X, Y, Z equivalent without preferring a particular representation.

This works too, but you would have to create search patterns that explicitly look for similarities. They are not automatically explored.
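
The practical difference between the two options can be sketched in plain Python. This is only an analogy, not how the pattern matcher is implemented: option 1 is modelled as a lookup table that is applied automatically, option 2 as a list of symmetric pairs that has to be searched explicitly. Treating similarity as transitive is an assumption made for the sketch.

```python
# Option 1 (DefineLink-like): every alias points at one preferred node.
canonical = {"NodeX": "NodeY", "NodeZ": "NodeY"}


def resolve(name):
    """Follow alias -> reference; names without an alias are their own reference."""
    return canonical.get(name, name)


# Option 2 (SimilarityLink-like): symmetric pairs, no preferred representative.
similar = [("NodeX", "NodeY"), ("NodeY", "NodeZ")]


def equivalents(name):
    """Collect everything reachable through similarity pairs by explicit search."""
    seen, frontier = {name}, [name]
    while frontier:
        current = frontier.pop()
        for a, b in similar:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return seen
```

With option 1 a query only ever sees `resolve(name)`; with option 2 every query that cares about equivalence has to call something like `equivalents(name)` itself.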

vsbogd commented 4 years ago

I believe that by "graph rewriting" Anatoly means the following. If we have something like:

(Member
  (Concept "X")
  (Concept "S"))
(Evaluation
  (Predicate "P")
  (List
    (Concept "X")))

where (Concept "X") is used under a few different links, then we cannot write a single BindLink that replaces (Concept "X") by (Concept "Y") in all contexts to get:

(Member
  (Concept "Y")
  (Concept "S"))
(Evaluation
  (Predicate "P")
  (List
    (Concept "Y")))

ngeiswei commented 4 years ago

Yes, what @noskill wants is a rule of replacement https://en.wikipedia.org/wiki/Rule_of_replacement for atomese. The trick is that it must recurse over all incoming links. I don't think we have that out-of-the-box, but it should be relatively easy to implement with the rule engine. In fact we already have something similar with the URE-based reduct engine (which is too messy and experimental to add a link ATM).

I'm gonna attempt to build an example of that and will post it here.

linas commented 4 years ago

Rule of replacement

In https://github.com/MOZI-AI/knowledge-import/issues/10#issuecomment-620756699 I describe the rule of replacement. It's about 10 lines of code. Maybe less if you are clever. The implementation does NOT require the rule engine! It's dirt-simple! I mean it's like really really really dirt simple. I already gave you the pseudocode; what's wrong with that?

I mean, we have 2 or 3 or 4 places in the C++ code where replacement is done: it's usually to replace a variable node by some other atom. There are multiple variations that are custom-tailored to suit the particular use-cases, and so they usually do more than simple replacement. For example, Instantiator.cc originally did just simple replacement, but over the years accumulated a huge amount of extra cruft. The alpha-rewriting code is another example: it goes through and replaces every instance of X by an instance of Y. There are several other spots in the C++ code where replacing X by Y is needed, and is done, on the spot.

I'd write this code for you, but do you want scheme? C++? Python? Again, the pseudocode is this: for-each(incoming set of old node) { create new link w/new node } this is not hard.

The C++ code has this file: atoms/core/FindUtils.h which provides a large variety of tools for finding atoms in different contexts, because sometimes you do want to substitute Y for X and sometimes you don't: examples include making a substitution only if X is unquoted (this is code that Nil added) or substituting X only if it is not in a scoped context.

ngeiswei commented 4 years ago

There's nothing wrong with having replacement implemented in C++ or Scheme. It's just that since Atomese is kind of a rewriting system, it feels it should be able to do it out of the box. It's also somewhat a form of reasoning. BTW, I implemented it in the URE, I didn't test it yet but it should work.

linas commented 4 years ago

do it out of the box.

Again: I don't understand. Replacement tends to be highly-case-specific. When doing alpha-conversion, the replacement is only on those atoms that are in scope. When doing unification, you're doing it only on the graphs to be unified. The Atomese version of replacement is called PutLink, which explicitly replaces the declared locations with the provided contents. In general, there's a variety of different situations that call for substitution, and they all tend to have slightly different requirements. For example, in C++ Substitutor.h provides generic substitution, but VariableList.h extends this with a huge amount of extra features to allow all kinds of complex, fancy replacements to be done.

Without explaining why you can't use PutLink or MapLink to do the job, or just writing a simple for-loop iterating over the incoming set ... I'm stumped ...because you still haven't explained what it is that you are actually trying to do.

linas commented 4 years ago

OK @noskill @vsbogd @ngeiswei I pondered what you said, and I think I have a proposal that solves the generic problem that you describe. It's at https://wiki.opencog.org/w/ContainerLink. It's a rather long proposal, with many examples. That means that I think I like it - it's powerful enough to handle many different use cases in a consistent manner. It seems to be a reasonable swiss-army-knife for matching unknown containing graphs. Also, rather than further pollute this issue with this thread, let's move the conversation to a new thread -- here: opencog/atomspace#2562

ngeiswei commented 4 years ago

Yeah, that could be an option, and potentially more efficient and compact than the URE implementation I'm gonna send shortly (though perhaps less controllable, that remains to be seen in practice).

linas commented 4 years ago

URE implementation

Can you set up a mock prototype somewhere, e.g. on a wiki page?

https://wiki.opencog.org/w/ContainerLink

ping @noskill @vsbogd please take a look. Implementing at least the most basic parts of ContainerLink seems to be easy, and I seem to be on a coding jag this week. Next week, I might not feel like doing this :-) Strike while the iron is hot.

ngeiswei commented 4 years ago

Here's the URE replacement example I promised

https://github.com/opencog/ure/tree/master/examples/ure/replacement

linas commented 4 years ago

https://github.com/opencog/ure/tree/master/examples/ure/replacement

Looks nice!

@ngeiswei FYI, later today I'll push and merge a second pass at JoinLink, and I think it would be better to re-do the rb.scm in that example to use JoinLink.

Also, FYI: QueryLink is identical to BindLink, except that it returns a LinkValue instead of a SetLink, so the atomspace doesn't get polluted with random SetLinks that have to be manually removed. Similarly, MeetLink is the same as GetLink, but without the Set. It's not urgent (or even important) to replace one by the other... it might be more convenient .. I dunno. Whether or not it's faster is unknown/unmeasured. (Should be a little faster, but how much? probably not much at all.) So this change is more about having a less-crufty API than anything else.

Oh .. and I should mention: JoinLink does NOT return SetLinks! It returns LinkValues ... I have no clue if this craters URE or not.

noskill commented 4 years ago

I think it's worth verifying all imported data against atomese templates, analogous to how it is done with JSON Schema and XML Schema.

linas commented 4 years ago

verify all imported data against atomese templates

FYI, there is a built-in type-checker for simple kinds of types in ClassServer::addValidator(), but this is NOT enough for general-purpose use, e.g. by this project.

There is a TypedAtomLink which is meant to hold a definition of a type, see https://wiki.opencog.org/w/ArrowLink for a BIO-AI example from many many years ago -- but it has never been used, as far as I know.

I am currently working on getting more of the "sheaf theory" stuff implemented. It is meant to provide a more generic mechanism than types/typing: it defines the ways that atoms can connect to each other. I've just barely started work on it, though, so the checking/verification code is in other projects, not in the atomspace yet, and it's not yet generic.

It would be very useful if you could provide specific examples of the kinds of templates you want to match, or what imported data would need to look like. Just one or two examples...

noskill commented 4 years ago

@linas sure, here are some examples:

I generate data that looks like this:

(MemberLink
    (EvaluationLink
        (PredicateNode "catalysys_of")
        (ListLink
            (MoleculeNode "Uniprot:Q05940")
            (EvaluationLink
                (PredicateNode "transport_of")
                (ListLink
                    (MoleculeNode "ChEBI:18243")
                    (MoleculeNode "ChEBI:18243")))))
    (ConceptNode "PA166181140"))

I think i would like to ensure that EvaluationLinks contain only predicates from a predefined list of acceptable names like: (PredicateNode "transport") (PredicateNode "catalysys"). So "transport_of" should raise an error.

Ensure that if an EvaluationLink contains the predicate (PredicateNode "transport"), then there is a MemberLink connecting that EvaluationLink to a ConceptNode.

Also i would like to run some external code from atomese to verify MoleculeNode names, like it should be ChEBI or PubChem but not something else, but that is easy to do.
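
As a sketch of what such checks could look like when run in the importer, here is a plain-Python validator over atoms modelled as nested tuples. It is not an Atomese template mechanism: the function names are invented, the prefix list (including Uniprot, since the sample data uses it) is an assumption, and "connected by a MemberLink" is interpreted as "appears directly as the first member of a top-level MemberLink".

```python
ALLOWED_PREDICATES = {"transport", "catalysys"}           # names from the comment above
ALLOWED_PREFIXES = ("ChEBI:", "PubChem:", "Uniprot:")     # assumed prefix list


def is_node(atom):
    """Nodes are (type, name); links are (type, child, child, ...)."""
    return len(atom) == 2 and isinstance(atom[1], str)


def walk(atom):
    """Yield an atom and, depth first, everything nested inside it."""
    yield atom
    if not is_node(atom):
        for child in atom[1:]:
            yield from walk(child)


def validate(top_level_atoms):
    """Return a list of human-readable errors; an empty list means the data passed."""
    wrapped = {top[1] for top in top_level_atoms
               if top[0] == "MemberLink" and len(top) > 1}
    errors = []
    for top in top_level_atoms:
        for atom in walk(top):
            if atom[0] == "PredicateNode" and atom[1] not in ALLOWED_PREDICATES:
                errors.append(f"unknown predicate: {atom[1]}")
            elif atom[0] == "MoleculeNode" and not atom[1].startswith(ALLOWED_PREFIXES):
                errors.append(f"bad molecule name: {atom[1]}")
            elif (atom[0] == "EvaluationLink" and len(atom) > 1
                  and atom[1] == ("PredicateNode", "transport")
                  and atom not in wrapped):
                errors.append("transport EvaluationLink is not in a MemberLink")
    return errors


good = ("MemberLink",
        ("EvaluationLink", ("PredicateNode", "transport"),
         ("ListLink", ("MoleculeNode", "ChEBI:18243"), ("MoleculeNode", "ChEBI:18243"))),
        ("ConceptNode", "PA166181140"))
bad = ("EvaluationLink", ("PredicateNode", "transport_of"),
       ("ListLink", ("MoleculeNode", "glycerin")))
errors = validate([good, bad])
```

Here `errors` reports the unknown predicate and the unprefixed molecule name from `bad`, while `good` passes all three checks.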

noskill commented 4 years ago

perhaps it would help if we were able to detect whether newly imported data forms a disjoint graph
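
One way to sketch this check, outside the AtomSpace and in plain Python, is a union-find pass over the merged data. The assumptions are loud ones: each link is flattened to a tuple of the node names it touches, the function names are invented, and the third pair in the example is synthetic.

```python
def components(links):
    """Group node names into connected components; each link joins all its nodes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for nodes in links:                    # every link is a non-empty tuple of names
        first = find(nodes[0])
        for other in nodes[1:]:
            parent[find(other)] = first
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def disjoint_imports(existing_links, new_links):
    """Return components of the merged graph that contain no pre-existing node."""
    existing_nodes = {name for link in existing_links for name in link}
    return [group for group in components(existing_links + new_links)
            if not (group & existing_nodes)]


existing = [("ChEBI:28987", "Uniprot:Q9Y6L6")]
new = [("PubChem:753", "Uniprot:Q9HAZ2"),      # touches nothing already loaded
       ("ChEBI:28987", "Uniprot:Q05940")]      # synthetic pair that does connect
islands = disjoint_imports(existing, new)
```

A non-empty `islands` list is a hint, not proof, that an identifier was spelled differently from the one already in the atomspace.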

noskill commented 4 years ago

@Necr0x0Der

linas commented 4 years ago

predefined list of acceptable names

I've been thinking about a LexicalNode which works with a predefined list of acceptable names, but you would have to use that instead of PredicateNode. I don't want to filter generic PredicateNode because atom-insertion is already too slow, and I don't want to slow it down any more. Perhaps you could define a custom BioPredicateNode which inherits from both LexicalNode for checking, and PredicateNode so that URE/PLN/MOSES are happy. Note however (1) LexicalNode has not been implemented. (2) It sounds like you are dealing with "programming errors" in the importer, and these should be dealt with in the importer, and NOT at runtime in the atomspace. Doing anything at runtime (anything more than the absolute minimum) is always a bad idea.

(MoleculeNode "ChEBI:18243")

As previously suggested, you should create a ChebiMoleculeNode and a UniprotMoleculeNode, or alternately create a RegexNode (as described in opencog/atomspace#2474). The reason for this is that a fair amount of CPU time is lost in code that gets the node name, does a string compare against ChEBI: or UniProt:, and then branches on the result. You can gain a fair amount of performance by removing it. Performance measurements are given in this report: MOZI-AI/annotation-scheme#141

See here: https://github.com/MOZI-AI/annotation-scheme/blob/833965618338d885c809d9ff0c3c2bd8cca6c2bc/annotation/functions.scm#L426-L433 and here: https://github.com/MOZI-AI/annotation-scheme/blob/833965618338d885c809d9ff0c3c2bd8cca6c2bc/annotation/gene-pathway.scm#L78-L82 which was measured here: https://github.com/MOZI-AI/annotation-scheme/issues/141#issuecomment-595892834 so basically the string-compare is about 50x or 100x slower than the pattern-matcher doing the same thing...

noskill commented 4 years ago

Linas, you kind of ignored my main point, that is defining templates for data in bioatomspace, and applying them to verify all the data. There is no need for string comparison for predicates, since we can compare atoms, which is done by hashes. Nobody suffers from slow import times, databases are rarely updated.

noskill commented 4 years ago

For ChEBI and stuff i can just run evaluationlink with groundedpredicate running some checks on every MoleculeNode

linas commented 4 years ago

Nobody suffers from slow import times

You cannot make the atomspace run slow --

So although the MOZI dataset might be updated infrequently, this is not true for other users.

There is no need for string comparison

Did you read the code I linked to? It is doing string compares, lots of them.

linas commented 4 years ago

Finding bad PredicateNodes is easy -- you can either get all PredicateNodes in python (I assume there is a python call atomspace.getAtoms(Type t) that returns every atom of type t) and check whether they are on a list, or, if you prefer, the following atomese will also work:

(use-modules (opencog) (opencog exec))

(Member
   (Predicate "transport")
   (Concept "set of valid agi-bio predicate names"))

(PredicateNode " bad name transport_of")

(cog-execute!
  (Get
    (TypedVariable (Variable "$bad predicate") (Type 'PredicateNode))
    (And
      (Present (Variable "$bad predicate"))
      (Absent (Member (Variable "$bad predicate")
        (Concept "set of valid agi-bio predicate names"))))))

noskill commented 4 years ago

That's a context-independent test, and it is also not declarative the way XML or JSON schemas are.

What if some predicate is acceptable with some arguments, and not others?

    (EvaluationLink
        (PredicateNode "catalysys_of")
        (ListLink                                      
            (MoleculeNode "Uniprot:Q05940")             ; It's only MoleculeNode or SetLink of MoleculeNodes

I see why a CheBi node might be useful at inference time. Still, performance arguments are not that relevant for the process of db conversion, since source databases are updated infrequently.

linas commented 4 years ago

CheBi node

The string compares are performed during search, not during "db conversion".

db conversion

I'm not sure what "db conversion" is supposed to be. I thought you had some data import scripts, and they are buggy. The obvious answer is "fix the bugs" ...

if some predicate is acceptable with some arguments, and not others?

There are three ways to check. They all start with a type declaration. For your example, the type declaration would be:

(Signature
  (EvaluationLink
    (PredicateNode "catalysys_of")
    (ListLink
      (TypeChoice
        (Type 'MoleculeNode)
        (Set (Type 'MoleculeNode))))))

And you can use that directly, either with the pattern matcher, declaring variables to be of that type, or with the type utility functions. It works.

The scheme utility functions are in (use-modules (opencog type-utils)) and are called cog-value-is-type?, cog-type-match? and cog-type-compose. See the docs. I don't think they have python bindings; the C++ code is in TypeUtils.h.

Basically, you would call cog-value-is-type? with the SignatureLink above and whatever you want to check, and it replies yes/no.

The third way is TypedAtomLink but I doubt it works, because I don't think it was ever used. https://wiki.opencog.org/w/TypedAtomLink

linas commented 4 years ago

FYI: Everyone should switch to using @noskill fast-file-load code, if not already. On my system, the data files load in a little over a minute, instead of ten minutes, as before. To use it, just say:

(use-modules (opencog persist-file))
(load-file "ChEBI2Reactome_PE_Pathway.txt.scm")  ;;; or whatever.

Habush commented 4 years ago

FYI: Everyone should switch to using @noskill fast-file-load code, if not already.

It is fast. But one thing I noticed is it doesn't parse truthvalues on atoms. For example:

(EvaluationLink (stv 1.0 0.913)
  (PredicateNode "binding")
  (SetLink
    (GeneNode "ARF5")
    (GeneNode "BET1")))

The above fails with a Syntax Error.

linas commented 4 years ago

doesn't parse truthvalues

Yeah, I spotted a "todo" in the code and assumed that a pull req for this would be forthcoming. Maybe @noskill forgot. Since I've done 1001 other atomspace pull reqs this week, I can try adding this now.

linas commented 4 years ago

doesn't parse truthvalues

@Habush fixed in opencog/atomspace#2634

linas commented 4 years ago

@mjsduncan @leungmanhin once you have the REACTS-WITH, INTERACTS-WITH, etc. from comment https://github.com/MOZI-AI/knowledge-import/issues/10#issuecomment-617941699, what queries do you plan to have to go with it?

For example, the Jan 2020 timeframe code had something called find-output-interactors -- terribly misnamed, because it looked for triangles of three genes, that interacted with one-another. Another mis-named routine searched for pentagons, involving 2 interacting genes, expressing 2 proteins that sit in the same reactome.

So, besides the list of data markup listed in the comment up top, I think that a list of "worth-while queries" or "worthwhile inter-relationships to mine for" would also be useful to write down.

FYI, I've taken the triangle search, and generalized it to search for tetrahedrons (4 genes that all interact with one-another). One could also search for 5,6,...etc.

For the January 2020 datasets, there are 49K genes, but only 20K of them interact in pairs (there are 365K such pairs out of (20K)^2 possible, so very sparse). Of these, only 13K of the genes form up into triangles; there are 1.8M triangles (out of (20K)^3 possible). Of these, 10K genes form tetragons; there are 9.2M tetragons (out of (20K)^4 possible). So far, the 5-cliques are too CPU-expensive to compute.
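
To make the sparsity figure and the clique counts concrete, here is a small Python sketch. It is not the code that produced the numbers above: the toy edge list and the names `build_adjacency` and `count_cliques` are invented for illustration. The density line uses the rounded figures quoted here and counts unordered pairs, n(n-1)/2, rather than the looser (20K)^2 bound.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn a list of undirected (a, b) pairs into a dict of neighbour sets."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def count_cliques(adj, k):
    """Count groups of k nodes that all interact with one another.

    Brute force over every k-subset, so the cost grows roughly as n^k.
    That is fine for a toy graph and hopeless for 20K genes.
    """
    count = 0
    for group in combinations(sorted(adj), k):
        if all(v in adj[u] for u, v in combinations(group, 2)):
            count += 1
    return count


# Toy graph: A, B, C, D all interact pairwise; E hangs off D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("A", "D"), ("B", "D"), ("C", "D"), ("D", "E")]
toy_adj = build_adjacency(toy_edges)
toy_triangles = count_cliques(toy_adj, 3)
toy_tetragons = count_cliques(toy_adj, 4)

# Fraction of possible unordered gene pairs that actually interact,
# using the rounded figures from the comment (20K genes, 365K pairs).
n_genes, n_pairs = 20_000, 365_000
density = n_pairs / (n_genes * (n_genes - 1) / 2)
print(f"triangles={toy_triangles} tetragons={toy_tetragons} density={density:.4%}")
```

The n^k growth of the brute-force loop is one way to see why each step up in clique size gets so much more expensive; a practical search would extend smaller cliques through shared neighbours instead of enumerating subsets.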

So the meta question is "who thought that triangles of interacting genes is an interesting thing to look for?" and "why not tetragons?" and so on...

mjsduncan commented 4 years ago

@linas i don't know what code you're specifically referring to with triangle and pentagon interactions, but there are 2 levels of analysis that the annotation service atomspace is set up for: looking at protein interactions directly, and the higher-level abstraction of interactions between their corresponding genes. also, you should be doing your interaction searches using the STRING interaction data instead of biogrid. it has many more links and different link types, including directed/causal ones, and it is explicitly defined at the protein level, whereas biogrid is defined at the gene level. also, the annotation service is specifically designed to find information related to input genes. we are planning to expand the annotation-scheme code to allow annotation of arbitrary node types, making it a more useful tool for getting relevant sub-hypergraphs of the atomspace as a pre-processing step that reduces the search space for other processes, when this is feasible.

linas commented 4 years ago

@mjsduncan -- The MOZI/scheme-annotation code has a function called find-output-interactors -- I didn't create it, it was already there, I had to debug it back in January.

If you look at what it does, you will see that, given gene-a, it searches for gene-b and gene-c such that a interacts with b and b interacts with c and c interacts with a. This is a triangle. I hope it is obvious that this shape is the shape of a triangle - three corners-the genes, and three edges - the interactions between them.
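
The search described above can be sketched in a few lines of Python. This illustrates the pattern only and is not the MOZI scheme code; the adjacency-dict representation, the toy gene names and the function name `triangles_through` are assumptions.

```python
def triangles_through(adj, gene_a):
    """Return sorted (b, c) pairs such that a-b, b-c and c-a all interact.

    `adj` maps each gene to the set of genes it interacts with
    (interactions are assumed to be symmetric).
    """
    neighbours = adj.get(gene_a, set())
    found = set()
    for b in neighbours:
        # c must interact with b and also be an interactor of gene_a
        for c in adj.get(b, set()) & neighbours:
            if c != b and c != gene_a:
                found.add(tuple(sorted((b, c))))  # store each triangle once
    return sorted(found)


# Toy data: A interacts with B, C and D; only B and C interact with each other.
toy_adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
result = triangles_through(toy_adj, "A")
```

With this toy data the only triangle through A is A-B-C; D is an interactor of A but closes no triangle.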

So I have some questions: Who decided that searching for triangles is an interesting thing to code up and make part of the MOZI function set? Can we give it a better name, than find-output-interactors? And ... why stop at triangles? why not other shapes?

Likewise, there is some code called pathway-gene-interactors that searches for pentagon shapes. The five corners of the pentagon are two genes, two proteins, and one pathway. The five edges are: both proteins are in the pathway, each gene expresses one of the proteins, and the two genes interact. Five corners, five edges, big loop. It's a pentagon. Since the code has no documentation, this might not be obvious. It would be great if someone wrote some documentation that explained what the code does, and why someone thinks that this pentagon is an interesting shape to search for.
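
Spelled out as a Python sketch, the pentagon looks like this. It is an illustration of the shape only, not the pathway-gene-interactors code; the three input tables, the toy identifiers and the function name `pathway_gene_pentagons` are assumptions made for the example.

```python
def pathway_gene_pentagons(interactions, expresses, in_pathway):
    """Find (gene1, gene2, protein1, protein2, pathway) pentagons.

    interactions: list of (gene1, gene2) pairs that interact
    expresses:    dict gene -> set of proteins it expresses
    in_pathway:   dict protein -> set of pathways it belongs to
    """
    results = []
    for g1, g2 in interactions:                      # edge 1: genes interact
        for p1 in expresses.get(g1, set()):          # edge 2: g1 expresses p1
            for p2 in expresses.get(g2, set()):      # edge 3: g2 expresses p2
                if p1 == p2:
                    continue  # need five distinct corners
                # edges 4 and 5: both proteins sit in the same pathway
                shared = in_pathway.get(p1, set()) & in_pathway.get(p2, set())
                for pathway in shared:
                    results.append((g1, g2, p1, p2, pathway))
    return sorted(results)


# Toy data with made-up identifiers.
toy_interactions = [("G1", "G2"), ("G2", "G3")]
toy_expresses = {"G1": {"P1"}, "G2": {"P2"}, "G3": {"P3"}}
toy_in_pathway = {"P1": {"W1"}, "P2": {"W1", "W2"}, "P3": {"W3"}}

pentagons = pathway_gene_pentagons(toy_interactions, toy_expresses, toy_in_pathway)
```

Here G1 and G2 interact and their proteins share pathway W1, so they close one pentagon; G2 and G3 interact but their proteins share no pathway.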

I'm not trying to be a pain in the neck -- I'm an outsider looking in, and seeing these things. I just think they should be explained and documented, because 99.9% of all biologists are not going to be reading Atomese code to figure out what it actually does, and what you are data-mining.

Again -- I'd really like to get you to answer these questions: who thought that data-mining for triangles and pentagons was interesting, and why stop there? What are the other interesting shapes that you (or whomever) might think are worth data-mining? Or rather, why did these two get special treatment -- hard-coded into the code base, when so many other searches are not in there?

Thanks for the hint about the STRING data. I will take a look at it. Doing any of this is very time-consuming, both in personal time and in CPU-time. It's a toy project, I'm just trying to understand the "shape" of the network, in my own way; my code is at https://github.com/linas/biome-distribution.git

tanksha commented 4 years ago

@linas

As you have explained, for a given gene A, the find-output-interactors function searches for biogrid interactions between any genes that interact with A (which forms a triangle). By the way, the function is named after its functionality: we first search for interactors of a given gene A, let's say we find B, C and D, and then we search for interactions between those output genes. That is, if the output genes (B, C and D) interact with one another, those relationships should also be included in the result.

Who decided that searching for triangles is an interesting thing to code up and make part of the MOZI function set?

It's the problem we are solving which leads us to search for triangles. There might be other interesting patterns to look for, but triangles were what we needed to complete the graph, for the time being.

linas commented 4 years ago

Thanks @tanksha .. you said:

It's the problem we are solving which leads us to search for triangles

What's the problem you are solving?

tanksha commented 4 years ago

What's the problem you are solving?

Find the biogrid interactors of a given gene, and the interactions between them; that's what we call a biogrid annotation. It results in a sub-graph of the biogrid dataset whose root is the given gene.

We will also include the proteins expressed by each gene and create a PPI (protein-protein interaction), which then forms a rectangle shape. (Note that a gene can express one or many proteins. To get the specific protein used for a given interaction, we match the "BiogridId" of the gene and the protein; you might notice this in the pattern matching query.)

We also have pentagon-shaped patterns to solve the problem of finding interacting genes in the same pathway, as you can see in the pathway-gene-interactors function.

mjsduncan commented 4 years ago

@linas the gene annotation service is basically querying the bio-atomspace for a specific subgraph that contains the input genes, the details of which are specified on the service input screen. if you play around with the service (there are instructions to run it locally in a docker container) then the code will probably make more sense to you