MOZI-AI / annotation-scheme

Human Gene annotation service backend
GNU General Public License v3.0
3 stars 4 forks source link

Add covid filter for biogrid #204

Closed Habush closed 4 years ago

Habush commented 4 years ago

This PR adds a filter that specifies whether to include COVID-19 genes in the biogrid annotation. @linas can you please check if my usage of NotLink in this context is correct?

linas commented 4 years ago

usage of NotLink in this context is correct?

It isn't; you want AbsentLink instead. A short explanation: GetLink has the form:

    Get
          <variable decls>
          <clauses>

The clauses are a list of things that are meant to evaluate to true/false. The single most important clause is PresentLink which says "everything wrapped by PresentLink must be present in the atomspace". It is so important that using it is optional! Almost everything you've ever written for MOZI has used only these implicit PresentLinks. If you want to make sure something isn't there, use AbsentLink. I think you can say (NotLink (PresentLink xyz)) and it will do the same thing as (Absent xyz)

If instead, you say (Not xyz) it will assume that xyz is evaluatable, (generates a truth value) and it will take that truth value, invert it, and make a decision based off of that. It does NOT automatically wrap xyz with a PresentLink! Typically, you would use Not with grounded predicate nodes, or other things you want to compute (greater-than, equal, etc.)

One last note: you did not need to use the AndLink. If you just list two or more clauses after the variable decls, they are implicitly anded together -- that is, they must all be true -- they must all be "satisfied" for the pattern to match.

linas commented 4 years ago

Oh, and after looking at your code, more carefully, there is another, completely unrelated bug in it. You wrote

                  (Not 
                           (Evaluation (Predicate "from_organism")
                              (List 
                                 gene
                                 (ConceptNode "TaxonomyID:2697049")
                              )))

This has no variables in it!!! Thus it is a "constant clause", and the pattern matcher ignores it. (because it is always the same, it doesn't specify a search, and thus can be ripped out of the list of clauses.) But even worse -- you are corrupting the dataset -- for every gen passed in, you are creating (placing in the atomspace)

                           (Evaluation (Predicate "from_organism")
                              (List 
                                 gene
                                 (ConceptNode "TaxonomyID:2697049")
                              ))

and so every gene will be from this taxonomy. Whoops! I think you wanted:

          (Absent
                           (Evaluation (Predicate "from_organism")
                              (List 
                                 (Variable "$a")
                                 (ConceptNode "TaxonomyID:2697049")
                              )))
linas commented 4 years ago

Finally, instead of passing covid as a true-false, instead write:

(define-public (find-output-interactors gene do-protein namespace parents 
                                   coding non-coding not-organism)
    ...
    (Get
     ...
             (Absent
                           (Evaluation (Predicate "from_organism")
                              (List 
                                 (Variable "$a")
                                 not-organism
                              )))

Multiple Absent links are allowed, so you could even allow a list of absent-organisms:

(define-public (find-output-interactors gene do-protein namespace parents 
                                 coding non-coding list-of-not-organisms)
   ...
    (Get
    ...
       (map (lambda (one-org)
             (Absent
                           (Evaluation (Predicate "from_organism")
                              (List 
                                 (Variable "$a")
                                 one-org
                              ))))
              list-of-not-organisms)

and so if you pass in the empty list, there aren't any AbsentLinks. (because map of an empty list is empty). Thus, you avoid the torturous cascade of (if statements.

Habush commented 4 years ago

One last note: you did not need to use the AndLink. If you just list two or more clauses after the variable decls, they are implicitly anded together -- that is, they must all be true -- they must all be "satisfied" for the pattern to match.

This didn't work for me for the clauses in find-output-interactors. The following PM function is giving me an error.

(run-query (Get
            (VariableList
               (TypedVariable (Variable "$a") (Type 'GeneNode))
               (TypedVariable (Variable "$b") (Type 'GeneNode)))
               (Evaluation (Predicate "interacts_with")
                  (SetLink gene (Variable "$a")))

               (Evaluation (Predicate "interacts_with")
                  (SetLink (Variable "$a") (Variable "$b")))

               (Evaluation (Predicate "interacts_with")
                  (SetLink gene (Variable "$b")))
               (map (lambda (org)
                     (Absent 
                        (Evaluation (Predicate "from_organism")
                           (List 
                              (Variable "$a")
                              (ConceptNode (string-append "TaxonomyID:" org))
                           )
                        )
                     )
                  ) orgs)
            ))

The error:

(C++-EXCEPTION "cog-new-link" "Expecting (optional) variable decls and a body; got (GetLink\n (VariableList\n (TypedVariableLink\n (VariableNode \"$a\") ; [6bd8475ba49b36bd][1]\n (TypeNode \"GeneNode\") ; [72afdbafe12c5cff][1]\n ) ; [da13d6fe6ebcfbcc][1]\n (TypedVariableLink\n (VariableNode \"$b\") ; [6968a478572d06e7][1]\n (TypeNode \"GeneNode\") ; [72afdbafe12c5cff][1]\n ) ; [aaab32f9debad47c][1]\n ) ; [f25b1ddc676be512][1]\n (EvaluationLink\n (PredicateNode \"interacts_with\") ; [3c90bc062a581882][1]\n (SetLink\n (GeneNode \"IGF1\") ; [37b88ff182564bf8][1]\n (VariableNode \"$a\") ; [6bd8475ba49b36bd][1]\n ) ; [931eabcc0da9970d][1]\n ) ; [8cef2465d2454afe][1]\n (EvaluationLink\n (PredicateNode \"interacts_with\") ; [3c90bc062a581882][1]\n (SetLink\n (VariableNode \"$b\") ; [6968a478572d06e7][1]\n (VariableNode \"$a\") ; [6bd8475ba49b36bd][1]\n ) ; [9772dd141fcd79e1][1]\n ) ; [c416e4709baaa262][1]\n (EvaluationLink\n (PredicateNode \"interacts_with\") ; [3c90bc062a581882][1]\n (SetLink\n (GeneNode \"IGF1\") ; [37b88ff182564bf8][1]\n (VariableNode \"$b\") ; [6968a478572d06e7][1]\n ) ; [dd8135591f06e4e8][1]\n ) ; [e45b9a76e3cd312b][1]\n) ; [bb4339611ea61d7b][-1] (/tmp/atomspace-master/opencog/atoms/pattern/PatternLink.cc:154)\nFunction args:\n((VariableList\n (TypedVariableLink\n (VariableNode \"$a\")\n (TypeNode \"GeneNode\")\n )\n (TypedVariableLink\n (VariableNode \"$b\")\n (TypeNode \"GeneNode\")\n )\n)\n (EvaluationLink\n (PredicateNode \"interacts_with\")\n (SetLink\n (GeneNode \"IGF1\")\n (VariableNode \"$a\")\n )\n)\n (EvaluationLink\n (PredicateNode \"interacts_with\")\n (SetLink\n (VariableNode \"$b\")\n (VariableNode \"$a\")\n )\n)\n (EvaluationLink\n (PredicateNode \"interacts_with\")\n (SetLink\n (GeneNode \"IGF1\")\n (VariableNode \"$b\")\n )\n)\n ())")

Note: orgs is a list of Taxonomy IDs (which are strings).

linas commented 4 years ago

This didn't work for me

Huh. That's odd. It should have, but maybe that's only for BindLink. I will double-check and fix if needed. In the meanwhile, there is nothing wrong with using the AndLink.

Minor notes: you may want to rename orgs to exclude-orgs and taxonomies to exclude-taxonoimies this makes the code clearer.

You may want to explicitly use PresentLink from now on. Its not really needed ... but it could make the patterns slightly easier to read.

FYI: you should know that there exists something called AlwaysLink which is like PresentLink, except that it requires the clause to be present in each and every result: it must "always" be present. If it's not, the search will fail (no results will be returned). If this sounds interesting but confusing, take a look at the atomspace examples directory, at the always.scm example.

linas commented 4 years ago

This didn't work for me

Huh. That's odd.

OK, I was wrong/mis-remembering things. What I was remembering incorrectly was this: you can write BindLinks in the form:

BindLink
    <var-decls>
    <body-to-match>
    <rewrite-1>
    ...
    <rewrite-n>

That is, you can trigger multiple rewrite rules in one pass. You don't have to wrap them in a ListLink or anything like that.

linas commented 4 years ago

... and one last minor remark: if you have only one variable, you do not need to put it in a VariableList. Again, this can make things slightly easier to read, in a few cases.

Habush commented 4 years ago

Okay. I will update the code with your suggestions tomorrow.

Habush commented 4 years ago

@linas can you please check?