Avoid the use of ListLink with many outgoing nodes

tanksha commented 4 years ago

@linas @Habush

The scheme result of the annotation-service is supposed to be used for Pattern miner and PLN works. As the bio-atomspace is too big to work with, we wanted to have a smaller version by annotating the ~800 genes from the mosses model. The problem with the annotation result is it has additional ListLink's ( which we used For example to return >1 hypergraphs on certain condition

check here

The ListLink causes error (when the scheme result is used as a dataset for the Pattern miner code here https://github.com/ngeiswei/reasoning-bio-as-xp)

a ListLink is considered not not have many outgoing nodes. What should we do to get rid of these ListLinks?

linas commented 4 years ago

bio-atomspace is too big to work with

? What does this mean? in what way is it "too big"? Does it not fit in RAM? Takes too long to load? PLN is too slow? Pattern miner is too slow?

The problem with the annotation result is it has additional ListLink's (check here)

I did not design this system, and I don't understand the data flow. If I "check here", I can clearly see that the ListLink is being used to wrap together a collection of related results. Why should things be wrapped together? I don't know -- I have to assume some other system needs them wrapped up like this. Which system needs them wrapped up is not documented. What it does with this wrapped-up data is not documented.

I mean, you could change generate-result to return an ordinary list, but I don't know what that might break or not break.

What should we do to get rid of these ListLinks?

I will create a pull request right now to make it easier to get rid of the ListLink, without actually getting rid of it. You could even turn them on and off with a simple flag.

It's important to write up at least a few paragraph define who the users of annotate-scheme are, and what data formats they expect, and why.

tanksha commented 4 years ago

? What does this mean? in what way is it "too big"? Does it not fit in RAM? Takes too long to load? PLN is too slow? Pattern miner is too slow?

I should say they are interested in experimenting with the ~800 genes only for now.

I can clearly see that the ListLink is being used to wrap together a collection of related results. Why should things be wrapped together? I don't know -- I have to assume some other system needs them wrapped up like this.

Two main reasons that we used ListLink

A function in GroundedSchemaNode calls, should return an opencog object (which makes us use ListLink instead of scheme list)

But as of now, we only have few GroundedSchemaNode so, it will be easier to get rid of the ListLink I think.

The result of annotate-genes function is wrapped with ListLink because we were having issue loading back the result into atomspace (as it is was a scheme list we were using)

It's important to write up at least a few paragraph define who the users of annotate-scheme are, and what data formats they expect, and why.

Mainly its @mjsduncan , Nil, Manhin ... are working to find interesting patterns from the ~800 genes annotation result using PLN and pattern miner. Recently Nil wrote this code here https://github.com/ngeiswei/reasoning-bio-as-xp to do experimenting.

The data format they expect is atomese but there should not be a giant ListLink with many outgoing nodes, which require us to get rid of the ListLinks we used to wrap many EvaluationLinks in one.

linas commented 4 years ago

I will create a pull request right now

See #165 -- this moves the ListLink from generate-result to the users of generate-result, which should make it easier to remove, if desired. There are many many other uses of ListLink all through that code, and until there is some general understanding of why these are needed, or what they do, they cannot be removed.

There are two or three different/related solutions that are possible.

Instead of using nested (ListLink (ListLink (ListLink ...))) structures to return data, you could create JSON structures with labelled fields holding the annotated data.
Instead of using nested (ListLinks...), use scheme records. My understanding is that they look JSON-like, in being structured data records (kind-of-like C/C++ structs or classes, or like database records).
Store annotation results as Atomese Values on specific Atoms. This has several advantages over the above two. One advantage is that this allows memoization: either a given value is there, and you can immediately use it, or it is not there, and so it needs to be computed. Like JSON, values are named fields, so you can access fields by name. Since Values live with the Atoms, they can be saved/restored: you can have memoization over long periods of time.

All three approaches seem to be equally complicated. My personal preference would be to use Values, but I won't complain (very much) if you don't use them.

linas commented 4 years ago

GroundedSchemaNode

There are only five of these left. They could be eliminated. #166 and #167

Anyway, this is not the reason why ListLinks are all over the place.

ngeiswei commented 4 years ago

@linas to be clear the arity in question was in order of tenth of thousands. Replacing the atomese List by scheme list changed nothing, meaning it is IMO a limitation with guile rather the atomspace. Nevertheless these large list links can and should be avoided as it's been agreed.

linas commented 4 years ago

@ngeiswei OK, so here is what I see: There are many places where a half-dozen searches are performed, to find gene interaction and protein expression and cellular location and what-not, and the result of these searches are a half-dozen "things" wrapped in a ListLink. I have no clue why those things are being wrapped in a ListLink; I assume some down-stream module needs them in these bite-size chunks. I was proposing that the structure of these bite-sized chunks be redesigned to be more meaningful.

Most things that I see wrapped in a ListLink seem to contain only half-a-dozen things. Now, maybe one of those "things" is actually a list of thousands of other "things", but that is non-obvious from the code.

It is also possible that something, somewhere else, is wrapping the entire collection of these bite-sized chunks in one giant ListLink. If so, it should be hunted down and "fixed". But again, without knowing why it was wrapped in the first place, I don't know how to "fix" it.

linas commented 4 years ago

OK, well, so I looked. @ngeiswei there is this spot right here:

https://github.com/MOZI-AI/annotation-scheme/blob/26243ad7e3350647d91f7ab57252e90eb5de617c/annotation/biogrid.scm#L60-L61

The two ListLink's there wrap up lists with tens-of-thousands of entries in them

linas commented 4 years ago

Also here: https://github.com/MOZI-AI/annotation-scheme/blob/26243ad7e3350647d91f7ab57252e90eb5de617c/annotation/gene-go.scm#L44

and here: https://github.com/MOZI-AI/annotation-scheme/blob/26243ad7e3350647d91f7ab57252e90eb5de617c/annotation/rna.scm#L50

and here: https://github.com/MOZI-AI/annotation-scheme/blob/26243ad7e3350647d91f7ab57252e90eb5de617c/annotation/gene-pathway.scm#L68-L69

and here: https://github.com/MOZI-AI/annotation-scheme/blob/26243ad7e3350647d91f7ab57252e90eb5de617c/annotation/main.scm#L75

and I think that's it. All of these are immediately followed by a call to write-to-file So this is something about how file-data is represented/marked up.

I'm trying to think about this problem generically/abstractly, right now, like "what would SQL do" or "what would other graph db's do" or "what would javascript do" or "what would R do". if it was in this situation. I'm not sure.

I mean, the old rough-n-ready solution would be to just say:

(for-each
    (lambda (thing)
         (write-to-file 
              (MemberLink thing (Concept "biogrid-interaction-annotation"))))
   list-of-things)

so that each "thing" is tagged. Now you get 10K MemberLinks .... I don't know if that is "better" or not...

ngeiswei commented 4 years ago

Now you get 10K MemberLinks

Yeah, that's also what I recommended.

Habush commented 4 years ago

I have no clue why those things are being wrapped in a ListLink;

The outer ListLinks (like the one you pointed above) are used for identifying from which annotation database (in terms of the code function, i.e whether gene-go, biogrid..etc) the results are coming from while parsing them to JSON.

linas commented 4 years ago

@Habush what is this json format you speak of? Is it documented somewhere? Atomese was designed to look "json-like" but it's possible the design is not really good enough, and at any rate, very few active users think of it as being json-like, (with you being almost the sole exception).

(Kind-of unrelated to this, but maybe we should have a generic atomese->json and json->atomese export/import functions.... see issue #179 for proposal)

tanksha commented 4 years ago

So, Based on Nil's recommendation, the result of each annotation will have a MemberLink to the conceptNode (which is the name of the annotation) for example the following biogrid results will each have a MemberLink to the conceptNode (ConceptNode "biogrid_interaction_annotation") ?

   (MemberLink
      (EvaluationLink
         (PredicateNode "has_pubmedID")
         (ListLink
            (EvaluationLink
               (PredicateNode "interacts_with")
               (ListLink
                  (GeneNode "MAP2K4")
                  (GeneNode "MAP3K7")
               )
            )
            (ConceptNode "pubmed:12429732")
         )
      )
   (ConceptNode "biogrid_interaction_annotation")   
   )

   (MemberLink
      (EvaluationLink
         (PredicateNode "has_pubmedID")
         (ListLink
            (EvaluationLink
               (PredicateNode "interacts_with")
               (ListLink
                  (GeneNode "MAP2K4")
                  (GeneNode "MAP3K7")
               )
            )
            (ConceptNode "pubmed:12556533")
         )
      )
   (ConceptNode "biogrid_interaction_annotation")
   )
.....

Habush commented 4 years ago

@Habush what is this json format you speak of? Is it documented somewhere?

I was explaining the reason why the outer ListLinks ((ListLink (Concept "biogrid-interaction) ....)) are being used right now. It is to help the parser group nodes and edges. And by group I mean simply add a group field to JSON objects.

But like @ngeiswei suggested and @tanksha demonstrated above, this could be replaced with MemberLinks. Although as you said, I am also unsure how having thousands of MemberLinks would fare better than huge ListLinks in terms of performance.

(Kind-of unrelated to this, but maybe we should have a generic atomese->json and json-atomese export/import functions....)

I agree. Atomese needs to be exported into popular data formats for interoperability with other systems. We can start with JSON and then add XML, YAML as we move forward.

Habush commented 4 years ago

BTW @linas this, in a way relates, to what you suggested a while ago: to decouple the annotation and the parser code and the develop the annotation module into its own separate system.

linas commented 4 years ago

having thousands of MemberLinks would fare better than huge ListLinks in terms of performance.

My NLP databases have dozens of millions of atoms in them. A few thousand is a drop in the bucket. Your databases ... I'm not sure, but I'm guessing that your default datasets have around a million atoms (I will measure this shortly) so a few thousand more doesn't affect anything.

Large ListLinks do not hurt the atomspace, and it might be slightly more efficient to have one large ListLink. Buit it is clear that guile is allergic to large lists, and seems to choke on them. As-is the current annotation code spends almost half it's CPU time in guile GC which is crazy-huge, and I don't know why its so big. So the MemberLink proposal is good/better, certainly better for Nil.

tanksha commented 4 years ago

@Habush I think the existing parser code can be modified to get group information from the MemberLink (as it was getting from the ListLink previously)

linas commented 4 years ago

The MOZI database I've been loading, from late Dec 2019 has a total of 6.7 million atoms in it -- 6757334 to be precise. The breakdown is:

scheme@(guile-user)> (cog-report-counts)
$2 = ((ConceptNode . 454779) (PredicateNode . 12) 
(ListLink . 1925554) (MemberLink . 1850528) 
(AndLink . 98788) (EvaluationLink . 1887530) 
(InheritanceLink . 122184) (GeneNode . 49050) 
(MoleculeNode . 368909))

so -- half-a-million ConceptNodes, almost 2 million MemberLinks, almost 2 million EvaluationLinks, 400K proteins, 50K genes ... this is using up 3.8GBytes RSS on my system, 4.3 GBytes virtual, so that works out to about 600 bytes/Atom for the MOZI dataset. This will fit in a mid/hi-range laptop from yesteryear, and even a 10-year-old desktop. Or even a modern cellphone or maybe a high-end raspberry-pi, just barely (You'd need 8GB for room for other stuff)

(In my NLP datasets, its more like 1.5KBytes/Atom, but those have a different kind of structure, including more statistics data on each atom.)

So ... 20K more MemberLinks won't change much at all. That would be 1% of all MemberLinks, and 0.3% of all links.

Habush commented 4 years ago

@tanksha is this a still issue? Please see the linked issues

linas commented 4 years ago

pull req #165 that ameliorates this is still unmerged ...

MOZI-AI / annotation-scheme

Avoid the use of ListLink with many outgoing nodes #164