I proposed in #88 that certain simple searches might run faster if done purely in scheme, and/or in new custom C++ code. But the extent of this is completely unknown. I have no idea what the performance difference is. Also, even if the hit is large, we don't know what fraction of the total performance is affected. So, to that end, we need performance profiles. Here's what I got:
linux perf: using recipes from http://www.brendangregg.com/perf.html, but this does not seem to provide any useful info. It slices and dices the wrong way, for us.
(use-modules (statprof)): from https://www.gnu.org/software/guile/manual/html_node/Statprof.html, but I promptly got a crash from an interrupted system call. Anyway, even avoiding that, it (once again) does not provide the kind of info that is useful for this project.
So the only solution I see right now is to manually instrument the code. This is kind-of tedious and painful, but should be quite revealing. It should be specific enough to let us know what parts of the system are gobbling up the CPU time (and even memory), so that tuning those can be made a priority. Here's how (see next post).
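For completeness, the statprof attempt above was invoked roughly like this (a sketch only; the exact call used isn't recorded here, but the thunk-based entry point is standard Guile):

```scheme
(use-modules (statprof))

;; profile a single annotation call; statprof samples the guile stack
;; periodically and prints a flat profile when the thunk returns.
(statprof
  (lambda ()
    (gene-pathway-annotation (list "MRPL27") "foobar"
                             #:pathway "reactome smpdb" #:parents 0)))
```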
So here is a small instrumentation tool I just created:
(define (accum-time name)
"
  Maintain performance profile manually.
  Example usage:
  (define actr (accum-time \"foo\"))
  (define bctr (accum-time \"bar\"))
  (actr #:enter? #t) ; start timing foo
  (bctr #:enter? #t) ; start timing bar
  (bctr #:enter? #f) ; stop timing bar
  (bctr #:enter? #t) ; start timing bar
  (bctr #:enter? #f) ; stop timing bar
  (bctr #:enter? #t) ; start timing bar
  (bctr #:enter? #f) ; stop timing bar
  (actr #:enter? #f) ; stop timing foo
  (actr #:report? #t) ; report foo
  (bctr #:report? #t) ; report bar
"
  (let ((fname name)
        (elapsed 0)
        (calls 0)
        (start 0))
    (lambda* (#:key (enter? #f) (report? #f))
      (if report?
        (if (< 0 calls)
          (format #t
            "Time: ~9f secs. calls: ~A avg: ~8,1f usec/call for ~A\n"
            (* 1.0e-9 elapsed)
            calls
            (/ (* 1.0e-3 elapsed) calls)
            fname)
          (format #t "Zero calls to ~A\n" fname))
        (if enter?
          (set! start (get-internal-real-time))
          (begin
            (set! elapsed (+ elapsed
              (- (get-internal-real-time) start)))
            (set! calls (+ calls 1))))))))
Here is how I am using it in the code:
(define smpdb-ctr (accum-time "smpdb"))
(define reactome-ctr (accum-time "reactome"))
(define find-pathway-genes-ctr (accum-time "find-pathway-genes"))
(define add-pathway-genes-ctr (accum-time "add-pathway-genes"))
(define find-go-term-ctr (accum-time "find-go-term"))
(define find-memberln-ctr (accum-time "find-memberln"))
(define add-go-info-ctr (accum-time "add-go-info"))
(define find-parent-ctr (accum-time "find-parent"))
and then, for each of these, I did this:
--- a/annotation/functions.scm
+++ b/annotation/functions.scm
@@ -28,7 +28,13 @@
)
;;Given an atom and list of namespaces finds the parents of that atom in the specified namespaces
-(define find-parent
+(define-public (find-parent a b)
+ (find-parent-ctr #:enter? #t)
+ (let ((rv (xfind-parent a b)))
+ (find-parent-ctr #:enter? #f)
+ rv))
+
+(define xfind-parent
(lambda (node namespaces)
(let (
It's kind-of tedious to add a bunch of these by hand, so if you have a better idea, do that. Maybe @rekado has some stunt that would be more elegant.
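One possible way to cut down on the hand-editing (a sketch only, assuming the wrapper is installed from within the same module as the function it wraps, since set! can't reach across module boundaries): a small macro that captures the original binding and installs the instrumented version in its place.

```scheme
;; Hypothetical helper, not part of the repo: wrap an existing function
;; `fn` so that every call is timed by the counter `ctr`.
(define-syntax-rule (instrument! fn ctr)
  (let ((orig fn))                  ; capture the original procedure
    (set! fn
      (lambda args
        (ctr #:enter? #t)           ; start the clock
        (let ((rv (apply orig args)))
          (ctr #:enter? #f)         ; stop the clock
          rv)))))

;; usage:
;; (define find-parent-ctr (accum-time "find-parent"))
;; (instrument! find-parent find-parent-ctr)
```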
Then run for a while and then say:
(smpdb-ctr #:report? #t)
(reactome-ctr #:report? #t)
(find-pathway-genes-ctr #:report? #t)
(add-pathway-genes-ctr #:report? #t)
(find-go-term-ctr #:report? #t)
(find-memberln-ctr #:report? #t)
(add-go-info-ctr #:report? #t)
(find-parent-ctr #:report? #t)
to get usage reports. If things are still running, recall that you can use the cogserver to get additional guile prompts:
(use-modules (opencog cogserver))
(start-cogserver)
and then
rlwrap telnet localhost 17001
BTW, @rekado if you are reading this, the reason we have a cogserver instead of using the default guile network server is twofold: (1) the cogserver is about 50x faster than the guile repl server; this makes a huge difference for moving around data, e.g. in robotics. (2) the guile network server is buggy and crashy. It will typically crash after an hour or two, which is completely unacceptable if the compute job has to stay up for months at a time. That's why the cogserver exists. I tried to get rid of it, but no luck. (Well, reason 3 is that it also provides a network shell for python, which python itself does not have.)
Some example results from (gene-pathway-annotation (list "MRPL27") "foobar" #:pathway "reactome smpdb" #:parents 0):
Time: 0.0373287 secs. calls: 1 avg: 37328.7 usec/call for smpdb
Time: 94.282941 secs. calls: 1 avg: 94282940.8 usec/call for reactome
Time: 9.7742674 secs. calls: 3 avg: 3258089.1 usec/call for find-pathway-genes
Time: 9.2660640 secs. calls: 267 avg: 34704.4 usec/call for add-pathway-genes
So, looks like reactome is eating CPU, but not clear where. OK, new and improved:
Time: 0.0249174 secs. calls: 1 avg: 24917.4 usec/call for smpdb
Time: 86.396381 secs. calls: 1 avg: 86396381.1 usec/call for reactome
Time: 9.1681212 secs. calls: 3 avg: 3056040.4 usec/call for find-pathway-genes
Time: 8.6186269 secs. calls: 267 avg: 32279.5 usec/call for add-pathway-genes
Zero calls to find-go-term
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 0.0075673 secs. calls: 2 avg: 3783.7 usec/call for find-pathway-member
Time: 75.875504 secs. calls: 3 avg: 25291834.8 usec/call for pathway-gene-interactors
Time: 0.0034641 secs. calls: 3 avg: 1154.7 usec/call for pathway-hierarchy
Time: 0.0538150 secs. calls: 2 avg: 26907.5 usec/call for find-protein
Zero calls to find-protein-form
Time: 1.2777262 secs. calls: 6 avg: 212954.4 usec/call for find-mol
so now it's pathway-gene-interactors .. what is it doing? It's calling generate-interactors. And what is that doing? And the answer is ... untraceable. It's written as a call/cc, so this style of tracing won't work. (Surely there is a simpler and easier way of implementing this without call/cc? Can't we just keep track of state explicitly, viz. write a simple object with the needed internal state?)
Anyway, reading the source, we see that it calls biogrid-pairs-pathway, which is equally untraceable. It seems to be something out of the json library (Update: Not the json library, but something else???). The conclusion is then that 88% of the time -- 0.878225489560726 = (/ 75.875504 86.396381) -- is spent not in the atomspace, not in any of the (annotation) guile modules, but in ~~the json library~~ this other thing.
To re-iterate the last remark ... looks like 88% of all CPU time is spent in ~~the json library~~ this other thing. I wouldn't be surprised if that is where all the memory is leaking too, because I just don't see any memory-gobbling code anywhere in the (annotation) module. It has warts and inefficiencies, but nothing quite like that.
FYI, I ran five more heavyweight genes: HSP90B1 AP1G1 PPIB NDUFAF7 RBM5, and some lightweights I didn't write down. The cumulative results are below:
Time: 258.70370 secs. calls: 10 avg: 25870369.8 usec/call for smpdb
Time: 561.16087 secs. calls: 10 avg: 56116087.5 usec/call for reactome
Time: 137.29761 secs. calls: 31 avg: 4428955.1 usec/call for find-pathway-genes
Time: 126.09734 secs. calls: 1668 avg: 75597.9 usec/call for add-pathway-genes
Zero calls to find-go-term
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 0.0789199 secs. calls: 20 avg: 3946.0 usec/call for find-pathway-member
Time: 591.78595 secs. calls: 31 avg: 19089869.3 usec/call for pathway-gene-interactors
Time: 0.0199050 secs. calls: 16 avg: 1244.1 usec/call for pathway-hierarchy
Time: 2.1588711 secs. calls: 20 avg: 107943.6 usec/call for find-protein
Zero calls to find-protein-form
Time: 88.405886 secs. calls: 62 avg: 1425901.4 usec/call for find-mol
Time: 5.6221634 secs. calls: 4943 avg: 1137.4 usec/call for find-pubmed-id
As before, it's clear that pathway-gene-interactors is eating up most of the CPU time. It appears that add-pathway-genes and find-mol also take significant CPU time; I believe that for these two, the time is all lost in the pattern-matcher.
Update: last night, I assumed that make-parameter was a part of json. This morning, I searched for make-parameter and discovered it is part of guile. So, new conclusion: it's not json that is causing trouble, it's the combination of the call/cc plus the use of make-parameter that is creating an explosive gobbling-up of RAM and CPU. That is, the call/cc is probably creating zillions of continuations, keeping them around indefinitely, eating all RAM and cycles.
Now, the obvious question becomes: can one redesign this to not use call/cc? Continuations are a rather olde-fashioned way of implementing what are called "objects" in the C++ world. It's quite easy to write objects in scheme, and you don't need goops to do that. For example, here is a simple object with three methods:
(define (counter)
  (let ((cur-val 0))
    (lambda (method)
      (if (eq? method 'incr) (set! cur-val (+ cur-val 1)))
      (if (eq? method 'decr) (set! cur-val (- cur-val 1)))
      (if (eq? method 'print) (format #t "The current value is ~A\n" cur-val))
    )))
(define my-counter (counter))
(define another-one (counter))
(my-counter 'incr)
(my-counter 'incr)
(another-one 'decr)
(my-counter 'print)
(another-one 'print)
I'm proposing that the call/cc can be replaced by something like this, thus explicitly controlling the state management in the system, and limiting the total RAM usage. The point being that here, the continuation state is very simple: it's just an int called cur-val -- it's small, it's explicitly managed, and unlike call/cc, there is no copying of any kind of large, long, complex recursive execution state.
Parameters are like global variables. They should be used sparingly.
Objects could also be implemented with Guile records, which are a little nicer than plain closures. There are only very few situations where a full call/cc is the right tool. Even when you need continuations it's better to use delimited continuations (e.g. prompts).
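For comparison, a minimal sketch of the same counter done with SRFI-9 records (illustrative only, not code from this repo):

```scheme
(use-modules (srfi srfi-9))

;; the record holds the state that the closure version kept in cur-val
(define-record-type <counter>
  (make-counter value)
  counter?
  (value counter-value set-counter-value!))

(define (counter-incr! c) (set-counter-value! c (+ (counter-value c) 1)))
(define (counter-decr! c) (set-counter-value! c (- (counter-value c) 1)))
(define (counter-print c)
  (format #t "The current value is ~A\n" (counter-value c)))

;; usage:
;; (define my-counter (make-counter 0))
;; (counter-incr! my-counter)
;; (counter-print my-counter)
```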
So yeah: what's the purpose of using call/cc here?
Generally, the code leans heavily on mutation and global state, both of which are unidiomatic in Scheme. It seems to me that most if not all cases of mutation could be removed. Parameters should probably be replaced with either an explicit configuration record / alist or other explicit arguments.
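As a sketch of that suggestion (the names below are made up for illustration, not taken from the repo): instead of reading settings from global parameters, pass an explicit configuration alist (or record) down to the functions that need it.

```scheme
;; a hypothetical explicit configuration, passed as an ordinary argument
(define default-config
  '((namespace . "biological_process molecular_function cellular_component")
    (parents . 0)))

(define (find-terms gene config)
  ;; read the setting from the argument, not from a global parameter
  (let ((ns (assq-ref config 'namespace)))
    (format #t "looking up ~A terms for gene ~A\n" ns gene)))

(find-terms "MRPL27" default-config)
```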
After some sleep and more time to think, here's a collection of performance tuning suggestions:
Use gcc version 7.4 or newer; this provides a 2x performance improvement over gcc-5.4 in the pattern matcher. The difference is just immense. In particular, Ubuntu 16.04 LTS has the old, slow version, so it's critically important to run either ubuntu-18.04 or Debian stable (or a modern version of redhat/fedora/suse, etc.)
Start guile as HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/libhugetlbfs.so.0 guile -- this can offer a 20% to 40% improvement on AtomSpace performance (in general; I have not measured the effect for this code). This also requires that hugepages are configured correctly.
Get the latest atomspace code. In particular, opencog/atomspace#2464, #2466, #2469 combined provide another 20% to 30% (based on other benchmarks; I have not measured here.) Pull req opencog/atomspace#2470 improves biogrid searches by 25%.
The performance of file-loading can be improved based on the comment made here: https://github.com/MOZI-AI/annotation-scheme/issues/93#issuecomment-571388707 - basically, it would be faster to load the atomese datasets by writing a C++ loader, and skipping guile entirely.
I've got a wild idea for changes to the pattern matcher that could help pathway-gene-interactors by some fair bit; however, the idea is complex and will require significant effort. Ping me for status.
I kind of don't want to mention this, because it's problematic. Try the patch below: it improves the wall-clock time by 20% for gene-go-annotation (at the expense of using 2 cpu cores).
--- a/annotation/gene-go.scm
+++ b/annotation/gene-go.scm
@@ -28,9 +28,12 @@
#:export (gene-go-annotation)
)
+(use-modules (ice-9 threads))
+
 (define* (gene-go-annotation gene-nodes file-name #:key (namespace "biological_process molecular_function cellular_component") (parents 0) (protein "True"))
    (let (
Why such a pitiful improvement for 2x cpu use? I'm not sure. I've fiddled with guile threading in other contexts, and it's hard to get more than 2x or 3x improvement from guile threads. There's some kind of severe lock-contention/thread-thrashing when more than 3 threads are used. I have not been able to locate the root cause of this, but I strongly suspect that it's an issue deep in guile, rather than contention in the atomspace.
Also, I believe that using set! is not manifestly thread-safe in guile, which is one more reason to rewrite the code to be tail-recursive (or to use fold or map): those are manifestly thread-safe.
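To illustrate the kind of rewrite meant here (a sketch with made-up names, not code from the repo): the set!-based accumulation mutates shared state, while the fold version builds its result purely, so there is nothing for threads to trample.

```scheme
(use-modules (srfi srfi-1))   ; for fold

;; mutation-based accumulation: not safe if `result` is shared across threads
(define result '())
(define (collect-with-set! items)
  (for-each (lambda (x) (set! result (cons (* x x) result))) items)
  result)

;; pure accumulation with fold: no shared mutable state
(define (collect-with-fold items)
  (fold (lambda (x acc) (cons (* x x) acc)) '() items))
```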
@linas I removed the call/cc usage completely, please check the PR
Generally, the code leans heavily on mutation and global state, both of which are unidiomatic in Scheme. It seems to me that most if not all cases of mutation could be removed.
Yes, it is plagued with mutation. We wrote it like that initially because A) the code is used from python as a submodule and has to maintain the state of requests, and B) this code was our first experience with Scheme, and we had an OOP background.
Removing the mutation for the most part shouldn't be an issue (I will send PRs). But it is difficult for the parser, which has to maintain some state while parsing the scheme (Atomese?) to json.
while parsing the scheme(Atomese?) to json.
Why do you need to parse anything at all? If you need to write json to a pipe or to a file, or whatever, can't you just write it directly? At any rate, none of the code that I looked at here requires json to do anything, so it's not clear why it's even spec'ed in this repo.
Regarding the call/cc removal: I just now tried the patch #101 and am measuring gene-pathway-annotation. I get the following instrumentation:
Time: 43.445660 secs. calls: 12 avg: 3620471.7 usec/call for smpdb
Time: 2534.6847 secs. calls: 13 avg: 194975746.5 usec/call for reactome
Time: 20.328536 secs. calls: 58 avg: 350492.0 usec/call for find-pathway-genes
Time: 19.555418 secs. calls: 3771 avg: 5185.7 usec/call for add-pathway-genes
Zero calls to find-go-term
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 0.1160833 secs. calls: 26 avg: 4464.7 usec/call for find-pathway-member
Time: 3546.3087 secs. calls: 57 avg: 62215941.9 usec/call for pathway-gene-interactors
Zero calls to generate-result
Time: 1870.1507 secs. calls: 81483 avg: 22951.4 usec/call for generate-interactors
Zero calls to build-interaction
Time: 0.3078500 secs. calls: 43 avg: 7159.3 usec/call for pathway-hierarchy
Time: 3.3938785 secs. calls: 25 avg: 135755.1 usec/call for find-protein
Zero calls to find-protein-form
Time: 35.932332 secs. calls: 116 avg: 309761.5 usec/call for find-mol
Time: 6.0779556 secs. calls: 27301 avg: 222.6 usec/call for find-name
Time: 11.885580 secs. calls: 11144 avg: 1066.5 usec/call for find-pubmed-id
The total CPU time, as reported by ps aux, is 5300 seconds (about 89 minutes). Of the 5300, approx 750 was spent loading the dataset, so only about 4550 seconds were spent actually processing. GC has run for 1046 seconds. Of this, there were 500 GC seconds lost in file loading, so the other 546 seconds of GC were lost during processing. This leaves (5300 - 750 - 550) = 4000 seconds spent in actual pathway annotation. The instrumentation found 3550 seconds in pathway-gene-interactors, and I'm unclear as to where the remaining 400 or 500 seconds were lost.
So pathway-gene-interactors is actually very interesting: it is a non-trivial pattern search, exactly the kind of non-trivial search for which the pattern matcher was created.
Curiously, though, of that 3550 seconds, more than half was spent in generate-interactors, and, reading through that code, it does not do anything at all fancy or complicated. So that's troubling.
Now that I understand what make-parameter does, and that biogrid-pairs-pathway is an instance of that, I would like to suggest that there might be a better, faster substitute: use (make-atom-set) from the core opencog module. I think that (make-atom-set) is (almost) a drop-in replacement for make-parameter, except that I'm guessing it's faster, because it is fine-tuned for atoms. (I'm not sure if it's an exact drop-in replacement; I wasn't aware of make-parameter before. The general, overall idea seems to be the same, though, and make-atom-set can only ever hold atoms, and nothing else.)
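For reference, make-parameter is plain Guile (SRFI-39); a minimal demo of its behavior, unrelated to any code in this repo:

```scheme
(define verbosity (make-parameter 1))

(display (verbosity)) (newline)     ; prints 1

;; parameterize rebinds the value only within its dynamic extent
(parameterize ((verbosity 3))
  (display (verbosity)) (newline))  ; prints 3

(display (verbosity)) (newline)     ; prints 1 again

;; calling the parameter with an argument sets its current value
(verbosity 5)
(display (verbosity)) (newline)     ; prints 5
```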
More analysis of gene-pathway-annotation:
I've now run this 4 times with parents=0, and each time it calmly proceeds at 100% CPU for about 2 hours, when RAM usage starts to grow, slowly at first, and then ever-faster. Examining (gc-stats) shows that much or most of this growth is on the guile side, not the AtomSpace side.
To narrow this down, I've added print statements to gene-pathway-annotation to provide a report after it finishes each gene. Most genes take zero or one seconds; a few take 30 or 60 seconds, and maybe one out of 10 takes five minutes. This is not surprising. I bet if I graphed this, I'd get a Zipf distribution. It would actually be an interesting graph to look at and contemplate, since it gives you a good idea of where the biological complexity lies, and where to focus software development effort. I'm not just saying this to intellectualize: this is an actual software development technique you can apply. Nature is Zipfian. You have to be prepared for that.
Anyway, I'm monitoring the CPU times, and the length of two lists: the result list and the pwlst list. For example, doing the reactome for gene RPS20 took over half an hour (2095 seconds)(!), and caused (length result) to grow to 17812 from 9882. As it did so, the guile heap size grew from 19MBytes to 26MBytes. Also, gc ran 7K times during that half-hour. (Prior to this point, gc had only run about 1K times.) There were 270 seconds spent in GC, or about 270/2095 = 13% of the CPU time. (The smpdb part for the same gene took 45 minutes, with a similar bump for gc.)
Now, RPS20 is only number 13 out of 681. So I haven't even gotten through 2% of the grand-total number of genes ... and it has taken 60 minutes of CPU time to get this far. So the estimated run-time is fifty hours. That's a lot. If the length of result grows linearly like this, then it will be about 1 million long by the end (except Zipfian growth isn't really linear, so I'm not sure). Based on the gc behavior so far, I'm guessing that this will also be the cause of the guile heap explosion (which is only an hour or two away, based on prior experience -- and not fifty hours away).
If I read the code in gene-pathway.scm, it seems that the only reason we need result is so that we can create one monster-sized atom: (ListLink result), https://github.com/MOZI-AI/annotation-scheme/blob/57fbdfe5d6f11af31af43489d979865ecdab8082/annotation/gene-pathway.scm#L54, which is then written to a file. Placing such a monster-sized Atom into the AtomSpace is .. possible, certainly, but is going to be slow and wasteful, given that it is only used to write things out to a file, and then ignored and forgotten. It's pointless bloat. If/when that file is ever read in again, the converse holds true: you will have one or two or a handful of pointlessly-huge atoms, which are unreferenced, ignored and forgotten, eating up RAM, but not being useful in any way.
Given that this also seems to be smashing the guile heap, I'm thinking that a solution here would be to write results to a pipe or file(*), as they are obtained, so that they don't sit in the guile heap. This alone might be enough to avoid the Increase MAXHINCR or MAX_HEAP_SECTS issue.
Footnote (*): If one wanted to get fancy, one could create a thread-safe QueueValue as discussed elsewhere, and then specialize it so that one end of the queue was a pipe or a file. So, just stuff results, as they are obtained, into the QueueValue, and bingo - they then pop out wherever they are meant to go. QueueValues have been discussed for other apps, but attaching them to a pipe/file is a new (and fun) idea.
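A minimal sketch of the simpler (non-QueueValue) version of this idea, with made-up helper names: render each per-gene result to an output port as soon as it is computed, instead of consing it onto a growing result list.

```scheme
;; hypothetical: `annotate-one` computes one gene's annotations
(define (annotate-genes-to-port gene-list annotate-one port)
  (for-each
    (lambda (gene)
      (let ((r (annotate-one gene)))
        (write r port)          ; stream it out immediately...
        (newline port)))        ; ...so it never accumulates in the heap
    gene-list))

;; usage:
;; (call-with-output-file "path-results.scm"
;;   (lambda (port)
;;     (annotate-genes-to-port gene-list annotate-one-gene port)))
```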
OK, so ... 140 minutes in, the guile heap has started to grow more rapidly; we're now on gene 27 of 681, and the guile heap is now 670MBytes .. but the length of result is only 30K, so it is not that much larger: not enough to account for the enlarging heap. Now, pwlen is 27 - so that is very modest, but it seems like the heap grows by 20MB or 50MB or 200MB every time that pwlen grows by one. So what's going on there? Another hypothesis: does this have something to do with make-parameter, again?
OK so at 166 min (cpu time, not wall-clock time) RAM usage is exploding. The guile heap is 3GBytes as I write this, and growing. The gene being processed is ERCC8, number 28 of 681. The atomspace is only 3% larger than it was at the beginning, so effectively all of the RAM explosion is the guile heap.
Six minutes later (wall-clock time), the guile heap is at 6GBytes... It's now growing by a gig or two every 5 minutes, and based on past experience, this will continue until OOM. Unclear if this is due to ERCC8, or if it's a general systemic thing. (Update: it's a systemic thing; doing ERCC8 all by itself does not cause any problems.) Completely wild and unfounded guess, but this might be due to #103 ... (Update: Nope. See #103 for details; an experimental patch does not solve the explosion.)
During the explosion (now 21GB RSS) I have 5215 seconds spent in the code (not counting GC) but instrumentation shows:
Time: 1142.1741 secs. calls: 27 avg: 42302744.5 usec/call for smpdb
Time: 1814.1401 secs. calls: 27 avg: 67190375.5 usec/call for reactome
Time: 2779.7014 secs. calls: 104 avg: 26727897.6 usec/call for pathway-gene-interactors
Time: 2223.7673 secs. calls: 75 avg: 29650230.0 usec/call for pathway-hierarchy
In the past, the time in pathway-hierarchy was minuscule, but now it's accounting for half. Maybe that is where the explosion is? Yes, it is. Before the explosion:
Time: 152.94239 secs. calls: 74 avg: 2066789.0 usec/call for pathway-hierarchy
Time: 65.234275 secs. calls: 101 avg: 645883.9 usec/call for check-pathway
During the explosion:
Time: 2202.2932 secs. calls: 75 avg: 29363909.6 usec/call for pathway-hierarchy
Time: 2570.8179 secs. calls: 104 avg: 24719402.9 usec/call for check-pathway
Nothing else is changing: only these two. One single call to pathway-hierarchy racked up more than half-an-hour, and 3 calls to check-pathway racked up almost 45 minutes, and nothing else was going on on the system. So this is now the focus.
Update: The explosion is fixed in #105. I do not understand why the original code was exploding.
After #105, I am now measuring:
Time: 0.2827939 secs. calls: 235 avg: 1203.4 usec/call for pathway-hierarchy
Time: 0.0124460 secs. calls: 327 avg: 38.1 usec/call for check-pathway
Yep, it all looks under control now. So far, 252 of 681 genes are done, wall-clock time = cpu time = 23259 seconds, and instrumentation shows:
Time: 4516.1446 secs. calls: 251 avg: 17992607.8 usec/call for smpdb
Time: 18734.084 secs. calls: 251 avg: 74637783.6 usec/call for reactome
Time: 703.77133 secs. calls: 1207 avg: 583074.8 usec/call for find-pathway-genes
Time: 676.99826 secs. calls: 70435 avg: 9611.7 usec/call for add-pathway-genes
Time: 1.7678717 secs. calls: 503 avg: 3514.7 usec/call for find-pathway-member
Time: 21444.577 secs. calls: 1206 avg: 17781572.8 usec/call for pathway-gene-interactors
Time: 55.257491 secs. calls: 887597 avg: 62.3 usec/call for generate-interactors
Time: 1.1757659 secs. calls: 860 avg: 1367.2 usec/call for pathway-hierarchy
Time: 0.1137129 secs. calls: 1200 avg: 94.8 usec/call for check-pathway
Time: 46.105715 secs. calls: 502 avg: 91844.1 usec/call for find-protein
Time: 1051.8928 secs. calls: 2413 avg: 435927.4 usec/call for find-mol
Time: 101.48543 secs. calls: 495541 avg: 204.8 usec/call for find-name
Time: 27.638528 secs. calls: 33525 avg: 824.4 usec/call for find-pubmed-id
RSS is holding steady at 3.9GB; the guile heap is at 74MBytes. It appears that pathway-gene-interactors eats 92% of the cpu time. This is mostly a well-structured BindLink, and, as mentioned elsewhere, might benefit from altering the pattern matcher to use more tricks from SAT solving, e.g. clause pruning.
For all 681 genes, I get:
Time: 11062.356 secs. calls: 681 avg: 16244281.7 usec/call for smpdb
Time: 50644.088 secs. calls: 681 avg: 74367236.4 usec/call for reactome
Time: 2523.0404 secs. calls: 3769 avg: 669419.0 usec/call for find-pathway-genes
Time: 2382.0801 secs. calls: 183670 avg: 12969.3 usec/call for add-pathway-genes
Time: 5.3277335 secs. calls: 1362 avg: 3911.7 usec/call for find-pathway-member
Time: 53795.878 secs. calls: 3769 avg: 14273249.7 usec/call for pathway-gene-interactors
Time: 100.53601 secs. calls: 2133970 avg: 47.1 usec/call for generate-interactors
Time: 1.4259363 secs. calls: 2197 avg: 649.0 usec/call for pathway-hierarchy
Time: 0.2986100 secs. calls: 3080 avg: 97.0 usec/call for check-pathway
Time: 60.727192 secs. calls: 1362 avg: 44586.8 usec/call for find-protein
Time: 5315.8542 secs. calls: 7538 avg: 705207.5 usec/call for find-mol
Time: 232.81251 secs. calls: 1269787 avg: 183.3 usec/call for find-name
Time: 32.682447 secs. calls: 37884 avg: 862.7 usec/call for find-pubmed-id
Time: 6655.7199 secs. calls: 250256 avg: 26595.6 usec/call for locate-node
Time: 51.833697 secs. calls: 165299 avg: 313.6 usec/call for add-loc
Total runtime = 61810 seconds (17+ hours). The caching tweak opencog/atomspace#2470 provides no improvement at all (a surprise...!??)
The function pathway-gene-interactors appears to be called ~~twice~~ several times for every gene. Since it accounts for ~~almost all~~ 60% of CPU time, performance can be ~~almost doubled~~ more than tripled by caching the results from the first call. See pull req #108.
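A minimal sketch of the caching idea (not the actual patch; whether equal?-based hashing is the right equality for the real arguments is an assumption, and only functions whose results don't change as the atomspace grows should be memoized):

```scheme
;; hypothetical helper: wrap a one-argument function with a result cache
;; (assumes the cached value is never #f)
(define (memoize-1 fn)
  (let ((cache (make-hash-table)))
    (lambda (arg)
      (or (hash-ref cache arg)
          (let ((val (fn arg)))
            (hash-set! cache arg val)
            val)))))

;; e.g., assuming repeated calls with the same argument return the same thing:
;; (define pathway-gene-interactors (memoize-1 pathway-gene-interactors))
```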
After #108 and after #2471 I get a run-time of 14382 seconds, below. It looks like locate-node and find-mol are now the bottlenecks; I will run again with a memoized version of locate-node, whereas find-mol just needs a major redesign, it looks to be just plain poorly designed.
(gene-pathway-annotation gene-list "path-results"
#:pathway "reactome smpdb"
#:namespace ""
#:parents 0)
Time: 5523.3938 secs. calls: 681 avg: 8110710.4 usec/call for smpdb
Time: 8711.0721 secs. calls: 681 avg: 12791588.9 usec/call for reactome
Time: 1239.0317 secs. calls: 3769 avg: 328742.8 usec/call for find-pathway-genes
Time: 1090.1401 secs. calls: 183670 avg: 5935.3 usec/call for add-pathway-genes
Time: 5.6236221 secs. calls: 1362 avg: 4128.9 usec/call for find-pathway-member
Time: 10241.141 secs. calls: 1992 avg: 5141134.9 usec/call for pathway-gene-interactors
Time: 44.340495 secs. calls: 326140 avg: 136.0 usec/call for generate-interactors
Time: 1.2863656 secs. calls: 2197 avg: 585.5 usec/call for pathway-hierarchy
Time: 0.3182439 secs. calls: 3080 avg: 103.3 usec/call for check-pathway
Time: 7.9706777 secs. calls: 1362 avg: 5852.2 usec/call for find-protein
Time: 2732.6366 secs. calls: 7538 avg: 362514.8 usec/call for find-mol
Time: 235.05631 secs. calls: 1269787 avg: 185.1 usec/call for find-name
Time: 31.868447 secs. calls: 37884 avg: 841.2 usec/call for find-pubmed-id
Time: 2536.5483 secs. calls: 250256 avg: 10135.8 usec/call for locate-node
Time: 54.564485 secs. calls: 165299 avg: 330.1 usec/call for add-loc
Patch #113 provides an additional 1.4x perf improvement; it decimates the outrageous amount of cpu time spent in locate-node, dropping it by a factor of over 100x. The latest runtime is 10554 seconds wall-clock:
Time: 2666.1765 secs. calls: 681 avg: 3915090.3 usec/call for smpdb
Time: 7748.8657 secs. calls: 681 avg: 11378657.4 usec/call for reactome
Time: 145.61418 secs. calls: 3769 avg: 38634.7 usec/call for find-pathway-genes
Time: 25.489879 secs. calls: 183670 avg: 138.8 usec/call for add-pathway-genes
Time: 3.7654062 secs. calls: 1362 avg: 2764.6 usec/call for find-pathway-member
Time: 10092.041 secs. calls: 1992 avg: 5066285.5 usec/call for pathway-gene-interactors
Time: 52.768517 secs. calls: 387813 avg: 136.1 usec/call for generate-interactors
Time: 1.0577963 secs. calls: 2197 avg: 481.5 usec/call for pathway-hierarchy
Time: 0.2858599 secs. calls: 3080 avg: 92.8 usec/call for check-pathway
Time: 6.2650159 secs. calls: 1362 avg: 4599.9 usec/call for find-protein
Time: 160.83718 secs. calls: 7538 avg: 21336.9 usec/call for find-mol
Time: 0.4193283 secs. calls: 1536 avg: 273.0 usec/call for find-name
Time: 38.692187 secs. calls: 37884 avg: 1021.3 usec/call for find-pubmed-id
Time: 19.413556 secs. calls: 10578 avg: 1835.3 usec/call for locate-node
Time: 50.028115 secs. calls: 165299 avg: 302.7 usec/call for add-loc
Prior to #2471 I am getting these results for pathway+namespaces: 213324 seconds elapsed (that's 59 hours). This is with patch #108 applied. The guile heap was stable at 150MBytes until the very end, but then blew up to 16GB during the file-write. There is something deeply pathological about the file-write. See issue #111
(gene-pathway-annotation gene-list "my-path-results"
#:pathway "reactome smpdb"
#:namespace "biological_process molecular_function cellular_component"
#:parents 0)
Time: 38572.841 secs. calls: 681 avg: 56641469.2 usec/call for smpdb
Time: 174376.03 secs. calls: 681 avg: 256058777.1 usec/call for reactome
Time: 195778.03 secs. calls: 3769 avg: 51944290.1 usec/call for find-pathway-genes
Time: 195630.63 secs. calls: 183670 avg: 1065120.2 usec/call for add-pathway-genes
Time: 6.0805688 secs. calls: 1362 avg: 4464.4 usec/call for find-pathway-member
Time: 11552.666 secs. calls: 1992 avg: 5799531.2 usec/call for pathway-gene-interactors
Time: 47.519931 secs. calls: 326140 avg: 145.7 usec/call for generate-interactors
Time: 1.6497263 secs. calls: 2197 avg: 750.9 usec/call for pathway-hierarchy
Time: 0.2986489 secs. calls: 3080 avg: 97.0 usec/call for check-pathway
Time: 61.086829 secs. calls: 1362 avg: 44850.8 usec/call for find-protein
Time: 5542.0427 secs. calls: 7538 avg: 735213.9 usec/call for find-mol
Time: 5362.4040 secs. calls: 183670 avg: 29195.9 usec/call for find-go-term
Time: 172771.66 secs. calls: 1444979 avg: 119566.9 usec/call for find-name
Time: 32.630861 secs. calls: 37884 avg: 861.3 usec/call for find-pubmed-id
Time: 5161.0680 secs. calls: 183670 avg: 28099.7 usec/call for find-memberln
Time: 2523.4792 secs. calls: 4934750 avg: 511.4 usec/call for add-go-info
Zero calls to find-parent
Time: 194719.85 secs. calls: 250256 avg: 778082.7 usec/call for locate-node
Time: 58.992121 secs. calls: 165299 avg: 356.9 usec/call for add-loc
After applying patch #113 and also #2471, I get a 13x speedup, from 2.5 days to 4.6 hours (16569 seconds, actually) with the following profile. Note that #113 memoizes locate-node and thus decimates execution time for that. I'm sure that aggressive memoization of many other pathways will improve performance by additional factors.
Time: 3854.4545 secs. calls: 681 avg: 5659992.0 usec/call for smpdb
Time: 12306.061 secs. calls: 681 avg: 18070573.9 usec/call for reactome
Time: 4497.3962 secs. calls: 3769 avg: 1193259.8 usec/call for find-pathway-genes
Time: 4361.5304 secs. calls: 183670 avg: 23746.6 usec/call for add-pathway-genes
Time: 6.7424449 secs. calls: 1362 avg: 4950.4 usec/call for find-pathway-member
Time: 11454.457 secs. calls: 1992 avg: 5750229.3 usec/call for pathway-gene-interactors
Time: 58.174262 secs. calls: 387543 avg: 150.1 usec/call for generate-interactors
Time: 1.4399045 secs. calls: 2197 avg: 655.4 usec/call for pathway-hierarchy
Time: 0.3013116 secs. calls: 3080 avg: 97.8 usec/call for check-pathway
Time: 6.9740637 secs. calls: 1362 avg: 5120.5 usec/call for find-protein
Time: 185.13425 secs. calls: 7538 avg: 24560.1 usec/call for find-mol
Time: 4315.0330 secs. calls: 183670 avg: 23493.4 usec/call for find-go-term
Time: 2.1543851 secs. calls: 1537 avg: 1401.7 usec/call for find-name
Time: 40.919641 secs. calls: 37884 avg: 1080.1 usec/call for find-pubmed-id
Time: 4281.8303 secs. calls: 183670 avg: 23312.6 usec/call for find-memberln
Time: 2596.3717 secs. calls: 4934750 avg: 526.1 usec/call for add-go-info
Zero calls to find-parent
Time: 24.414441 secs. calls: 10578 avg: 2308.0 usec/call for locate-node
Time: 57.044634 secs. calls: 165299 avg: 345.1 usec/call for add-loc
So biogrid-interaction-annotation completed without any failures. It took 26.4 wall-clock hours to run, or 30.25 CPU-hours. Of this, 7.4 hours was in guile GC. At the conclusion, the guile heap was 424MB, of which 294MB was free. At the conclusion, RSS=4.6GB and Virt=9.1GB RAM. The final atomspace contained 9M atoms, which is about double the starting size of 4.6M Atoms. The output file was 250MBytes in size (uncompressed).
Conclusions: RAM usage seems reasonable. The GC usage is a bit outrageous; I suspect it's due to results being saved up in very long lists (as witnessed by most of the heap being free, once it completes). Total CPU time seems outrageous; I don't know where it went. Will instrument and measure.
At a wall-clock time of approx 3022 seconds, the instrumented code shows this profile:
Time: 268.29849 secs. calls: 75 avg: 3577313.2 usec/call for match-gene-interactors
Time: 2945.8490 secs. calls: 75 avg: 39277986.8 usec/call for find-output-interactors
Time: 2067.2802 secs. calls: 46366 avg: 44586.1 usec/call for generate-result
Time: 8.2950412 secs. calls: 92733 avg: 89.5 usec/call for build-interaction
Time: 259.28143 secs. calls: 95765 avg: 2707.5 usec/call for find-protein-form
Time: 0.2283316 secs. calls: 256 avg: 891.9 usec/call for find-name
Time: 61.317968 secs. calls: 46367 avg: 1322.4 usec/call for find-pubmed-id
so the instrumentation is capturing almost all CPU cycles. Notable is that generate-result is eating 2/3rds of all cycles. Looking at this code reveals a structure similar to the old implementation of generate-interactors, relying on a linear search of make-parameter. This suggests that a fix similar to that of #103 will also fix things here.
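To sketch the shape of that kind of fix (I'm guessing at the details; the names are illustrative, not taken from the repo): replace a membership scan over a list held in a parameter with an O(1) hash-table lookup.

```scheme
;; before (roughly): a list in a parameter, scanned on every call
;; (define seen-pairs (make-parameter '()))
;; (if (member pair (seen-pairs)) ... )

;; after: constant-time membership test
;; (whether equal?-hashing is right for the actual keys is an assumption)
(define seen-pairs (make-hash-table))
(define (seen? pair) (hash-ref seen-pairs pair #f))
(define (mark-seen! pair) (hash-set! seen-pairs pair #t))
```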
Much of the remaining time is lost in find-output-interactors, which has the interesting-type pattern search; as mentioned elsewhere, I have ideas for fixing that.
After patching (see issue #106), performance more than doubles:
Time: 34.925137 secs. calls: 75 avg: 465668.5 usec/call for match-gene-interactors
Time: 1409.3709 secs. calls: 75 avg: 18791612.0 usec/call for find-output-interactors
Time: 330.18767 secs. calls: 45971 avg: 7182.5 usec/call for generate-result
Time: 7.3280768 secs. calls: 91943 avg: 79.7 usec/call for build-interaction
Time: 254.12882 secs. calls: 94881 avg: 2678.4 usec/call for find-protein-form
Time: 0.0568310 secs. calls: 247 avg: 230.1 usec/call for find-name
Time: 57.894680 secs. calls: 45972 avg: 1259.3 usec/call for find-pubmed-id
Wall-clock time was 1447 seconds for the first 75 genes.
To do all 681 (for protein interaction):
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Proteins"
#:parents 0)
Time: 320.42858 secs. calls: 681 avg: 470526.6 usec/call for match-gene-interactors
Time: 14377.118 secs. calls: 681 avg: 21111774.1 usec/call for find-output-interactors
Time: 4280.8901 secs. calls: 506787 avg: 8447.1 usec/call for generate-result
Time: 68.353809 secs. calls: 1013574 avg: 67.4 usec/call for build-interaction
Time: 3567.3805 secs. calls: 1023015 avg: 3487.1 usec/call for find-protein-form
Time: 0.1304556 secs. calls: 561 avg: 232.5 usec/call for find-name
Time: 582.39830 secs. calls: 506787 avg: 1149.2 usec/call for find-pubmed-id
To do all 681 (for protein interaction), after opencog/atomspace#2471, the performance more than triples (3.2x faster), to a total wall-clock time of 4095 seconds for all 681 genes, below. New bottlenecks now emerge: generate-result, which bottlenecks on find-protein-form, which takes 2946/4095 = 72% of the total. Also, GC takes 766 seconds, so GC is now a large part of the overhead.
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Proteins"
#:parents 0)
Time: 301.44206 secs. calls: 681 avg: 442646.2 usec/call for match-gene-interactors
Time: 3766.9675 secs. calls: 681 avg: 5531523.5 usec/call for find-output-interactors
Time: 3682.5601 secs. calls: 506787 avg: 7266.5 usec/call for generate-result
Time: 70.670652 secs. calls: 1013574 avg: 69.7 usec/call for build-interaction
Time: 2946.7653 secs. calls: 1023015 avg: 2880.5 usec/call for find-protein-form
Time: 0.1185459 secs. calls: 561 avg: 211.3 usec/call for find-name
Time: 603.00910 secs. calls: 506787 avg: 1189.9 usec/call for find-pubmed-id
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 5.8014628 secs. calls: 9003 avg: 644.4 usec/call for locate-node
Zero calls to add-loc
To do all 681 (for gene interaction, nparents=0) prior to pull reqs #2470 and #2471:
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Genes"
#:parents 0)
Time: 243.10967 secs. calls: 681 avg: 356989.2 usec/call for match-gene-interactors
Time: 11622.276 secs. calls: 681 avg: 17066485.2 usec/call for find-output-interactors
Time: 840.91115 secs. calls: 506787 avg: 1659.3 usec/call for generate-result
Time: 27.270739 secs. calls: 506787 avg: 53.8 usec/call for build-interaction
Zero calls to find-protein-form
Time: 11.672809 secs. calls: 47096 avg: 247.9 usec/call for find-name
Time: 589.23931 secs. calls: 506787 avg: 1162.7 usec/call for find-pubmed-id
Time: 150.11183 secs. calls: 9441 avg: 15900.0 usec/call for find-go-term
Time: 143.25006 secs. calls: 9441 avg: 15173.2 usec/call for find-memberln
Time: 99.247953 secs. calls: 165392 avg: 600.1 usec/call for add-go-info
All 681 (genes, nparents=1) took 1.5x longer than above, and the profile is different:
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Genes"
#:parents 1)
Time: 5726.3430 secs. calls: 681 avg: 8408726.8 usec/call for match-gene-interactors
Time: 11767.411 secs. calls: 681 avg: 17279605.1 usec/call for find-output-interactors
Time: 6330.9327 secs. calls: 506787 avg: 12492.3 usec/call for generate-result
Time: 28.161058 secs. calls: 506787 avg: 55.6 usec/call for build-interaction
Zero calls to find-protein-form
Time: 11.677343 secs. calls: 47096 avg: 247.9 usec/call for find-name
Time: 598.22882 secs. calls: 506787 avg: 1180.4 usec/call for find-pubmed-id
Time: 5625.8787 secs. calls: 9441 avg: 595898.6 usec/call for find-go-term
Time: 146.04930 secs. calls: 9441 avg: 15469.7 usec/call for find-memberln
Time: 247.59138 secs. calls: 406306 avg: 609.4 usec/call for add-go-info
Time: 5445.4115 secs. calls: 165392 avg: 32924.3 usec/call for find-parent
Time: 30.441913 secs. calls: 9441 avg: 3224.4 usec/call for locate-node
Zero calls to add-loc
After pull req #2471, the biogrid gene annotation is nearly 10x faster -- 1204 seconds vs 11887 seconds. Once again, the total time spent in guile GC is starting to take a large bite out of overall performance. All 681 genes is now:
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Genes"
#:parents 0)
Time: 140.83417 secs. calls: 9441 avg: 14917.3 usec/call for find-go-term
Time: 235.42098 secs. calls: 681 avg: 345698.9 usec/call for match-gene-interactors
Time: 943.35577 secs. calls: 681 avg: 1385250.8 usec/call for find-output-interactors
Time: 834.82299 secs. calls: 506787 avg: 1647.3 usec/call for generate-result
Time: 27.197969 secs. calls: 506787 avg: 53.7 usec/call for build-interaction
Zero calls to find-protein-form
Time: 11.202908 secs. calls: 47096 avg: 237.9 usec/call for find-name
Time: 600.92915 secs. calls: 506787 avg: 1185.8 usec/call for find-pubmed-id
Time: 133.93277 secs. calls: 9441 avg: 14186.3 usec/call for find-memberln
Time: 94.814262 secs. calls: 165392 avg: 573.3 usec/call for add-go-info
Zero calls to find-parent
Time: 27.621622 secs. calls: 9441 avg: 2925.7 usec/call for locate-node
After pull req #110, which caches the pubmed-id lookup, it drops to 802 seconds:
Time: 139.56440 secs. calls: 9441 avg: 14782.8 usec/call for find-go-term
Time: 215.86620 secs. calls: 681 avg: 316984.1 usec/call for match-gene-interactors
Time: 561.00654 secs. calls: 681 avg: 823798.2 usec/call for find-output-interactors
Time: 418.98079 secs. calls: 582626 avg: 719.1 usec/call for generate-result
Time: 31.595877 secs. calls: 582626 avg: 54.2 usec/call for build-interaction
Zero calls to find-protein-form
Time: 10.863595 secs. calls: 47096 avg: 230.7 usec/call for find-name
Time: 182.52820 secs. calls: 582626 avg: 313.3 usec/call for find-pubmed-id
Time: 132.88098 secs. calls: 9441 avg: 14074.9 usec/call for find-memberln
Time: 93.780047 secs. calls: 165392 avg: 567.0 usec/call for add-go-info
Zero calls to find-parent
Time: 28.674607 secs. calls: 9441 avg: 3037.2 usec/call for locate-node
Summary wrap-up: per the collection of patches below, I'm getting 6x speedups on the pathway workload (gene-pathway-annotation) and what looks like a 100x speedup on the biogrid workload (biogrid-interaction-annotation). Don't delay, apply today.
My current collection of hacks is here: https://github.com/linas/annotation-scheme/tree/hackery -- this is the code that was used to obtain all of the results reported above. You can use it to verify the reported results. You can use git diff to see how it's been changed from master, i.e. as a reference to compare against. All of the various experiments were run by loading the file called load-files.scm first, then test-harness.scm second, and selecting one of the tests.
One experiment not yet performed: replace the for-each with n-par-for-each 2 or n-par-for-each 3 (or n-par-map 2) in the following places, and see if a speedup is possible (per-location results are noted next to each link; a sketch of the change is below):
https://github.com/MOZI-AI/annotation-scheme/blob/7a79385bcc59c4bae13c0843d4cb9710f2a5a3cd/annotation/gene-pathway.scm#L38 -- gene-go annotation is 1.55x faster for 2x CPU usage.
https://github.com/MOZI-AI/annotation-scheme/blob/7a79385bcc59c4bae13c0843d4cb9710f2a5a3cd/annotation/gene-go.scm#L33 -- gene-go annotation is 1.2x faster for 2x CPU usage. Booo.
https://github.com/MOZI-AI/annotation-scheme/blob/7a79385bcc59c4bae13c0843d4cb9710f2a5a3cd/annotation/biogrid.scm#L34 -- biogrid-protein was 1.7x faster for 2x cpu usage.
As mentioned before, a speedup might not happen: guile parallelizes very poorly, I don't know why, and it requires either Ludovic or Andy Wingo or someone of that class to figure it out.
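A minimal sketch of that change, assuming the standard (ice-9 threads) API (n-par-for-each takes the thread count as its first argument and is otherwise a drop-in for for-each); process-gene here is a stand-in for whatever the loop body does at those call sites:

```scheme
(use-modules (ice-9 threads))

(define (process-gene gene)
  ;; stand-in for the real per-gene work
  (format #t "annotating ~A\n" gene))

(define gene-nodes (list "MRPL27" "HSP90B1" "PPIB"))

;; before:
(for-each process-gene gene-nodes)

;; after, limited to 2 worker threads:
(n-par-for-each 2 process-gene gene-nodes)
```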
Closing; all suggested changes have been made.
OK, this is not a bug report, but rather a status update (and a request that someone carry out the work mentioned below).
I got the "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS" error when running gene-pathway-annotation. I am currently retrying with a version of bdwgc with ./configure --enable-large-config set, to see if that helps. "Too many heap sections" is due to severe memory fragmentation. In particular, the error showed up while guile was using "only" 6GB RAM, of which probably 98% is in the atomspace, and not guile itself. (Right now, on a re-run, the process is using 4.1GB RAM, and guile itself has only 26MBytes in the heap, which seems modest enough to me.) So RAM consumption per se does not seem to be an issue, neither for the atomspace nor for guile.
Correction: just a few minutes ago, guile went into GC overdrive. It's now trying to GC every 10 seconds or so, and it's using almost all CPU cores to do the GC. This happened after about 5 hours runtime, when everything was steady and stable. top/ps aux shows RAM usage suddenly exploded to 10GB RSS and 19GB virt. Currently (gc-stats) shows 4GB in the guile heap, and 0.5GB of the guile heap free. No clue what/why/where in the code would cause such a monstrous demand on the heap. Update: this is now fixed in #105. I still don't understand the root-cause of the explosion, but the fix does fix it.
~~I'm fairly certain that the excessive usage of deeply nested GroundedSchemaNodes is the root cause of the memory fragmentation. (Update: now I'm not so sure)~~ I am now certain that this is not the issue, at all. ~~I'm going to ponder if there is some really clever way to avoid this. However,~~ It might be worthwhile to pre-emptively redesign the code to be flatter, and less recursive. That means replacing the BindLink/GroundedSchema combination with GetLink and srfi-1 map. Just use GetLink to get lists of things, and use map to apply whatever processing is needed to the results. ~~I'm guessing this will solve this problem, and maybe even improve runtime. Not sure.~~ This is an interesting alternative design to explore, and it might (or might not) make things smaller, faster. I'm not sure.
Here's one problem with GroundedSchema: every time that it is called, the guile subsystem has to be re-entered again, and there is some non-trivial number of microseconds (maybe 50?) needed to do this. It also creates a half-dozen C stack frames, and so you risk over-flowing the C stack, if not careful. Many years ago, the atomspace code was tuned to minimize both of these effects, and so maybe they are not as severe as I remember them being, but still -- it's a potential trouble area. Or not. Maybe all this is OK as is. See also next comment.
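To make the suggested restructuring concrete, here is a very rough sketch using toy data; the atom names and the processing step are made up for illustration and are not taken from the annotation code:

```scheme
(use-modules (opencog) (opencog exec))

;; toy data: two "genes" in a pathway
(Member (Concept "gene-A") (Concept "some-pathway"))
(Member (Concept "gene-B") (Concept "some-pathway"))

;; Instead of a BindLink whose rewrite term calls back into guile through a
;; (GroundedSchema "scm: ...") for every match, first fetch the matches with
;; a plain GetLink...
(define matches
  (cog-outgoing-set
    (cog-execute!
      (Get (Variable "$g")
           (Member (Variable "$g") (Concept "some-pathway"))))))

;; ...and then post-process them with ordinary scheme map.
(define (process-gene g) (format #t "found ~A\n" g))
(map process-gene matches)
```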