I proposed in #88 that certain simple searches might run faster if done purely in scheme, and/or in new custom C++ code. But the extent of this is completely unknown. I have no idea what the performance difference is. Also, even if the hit is large, we don't know what fraction of the total performance is affected. So, to that end, we need performance profiles. Here's what I got:
linux perf: using recipes from http://www.brendangregg.com/perf.html, but this does not seem to provide any useful info. It slices and dices the wrong way, for us.
(use-modules (statprof)): from https://www.gnu.org/software/guile/manual/html_node/Statprof.html, but I promptly got a crash from an interrupted system call. Anyway, even avoiding that, it (once again) does not provide the kind of info that is useful for this project.
So the only solution I see right now is to manually instrument the code. This is kind-of tedious and painful, but should be quite revealing. It should be specific enough to let us know what parts of the system are gobbling up the CPU time (and even memory), so that tuning those can be made a priority. Here's how (see next post).
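For completeness, the statprof attempt above was invoked roughly like this (a sketch only; the exact call used isn't recorded here, but the thunk-based entry point is standard Guile):

```scheme
(use-modules (statprof))

;; profile a single annotation call; statprof samples the guile stack
;; periodically and prints a flat profile when the thunk returns.
(statprof
  (lambda ()
    (gene-pathway-annotation (list "MRPL27") "foobar"
                             #:pathway "reactome smpdb" #:parents 0)))
```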
So here is a small instrumentation tool I just created:
(define (accum-time name)
"
  Maintain performance profile manually.
  Example usage:
  (define actr (accum-time \"foo\"))
  (define bctr (accum-time \"bar\"))
  (actr #:enter? #t) ; start timing foo
  (bctr #:enter? #t) ; start timing bar
  (bctr #:enter? #f) ; stop timing bar
  (bctr #:enter? #t) ; start timing bar
  (bctr #:enter? #f) ; stop timing bar
  (bctr #:enter? #t) ; start timing bar
  (bctr #:enter? #f) ; stop timing bar
  (actr #:enter? #f) ; stop timing foo
  (actr #:report? #t) ; report foo
  (bctr #:report? #t) ; report bar
"
  (let ((fname name)
        (elapsed 0)
        (calls 0)
        (start 0))
    (lambda* (#:key (enter? #f) (report? #f))
      (if report?
        (if (< 0 calls)
          (format #t
            "Time: ~9f secs. calls: ~A avg: ~8,1f usec/call for ~A\n"
            (* 1.0e-9 elapsed)
            calls
            (/ (* 1.0e-3 elapsed) calls)
            fname)
          (format #t "Zero calls to ~A\n" fname))
        (if enter?
          (set! start (get-internal-real-time))
          (begin
            (set! elapsed (+ elapsed
              (- (get-internal-real-time) start)))
            (set! calls (+ calls 1))))))))
Here is how I am using it in the code:
(define smpdb-ctr (accum-time "smpdb"))
(define reactome-ctr (accum-time "reactome"))
(define find-pathway-genes-ctr (accum-time "find-pathway-genes"))
(define add-pathway-genes-ctr (accum-time "add-pathway-genes"))
(define find-go-term-ctr (accum-time "find-go-term"))
(define find-memberln-ctr (accum-time "find-memberln"))
(define add-go-info-ctr (accum-time "add-go-info"))
(define find-parent-ctr (accum-time "find-parent"))
and then, for each of these, I did this:
--- a/annotation/functions.scm
+++ b/annotation/functions.scm
@@ -28,7 +28,13 @@
)
;;Given an atom and list of namespaces finds the parents of that atom in the specified namespaces
-(define find-parent
+(define-public (find-parent a b)
+ (find-parent-ctr #:enter? #t)
+ (let ((rv (xfind-parent a b)))
+ (find-parent-ctr #:enter? #f)
+ rv))
+
+(define xfind-parent
(lambda (node namespaces)
(let (
It's kind-of tedious to add a bunch of these by hand, so if you have a better idea, do that. Maybe @rekado has some stunt that would be more elegant.
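One possible way to cut down on the hand-editing (a sketch only, assuming the wrapper is installed from within the same module as the function it wraps, since set! can't reach across module boundaries): a small macro that captures the original binding and installs the instrumented version in its place.

```scheme
;; Hypothetical helper, not part of the repo: wrap an existing function
;; `fn` so that every call is timed by the counter `ctr`.
(define-syntax-rule (instrument! fn ctr)
  (let ((orig fn))                  ; capture the original procedure
    (set! fn
      (lambda args
        (ctr #:enter? #t)           ; start the clock
        (let ((rv (apply orig args)))
          (ctr #:enter? #f)         ; stop the clock
          rv)))))

;; usage:
;; (define find-parent-ctr (accum-time "find-parent"))
;; (instrument! find-parent find-parent-ctr)
```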
Then run for a while and then say:
(smpdb-ctr #:report? #t)
(reactome-ctr #:report? #t)
(find-pathway-genes-ctr #:report? #t)
(add-pathway-genes-ctr #:report? #t)
(find-go-term-ctr #:report? #t)
(find-memberln-ctr #:report? #t)
(add-go-info-ctr #:report? #t)
(find-parent-ctr #:report? #t)
to get usage reports. If things are still running, recall that you can use the cogserver to get additional guile prompts:
(use-modules (opencog cogserver))
(start-cogserver)
and then
rlwrap telnet localhost 17001
BTW, @rekado if you are reading this, the reason we have a cogserver instead of using the default guile network server is twofold: (1) the cogserver is about 50x faster than the guile repl server; this makes a huge difference for moving around data, e.g. in robotics. (2) the guile network server is buggy and crashy. It will typically crash after an hour or two, which is completely unacceptable if the compute job has to stay up for months at a time. That's why the cogserver exists. I tried to get rid of it, but no luck. (Well, reason 3 is that it also provides a network shell for python, which python itself does not have.)
Some example results from (gene-pathway-annotation (list "MRPL27") "foobar" #:pathway "reactome smpdb" #:parents 0):
Time: 0.0373287 secs. calls: 1 avg: 37328.7 usec/call for smpdb
Time: 94.282941 secs. calls: 1 avg: 94282940.8 usec/call for reactome
Time: 9.7742674 secs. calls: 3 avg: 3258089.1 usec/call for find-pathway-genes
Time: 9.2660640 secs. calls: 267 avg: 34704.4 usec/call for add-pathway-genes
So, looks like reactome is eating CPU, but not clear where. OK, new and improved:
Time: 0.0249174 secs. calls: 1 avg: 24917.4 usec/call for smpdb
Time: 86.396381 secs. calls: 1 avg: 86396381.1 usec/call for reactome
Time: 9.1681212 secs. calls: 3 avg: 3056040.4 usec/call for find-pathway-genes
Time: 8.6186269 secs. calls: 267 avg: 32279.5 usec/call for add-pathway-genes
Zero calls to find-go-term
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 0.0075673 secs. calls: 2 avg: 3783.7 usec/call for find-pathway-member
Time: 75.875504 secs. calls: 3 avg: 25291834.8 usec/call for pathway-gene-interactors
Time: 0.0034641 secs. calls: 3 avg: 1154.7 usec/call for pathway-hierarchy
Time: 0.0538150 secs. calls: 2 avg: 26907.5 usec/call for find-protein
Zero calls to find-protein-form
Time: 1.2777262 secs. calls: 6 avg: 212954.4 usec/call for find-mol
so now it's pathway-gene-interactors .. what is it doing? It's calling generate-interactors. And what is that doing? And the answer is ... untraceable. It's written as a call/cc, so this style of tracing won't work. (Surely there is a simpler and easier way of implementing this without call/cc? Can't we just keep track of state explicitly, viz. write a simple object with the needed internal state?)
Anyway, reading the source, we see that it calls biogrid-pairs-pathway, which is equally untraceable. It seems to be something out of the json library (Update: Not the json library, but something else???). The conclusion is then that 88% of the time -- 0.878225489560726 = (/ 75.875504 86.396381) -- is spent not in the atomspace, not in any of the (annotation) guile modules, but in ~~the json library~~ this other thing.
To re-iterate the last remark ... looks like 88% of all CPU time is spent in ~~the json library~~ this other thing. I wouldn't be surprised if that is where all the memory is leaking too, because I just don't see any memory-gobbling code anywhere in the (annotation) module. It has warts and inefficiencies, but nothing quite like that.
FYI, I ran five more heavyweight genes: HSP90B1 AP1G1 PPIB NDUFAF7 RBM5, and some lightweights I didn't write down. The cumulative results are below:
Time: 258.70370 secs. calls: 10 avg: 25870369.8 usec/call for smpdb
Time: 561.16087 secs. calls: 10 avg: 56116087.5 usec/call for reactome
Time: 137.29761 secs. calls: 31 avg: 4428955.1 usec/call for find-pathway-genes
Time: 126.09734 secs. calls: 1668 avg: 75597.9 usec/call for add-pathway-genes
Zero calls to find-go-term
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 0.0789199 secs. calls: 20 avg: 3946.0 usec/call for find-pathway-member
Time: 591.78595 secs. calls: 31 avg: 19089869.3 usec/call for pathway-gene-interactors
Time: 0.0199050 secs. calls: 16 avg: 1244.1 usec/call for pathway-hierarchy
Time: 2.1588711 secs. calls: 20 avg: 107943.6 usec/call for find-protein
Zero calls to find-protein-form
Time: 88.405886 secs. calls: 62 avg: 1425901.4 usec/call for find-mol
Time: 5.6221634 secs. calls: 4943 avg: 1137.4 usec/call for find-pubmed-id
As before, it's clear that pathway-gene-interactors is eating up most of the CPU time. It appears that add-pathway-genes and find-mol also take significant CPU time; I believe that for these two, the time is all lost in the pattern-matcher.
Update: last night, I assumed that make-parameter was a part of json. This morning, I searched for make-parameter and discovered it is part of guile. So, new conclusion: it's not json that is causing trouble, it's the combination of the call/cc plus the use of make-parameter that is creating an explosive gobbling-up of RAM and CPU. That is, the call/cc is probably creating zillions of continuations, keeping them around indefinitely, eating all RAM and cycles.
Now, the obvious question becomes: can one redesign this to not use call/cc? Continuations are a rather olde-fashioned way of implementing what are called "objects" in the C++ world. It's quite easy to write objects in scheme, and you don't need goops to do that. For example, here is a simple object with three methods:
(define (counter)
  (let ((cur-val 0))
    (lambda (method)
      (if (eq? method 'incr) (set! cur-val (+ cur-val 1)))
      (if (eq? method 'decr) (set! cur-val (- cur-val 1)))
      (if (eq? method 'print) (format #t "The current value is ~A\n" cur-val))
    )))
(define my-counter (counter))
(define another-one (counter))
(my-counter 'incr)
(my-counter 'incr)
(another-one 'decr)
(my-counter 'print)
(another-one 'print)
I'm proposing that the call/cc can be replaced by something like this, thus explicitly controlling the state management in the system, and limiting the total RAM usage. The point being that here, the continuation state is very simple: it's just an int called cur-val -- it's small, it's explicitly managed, and unlike call/cc, there is no copying of any kind of large, long, complex recursive execution state.
Parameters are like global variables. They should be used sparingly.
Objects could also be implemented with Guile records, which are a little nicer than plain closures. There are only very few situations where a full call/cc is the right tool. Even when you need continuations it's better to use delimited continuations (e.g. prompts).
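For comparison, a minimal sketch of the same counter done with SRFI-9 records (illustrative only, not code from this repo):

```scheme
(use-modules (srfi srfi-9))

;; the record holds the state that the closure version kept in cur-val
(define-record-type <counter>
  (make-counter value)
  counter?
  (value counter-value set-counter-value!))

(define (counter-incr! c) (set-counter-value! c (+ (counter-value c) 1)))
(define (counter-decr! c) (set-counter-value! c (- (counter-value c) 1)))
(define (counter-print c)
  (format #t "The current value is ~A\n" (counter-value c)))

;; usage:
;; (define my-counter (make-counter 0))
;; (counter-incr! my-counter)
;; (counter-print my-counter)
```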
So yeah: what's the purpose of using call/cc here?
Generally, the code leans heavily on mutation and global state, both of which are unidiomatic in Scheme. It seems to me that most if not all cases of mutation could be removed. Parameters should probably be replaced with either an explicit configuration record / alist or other explicit arguments.
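As a sketch of that suggestion (the names below are made up for illustration, not taken from the repo): instead of reading settings from global parameters, pass an explicit configuration alist (or record) down to the functions that need it.

```scheme
;; a hypothetical explicit configuration, passed as an ordinary argument
(define default-config
  '((namespace . "biological_process molecular_function cellular_component")
    (parents . 0)))

(define (find-terms gene config)
  ;; read the setting from the argument, not from a global parameter
  (let ((ns (assq-ref config 'namespace)))
    (format #t "looking up ~A terms for gene ~A\n" ns gene)))

(find-terms "MRPL27" default-config)
```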
After some sleep and more time to think, here's a collection of performance tuning suggestions:
Use gcc version 7.4 or newer; this provides a 2x performance improvement over gcc-5.4 in the pattern matcher. The difference is just immense. In particular, Ubuntu 16.04 LTS has the old, slow version, so it's critically important to run either ubuntu-18.04 or Debian stable (or a modern version of redhat/fedora/suse, etc.)
Start guile as HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/libhugetlbfs.so.0 guile -- this can offer a 20% to 40% improvement on AtomSpace performance (in general; I have not measured the effect for this code). This also requires that hugepages are configured correctly.
Get the latest atomspace code. In particular, opencog/atomspace#2464, #2466, #2469 combined provide another 20% to 30% (based on other benchmarks; I have not measured here.) Pull req opencog/atomspace#2470 improves biogrid searches by 25%.
The performance of file-loading can be improved based on the comment made here: https://github.com/MOZI-AI/annotation-scheme/issues/93#issuecomment-571388707 - basically, it would be faster to load the atomese datasets by writing a C++ loader, and skipping guile entirely.
I've got a wild idea for changes to the pattern matcher that could help pathway-gene-interactors by some fair bit; however, the idea is complex and will require significant effort. Ping me for status.
I kind of don't want to mention this, because it's problematic. Try the patch below: it improves the wall-clock time by 20% for gene-go-annotation (at the expense of using 2 cpu cores).
--- a/annotation/gene-go.scm
+++ b/annotation/gene-go.scm
@@ -28,9 +28,12 @@
#:export (gene-go-annotation)
)
+(use-modules (ice-9 threads))
+
 (define* (gene-go-annotation gene-nodes file-name #:key (namespace "biological_process molecular_function cellular_component") (parents 0) (protein "True"))
    (let (
Why such a pitiful improvement for 2x cpu use? I'm not sure. I've fiddled with guile threading in other contexts, and it's hard to get more than 2x or 3x improvement from guile threads. There's some kind of severe lock-contention/thread-thrashing when more than 3 threads are used. I have not been able to locate the root cause of this, but I strongly suspect that it's an issue deep in guile, rather than contention in the atomspace.
Also, I believe that using set! is not manifestly thread-safe in guile, which is one more reason to rewrite the code to be tail-recursive (or to use fold or map): those are manifestly thread-safe.
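To illustrate the kind of rewrite meant here (a sketch with made-up names, not code from the repo): the set!-based accumulation mutates shared state, while the fold version builds its result purely, so there is nothing for threads to trample.

```scheme
(use-modules (srfi srfi-1))   ; for fold

;; mutation-based accumulation: not safe if `result` is shared across threads
(define result '())
(define (collect-with-set! items)
  (for-each (lambda (x) (set! result (cons (* x x) result))) items)
  result)

;; pure accumulation with fold: no shared mutable state
(define (collect-with-fold items)
  (fold (lambda (x acc) (cons (* x x) acc)) '() items))
```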
@linas I removed the call/cc usage completely, please check the PR
Generally, the code leans heavily on mutation and global state, both of which are unidiomatic in Scheme. It seems to me that most if not all cases of mutation could be removed.
Yes, it is plagued with mutation. We wrote it like that initially because A) the code is used from python as a submodule and has to maintain the state of requests, and B) this code was our first experience with Scheme, and we had an OOP background.
Removing the mutation for the most part shouldn't be an issue (I will send PRs). But it is difficult for the parser, which has to maintain some state while parsing the scheme (Atomese?) to json.
while parsing the scheme(Atomese?) to json.
Why do you need to parse anything at all? If you need to write json to a pipe or to a file, or whatever, can't you just write it directly? At any rate, none of the code that I looked at here requires json to do anything, so it's not clear why it's even spec'ed in this repo.
Regarding the call/cc removal: I just now tried the patch #101 and am measuring gene-pathway-annotation. I get the following instrumentation:
Time: 43.445660 secs. calls: 12 avg: 3620471.7 usec/call for smpdb
Time: 2534.6847 secs. calls: 13 avg: 194975746.5 usec/call for reactome
Time: 20.328536 secs. calls: 58 avg: 350492.0 usec/call for find-pathway-genes
Time: 19.555418 secs. calls: 3771 avg: 5185.7 usec/call for add-pathway-genes
Zero calls to find-go-term
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 0.1160833 secs. calls: 26 avg: 4464.7 usec/call for find-pathway-member
Time: 3546.3087 secs. calls: 57 avg: 62215941.9 usec/call for pathway-gene-interactors
Zero calls to generate-result
Time: 1870.1507 secs. calls: 81483 avg: 22951.4 usec/call for generate-interactors
Zero calls to build-interaction
Time: 0.3078500 secs. calls: 43 avg: 7159.3 usec/call for pathway-hierarchy
Time: 3.3938785 secs. calls: 25 avg: 135755.1 usec/call for find-protein
Zero calls to find-protein-form
Time: 35.932332 secs. calls: 116 avg: 309761.5 usec/call for find-mol
Time: 6.0779556 secs. calls: 27301 avg: 222.6 usec/call for find-name
Time: 11.885580 secs. calls: 11144 avg: 1066.5 usec/call for find-pubmed-id
The total CPU time, as reported by ps aux, is 5300 seconds (about 89 minutes). Of the 5300, approx 750 was spent loading the dataset, so only about 4550 seconds were spent actually processing. GC has run for 1046 seconds. Of this, there were 500 GC seconds lost in file loading, so the other 546 seconds of GC were lost during processing. This leaves (5300 - 750 - 550) = 4000 seconds spent in actual pathway annotation. The instrumentation found 3550 seconds in pathway-gene-interactors, and I'm unclear as to where the remaining 400 or 500 seconds were lost.
So pathway-gene-interactors is actually very interesting: it is a non-trivial pattern search, exactly the kind of non-trivial search for which the pattern matcher was created.
Curiously, though, of that 3550 seconds, more than half was spent in generate-interactors, and, reading through that code, it does not do anything at all fancy or complicated. So that's troubling.
Now that I understand what make-parameter does, and that biogrid-pairs-pathway is an instance of that, I would like to suggest that there might be a better, faster substitute: use (make-atom-set) from the core opencog module. I think that (make-atom-set) is (almost) a drop-in replacement for make-parameter, except that I'm guessing it's faster, because it is fine-tuned for atoms. (I'm not sure if it's an exact drop-in replacement; I wasn't aware of make-parameter before. The general, overall idea seems to be the same, though, and make-atom-set can only ever hold atoms, and nothing else.)
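For reference, make-parameter is plain Guile (SRFI-39); a minimal demo of its behavior, unrelated to any code in this repo:

```scheme
(define verbosity (make-parameter 1))

(display (verbosity)) (newline)     ; prints 1

;; parameterize rebinds the value only within its dynamic extent
(parameterize ((verbosity 3))
  (display (verbosity)) (newline))  ; prints 3

(display (verbosity)) (newline)     ; prints 1 again

;; calling the parameter with an argument sets its current value
(verbosity 5)
(display (verbosity)) (newline)     ; prints 5
```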
More analysis of gene-pathway-annotation:
I've now run this 4 times with parents=0, and each time it calmly proceeds at 100% CPU for about 2 hours, when RAM usage starts to grow, slowly at first, and then ever-faster. Examining (gc-stats) shows that much or most of this growth is on the guile side, not the AtomSpace side.
To narrow this down, I've added print statements to gene-pathway-annotation to provide a report after it finishes each gene. Most genes take zero or one seconds; a few take 30 or 60 seconds, and maybe one out of 10 takes five minutes. This is not surprising. I bet if I graphed this, I'd get a Zipf distribution. It would actually be an interesting graph to look at and contemplate, since it gives you a good idea of where the biological complexity lies, and where to focus software development effort. I'm not just saying this to intellectualize: this is an actual software development technique you can apply. Nature is Zipfian. You have to be prepared for that.
Anyway, I'm monitoring the CPU times, and the length of two lists: the result list and the pwlst list. For example, doing the reactome for gene RPS20 took over half an hour (2095 seconds)(!), and caused (length result) to grow to 17812 from 9882. As it did so, the guile heap size grew from 19MBytes to 26MBytes. Also, gc ran 7K times during that half-hour. (Prior to this point, gc had only run about 1K times.) There were 270 seconds spent in GC, or about 270/2095 = 13% of the CPU time. (The smpdb part for the same gene took 45 minutes, with a similar bump for gc.)
Now, RPS20 is only number 13 out of 681. So I haven't even gotten through 2% of the grand-total number of genes ... and it has taken 60 minutes of CPU time to get this far. So the estimated run-time is fifty hours. That's a lot. If the length of result grows linearly like this, then it will be about 1 million long by the end (except Zipfian growth isn't really linear, so I'm not sure). Based on the gc behavior so far, I'm guessing that this will also be the cause of the guile heap explosion (which is only an hour or two away, based on prior experience -- and not fifty hours away).
If I read the code in gene-pathway.scm, it seems that the only reason we need result is so that we can create one monster-sized atom: (ListLink result), https://github.com/MOZI-AI/annotation-scheme/blob/57fbdfe5d6f11af31af43489d979865ecdab8082/annotation/gene-pathway.scm#L54, which is then written to a file. Placing such a monster-sized Atom into the AtomSpace is .. possible, certainly, but is going to be slow and wasteful, given that it is only used to write things out to a file, and then ignored and forgotten. It's pointless bloat. If/when that file is ever read in again, the converse holds true: you will have one or two or a handful of pointlessly-huge atoms, which are unreferenced, ignored and forgotten, eating up RAM, but not being useful in any way.
Given that this also seems to be smashing the guile heap, I'm thinking that a solution here would be to write results to a pipe or file(*), as they are obtained, so that they don't sit in the guile heap. This alone might be enough to avoid the Increase MAXHINCR or MAX_HEAP_SECTS issue.
Footnote (*): If one wanted to get fancy, one could create a thread-safe QueueValue as discussed elsewhere, and then specialize it so that one end of the queue was a pipe or a file. So, just stuff results, as they are obtained, into the QueueValue, and bingo - they then pop out wherever they are meant to go. QueueValues have been discussed for other apps, but attaching them to a pipe/file is a new (and fun) idea.
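A minimal sketch of the simpler (non-QueueValue) version of this idea, with made-up helper names: render each per-gene result to an output port as soon as it is computed, instead of consing it onto a growing result list.

```scheme
;; hypothetical: `annotate-one` computes one gene's annotations
(define (annotate-genes-to-port gene-list annotate-one port)
  (for-each
    (lambda (gene)
      (let ((r (annotate-one gene)))
        (write r port)          ; stream it out immediately...
        (newline port)))        ; ...so it never accumulates in the heap
    gene-list))

;; usage:
;; (call-with-output-file "path-results.scm"
;;   (lambda (port)
;;     (annotate-genes-to-port gene-list annotate-one-gene port)))
```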
OK, so ... 140 minutes in, the guile heap has started to grow more rapidly; we're now on gene 27 of 681, and the guile heap is now 670MBytes .. but the length of result is only 30K, so it is not that much larger: not enough to account for the enlarging heap. Now, pwlen is 27 - so that is very modest, but it seems like the heap grows by 20MB or 50MB or 200MB every time that pwlen grows by one. So what's going on there? Another hypothesis: does this have something to do with make-parameter, again?
OK so at 166 min (cpu time, not wall-clock time) RAM usage is exploding. The guile heap is 3GBytes as I write this, and growing. The gene being processed is ERCC8, number 28 of 681. The atomspace is only 3% larger than it was at the beginning, so effectively all of the RAM explosion is the guile heap.
Six minutes later (wall-clock time), the guile heap is at 6GBytes... It's now growing by a gig or two every 5 minutes, and based on past experience, this will continue until OOM. Unclear if this is due to ERCC8, or if it's a general systemic thing. (Update: it's a systemic thing; doing ERCC8 all by itself does not cause any problems.) Completely wild and unfounded guess, but this might be due to #103 ... (Update: Nope. See #103 for details; an experimental patch does not solve the explosion.)
During the explosion (now 21GB RSS) I have 5215 seconds spent in the code (not counting GC) but instrumentation shows:
Time: 1142.1741 secs. calls: 27 avg: 42302744.5 usec/call for smpdb
Time: 1814.1401 secs. calls: 27 avg: 67190375.5 usec/call for reactome
Time: 2779.7014 secs. calls: 104 avg: 26727897.6 usec/call for pathway-gene-interactors
Time: 2223.7673 secs. calls: 75 avg: 29650230.0 usec/call for pathway-hierarchy
In the past, the time in pathway-hierarchy was minuscule, but now it's accounting for half. Maybe that is where the explosion is? Yes, it is. Before the explosion:
Time: 152.94239 secs. calls: 74 avg: 2066789.0 usec/call for pathway-hierarchy
Time: 65.234275 secs. calls: 101 avg: 645883.9 usec/call for check-pathway
During the explosion:
Time: 2202.2932 secs. calls: 75 avg: 29363909.6 usec/call for pathway-hierarchy
Time: 2570.8179 secs. calls: 104 avg: 24719402.9 usec/call for check-pathway
Nothing else is changing: only these two. One single call to pathway-hierarchy racked up more than half-an-hour, and 3 calls to check-pathway racked up almost 45 minutes, and nothing else was going on on the system. So this is now the focus.
Update: The explosion is fixed in #105. I do not understand why the original code was exploding.
After #105, I am now measuring:
Time: 0.2827939 secs. calls: 235 avg: 1203.4 usec/call for pathway-hierarchy
Time: 0.0124460 secs. calls: 327 avg: 38.1 usec/call for check-pathway
Yep, it all looks under control now. So far, 252 of 681 genes are done, wall-clock time = cpu time = 23259 seconds, and instrumentation shows:
Time: 4516.1446 secs. calls: 251 avg: 17992607.8 usec/call for smpdb
Time: 18734.084 secs. calls: 251 avg: 74637783.6 usec/call for reactome
Time: 703.77133 secs. calls: 1207 avg: 583074.8 usec/call for find-pathway-genes
Time: 676.99826 secs. calls: 70435 avg: 9611.7 usec/call for add-pathway-genes
Time: 1.7678717 secs. calls: 503 avg: 3514.7 usec/call for find-pathway-member
Time: 21444.577 secs. calls: 1206 avg: 17781572.8 usec/call for pathway-gene-interactors
Time: 55.257491 secs. calls: 887597 avg: 62.3 usec/call for generate-interactors
Time: 1.1757659 secs. calls: 860 avg: 1367.2 usec/call for pathway-hierarchy
Time: 0.1137129 secs. calls: 1200 avg: 94.8 usec/call for check-pathway
Time: 46.105715 secs. calls: 502 avg: 91844.1 usec/call for find-protein
Time: 1051.8928 secs. calls: 2413 avg: 435927.4 usec/call for find-mol
Time: 101.48543 secs. calls: 495541 avg: 204.8 usec/call for find-name
Time: 27.638528 secs. calls: 33525 avg: 824.4 usec/call for find-pubmed-id
RSS is holding steady at 3.9GB; the guile heap is at 74MBytes. It appears that pathway-gene-interactors eats 92% of the cpu time. This is mostly a well-structured BindLink, and, as mentioned elsewhere, might benefit from altering the pattern matcher to use more tricks from SAT solving, e.g. clause pruning.
For all 681 genes, I get:
Time: 11062.356 secs. calls: 681 avg: 16244281.7 usec/call for smpdb
Time: 50644.088 secs. calls: 681 avg: 74367236.4 usec/call for reactome
Time: 2523.0404 secs. calls: 3769 avg: 669419.0 usec/call for find-pathway-genes
Time: 2382.0801 secs. calls: 183670 avg: 12969.3 usec/call for add-pathway-genes
Time: 5.3277335 secs. calls: 1362 avg: 3911.7 usec/call for find-pathway-member
Time: 53795.878 secs. calls: 3769 avg: 14273249.7 usec/call for pathway-gene-interactors
Time: 100.53601 secs. calls: 2133970 avg: 47.1 usec/call for generate-interactors
Time: 1.4259363 secs. calls: 2197 avg: 649.0 usec/call for pathway-hierarchy
Time: 0.2986100 secs. calls: 3080 avg: 97.0 usec/call for check-pathway
Time: 60.727192 secs. calls: 1362 avg: 44586.8 usec/call for find-protein
Time: 5315.8542 secs. calls: 7538 avg: 705207.5 usec/call for find-mol
Time: 232.81251 secs. calls: 1269787 avg: 183.3 usec/call for find-name
Time: 32.682447 secs. calls: 37884 avg: 862.7 usec/call for find-pubmed-id
Time: 6655.7199 secs. calls: 250256 avg: 26595.6 usec/call for locate-node
Time: 51.833697 secs. calls: 165299 avg: 313.6 usec/call for add-loc
Total runtime = 61810 seconds (17+ hours). The caching tweak opencog/atomspace#2470 provides no improvement at all (a surprise...!??)
The function pathway-gene-interactors appears to be called ~~twice~~ several times for every gene. Since it accounts for ~~almost all~~ 60% of CPU time, performance can be ~~almost doubled~~ more than tripled by caching the results from the first call. See pull req #108.
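A minimal sketch of the caching idea (not the actual patch; whether equal?-based hashing is the right equality for the real arguments is an assumption, and only functions whose results don't change as the atomspace grows should be memoized):

```scheme
;; hypothetical helper: wrap a one-argument function with a result cache
;; (assumes the cached value is never #f)
(define (memoize-1 fn)
  (let ((cache (make-hash-table)))
    (lambda (arg)
      (or (hash-ref cache arg)
          (let ((val (fn arg)))
            (hash-set! cache arg val)
            val)))))

;; e.g., assuming repeated calls with the same argument return the same thing:
;; (define pathway-gene-interactors (memoize-1 pathway-gene-interactors))
```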
After #108 and after #2471 I get a run-time of 14382 seconds, below. It looks like locate-node and find-mol are now the bottlenecks; I will run again with a memoized version of locate-node, whereas find-mol just needs a major redesign, it looks to be just plain poorly designed.
(gene-pathway-annotation gene-list "path-results"
#:pathway "reactome smpdb"
#:namespace ""
#:parents 0)
Time: 5523.3938 secs. calls: 681 avg: 8110710.4 usec/call for smpdb
Time: 8711.0721 secs. calls: 681 avg: 12791588.9 usec/call for reactome
Time: 1239.0317 secs. calls: 3769 avg: 328742.8 usec/call for find-pathway-genes
Time: 1090.1401 secs. calls: 183670 avg: 5935.3 usec/call for add-pathway-genes
Time: 5.6236221 secs. calls: 1362 avg: 4128.9 usec/call for find-pathway-member
Time: 10241.141 secs. calls: 1992 avg: 5141134.9 usec/call for pathway-gene-interactors
Time: 44.340495 secs. calls: 326140 avg: 136.0 usec/call for generate-interactors
Time: 1.2863656 secs. calls: 2197 avg: 585.5 usec/call for pathway-hierarchy
Time: 0.3182439 secs. calls: 3080 avg: 103.3 usec/call for check-pathway
Time: 7.9706777 secs. calls: 1362 avg: 5852.2 usec/call for find-protein
Time: 2732.6366 secs. calls: 7538 avg: 362514.8 usec/call for find-mol
Time: 235.05631 secs. calls: 1269787 avg: 185.1 usec/call for find-name
Time: 31.868447 secs. calls: 37884 avg: 841.2 usec/call for find-pubmed-id
Time: 2536.5483 secs. calls: 250256 avg: 10135.8 usec/call for locate-node
Time: 54.564485 secs. calls: 165299 avg: 330.1 usec/call for add-loc
Patch #113 provides an additional 1.4x perf improvement; it decimates the outrageous amount of cpu time spent in locate-node, dropping it by a factor of over 100x. The latest runtime is 10554 seconds wall-clock:
Time: 2666.1765 secs. calls: 681 avg: 3915090.3 usec/call for smpdb
Time: 7748.8657 secs. calls: 681 avg: 11378657.4 usec/call for reactome
Time: 145.61418 secs. calls: 3769 avg: 38634.7 usec/call for find-pathway-genes
Time: 25.489879 secs. calls: 183670 avg: 138.8 usec/call for add-pathway-genes
Time: 3.7654062 secs. calls: 1362 avg: 2764.6 usec/call for find-pathway-member
Time: 10092.041 secs. calls: 1992 avg: 5066285.5 usec/call for pathway-gene-interactors
Time: 52.768517 secs. calls: 387813 avg: 136.1 usec/call for generate-interactors
Time: 1.0577963 secs. calls: 2197 avg: 481.5 usec/call for pathway-hierarchy
Time: 0.2858599 secs. calls: 3080 avg: 92.8 usec/call for check-pathway
Time: 6.2650159 secs. calls: 1362 avg: 4599.9 usec/call for find-protein
Time: 160.83718 secs. calls: 7538 avg: 21336.9 usec/call for find-mol
Time: 0.4193283 secs. calls: 1536 avg: 273.0 usec/call for find-name
Time: 38.692187 secs. calls: 37884 avg: 1021.3 usec/call for find-pubmed-id
Time: 19.413556 secs. calls: 10578 avg: 1835.3 usec/call for locate-node
Time: 50.028115 secs. calls: 165299 avg: 302.7 usec/call for add-loc
Prior to #2471 I am getting these results for pathway+namespaces: 213324 seconds elapsed (that's 59 hours). This is with patch #108 applied. The guile heap was stable at 150MBytes until the very end, but then blew up to 16GB during the file-write. There is something deeply pathological about the file-write. See issue #111
(gene-pathway-annotation gene-list "my-path-results"
#:pathway "reactome smpdb"
#:namespace "biological_process molecular_function cellular_component"
#:parents 0)
Time: 38572.841 secs. calls: 681 avg: 56641469.2 usec/call for smpdb
Time: 174376.03 secs. calls: 681 avg: 256058777.1 usec/call for reactome
Time: 195778.03 secs. calls: 3769 avg: 51944290.1 usec/call for find-pathway-genes
Time: 195630.63 secs. calls: 183670 avg: 1065120.2 usec/call for add-pathway-genes
Time: 6.0805688 secs. calls: 1362 avg: 4464.4 usec/call for find-pathway-member
Time: 11552.666 secs. calls: 1992 avg: 5799531.2 usec/call for pathway-gene-interactors
Time: 47.519931 secs. calls: 326140 avg: 145.7 usec/call for generate-interactors
Time: 1.6497263 secs. calls: 2197 avg: 750.9 usec/call for pathway-hierarchy
Time: 0.2986489 secs. calls: 3080 avg: 97.0 usec/call for check-pathway
Time: 61.086829 secs. calls: 1362 avg: 44850.8 usec/call for find-protein
Time: 5542.0427 secs. calls: 7538 avg: 735213.9 usec/call for find-mol
Time: 5362.4040 secs. calls: 183670 avg: 29195.9 usec/call for find-go-term
Time: 172771.66 secs. calls: 1444979 avg: 119566.9 usec/call for find-name
Time: 32.630861 secs. calls: 37884 avg: 861.3 usec/call for find-pubmed-id
Time: 5161.0680 secs. calls: 183670 avg: 28099.7 usec/call for find-memberln
Time: 2523.4792 secs. calls: 4934750 avg: 511.4 usec/call for add-go-info
Zero calls to find-parent
Time: 194719.85 secs. calls: 250256 avg: 778082.7 usec/call for locate-node
Time: 58.992121 secs. calls: 165299 avg: 356.9 usec/call for add-loc
After applying patch #113 and also #2471, I get a 13x speedup, from 2.5 days to 4.6 hours (16569 seconds, actually) with the following profile. Note that #113 memoizes locate-node and thus decimates execution time for that. I'm sure that aggressive memoization of many other pathways will improve performance by additional factors.
Time: 3854.4545 secs. calls: 681 avg: 5659992.0 usec/call for smpdb
Time: 12306.061 secs. calls: 681 avg: 18070573.9 usec/call for reactome
Time: 4497.3962 secs. calls: 3769 avg: 1193259.8 usec/call for find-pathway-genes
Time: 4361.5304 secs. calls: 183670 avg: 23746.6 usec/call for add-pathway-genes
Time: 6.7424449 secs. calls: 1362 avg: 4950.4 usec/call for find-pathway-member
Time: 11454.457 secs. calls: 1992 avg: 5750229.3 usec/call for pathway-gene-interactors
Time: 58.174262 secs. calls: 387543 avg: 150.1 usec/call for generate-interactors
Time: 1.4399045 secs. calls: 2197 avg: 655.4 usec/call for pathway-hierarchy
Time: 0.3013116 secs. calls: 3080 avg: 97.8 usec/call for check-pathway
Time: 6.9740637 secs. calls: 1362 avg: 5120.5 usec/call for find-protein
Time: 185.13425 secs. calls: 7538 avg: 24560.1 usec/call for find-mol
Time: 4315.0330 secs. calls: 183670 avg: 23493.4 usec/call for find-go-term
Time: 2.1543851 secs. calls: 1537 avg: 1401.7 usec/call for find-name
Time: 40.919641 secs. calls: 37884 avg: 1080.1 usec/call for find-pubmed-id
Time: 4281.8303 secs. calls: 183670 avg: 23312.6 usec/call for find-memberln
Time: 2596.3717 secs. calls: 4934750 avg: 526.1 usec/call for add-go-info
Zero calls to find-parent
Time: 24.414441 secs. calls: 10578 avg: 2308.0 usec/call for locate-node
Time: 57.044634 secs. calls: 165299 avg: 345.1 usec/call for add-loc
So biogrid-interaction-annotation completed without any failures. It took 26.4 wall-clock hours to run, or 30.25 CPU-hours. Of this, 7.4 hours was in guile GC. At the conclusion, the guile heap was 424MB, of which 294MB was free. At the conclusion, RSS=4.6GB and Virt=9.1GB RAM. The final atomspace contained 9M atoms, which is about double the starting size of 4.6M Atoms. The output file was 250MBytes in size (uncompressed).
Conclusions: RAM usage seems reasonable. The GC usage is a bit outrageous; I suspect it's due to results being saved up in very long lists (as witnessed by most of the heap being free, once it completes). Total CPU time seems outrageous; I don't know where it went. Will instrument and measure.
At a wall-clock time of approx 3022 seconds, the instrumented code shows this profile:
Time: 268.29849 secs. calls: 75 avg: 3577313.2 usec/call for match-gene-interactors
Time: 2945.8490 secs. calls: 75 avg: 39277986.8 usec/call for find-output-interactors
Time: 2067.2802 secs. calls: 46366 avg: 44586.1 usec/call for generate-result
Time: 8.2950412 secs. calls: 92733 avg: 89.5 usec/call for build-interaction
Time: 259.28143 secs. calls: 95765 avg: 2707.5 usec/call for find-protein-form
Time: 0.2283316 secs. calls: 256 avg: 891.9 usec/call for find-name
Time: 61.317968 secs. calls: 46367 avg: 1322.4 usec/call for find-pubmed-id
so the instrumentation is capturing almost all CPU cycles. Notable is that generate-result is eating 2/3rds of all cycles. Looking at this code reveals a structure similar to the old implementation of generate-interactors, relying on a linear search of make-parameter. This suggests that a fix similar to that of #103 will also fix things here.
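To sketch the shape of that kind of fix (I'm guessing at the details; the names are illustrative, not taken from the repo): replace a membership scan over a list held in a parameter with an O(1) hash-table lookup.

```scheme
;; before (roughly): a list in a parameter, scanned on every call
;; (define seen-pairs (make-parameter '()))
;; (if (member pair (seen-pairs)) ... )

;; after: constant-time membership test
;; (whether equal?-hashing is right for the actual keys is an assumption)
(define seen-pairs (make-hash-table))
(define (seen? pair) (hash-ref seen-pairs pair #f))
(define (mark-seen! pair) (hash-set! seen-pairs pair #t))
```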
Much of the remaining time is lost in find-output-interactors, which has the interesting-type pattern search; as mentioned elsewhere, I have ideas for fixing that.
After patching (see issue #106), performance more than doubles:
Time: 34.925137 secs. calls: 75 avg: 465668.5 usec/call for match-gene-interactors
Time: 1409.3709 secs. calls: 75 avg: 18791612.0 usec/call for find-output-interactors
Time: 330.18767 secs. calls: 45971 avg: 7182.5 usec/call for generate-result
Time: 7.3280768 secs. calls: 91943 avg: 79.7 usec/call for build-interaction
Time: 254.12882 secs. calls: 94881 avg: 2678.4 usec/call for find-protein-form
Time: 0.0568310 secs. calls: 247 avg: 230.1 usec/call for find-name
Time: 57.894680 secs. calls: 45972 avg: 1259.3 usec/call for find-pubmed-id
Wall-clock time was 1447 seconds for the first 75 genes.
To do all 681 (for protein interaction):
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Proteins"
#:parents 0)
Time: 320.42858 secs. calls: 681 avg: 470526.6 usec/call for match-gene-interactors
Time: 14377.118 secs. calls: 681 avg: 21111774.1 usec/call for find-output-interactors
Time: 4280.8901 secs. calls: 506787 avg: 8447.1 usec/call for generate-result
Time: 68.353809 secs. calls: 1013574 avg: 67.4 usec/call for build-interaction
Time: 3567.3805 secs. calls: 1023015 avg: 3487.1 usec/call for find-protein-form
Time: 0.1304556 secs. calls: 561 avg: 232.5 usec/call for find-name
Time: 582.39830 secs. calls: 506787 avg: 1149.2 usec/call for find-pubmed-id
To do all 681 (for protein interaction), after opencog/atomspace#2471, the performance more than triples (3.2x faster), to a total wall-clock time of 4095 seconds for all 681 genes, below. New bottlenecks now emerge: generate-result, which bottlenecks on find-protein-form, which takes 2946/4095 = 72% of the total. Also, GC takes 766 seconds, so GC is now a large part of the overhead.
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Proteins"
#:parents 0)
Time: 301.44206 secs. calls: 681 avg: 442646.2 usec/call for match-gene-interactors
Time: 3766.9675 secs. calls: 681 avg: 5531523.5 usec/call for find-output-interactors
Time: 3682.5601 secs. calls: 506787 avg: 7266.5 usec/call for generate-result
Time: 70.670652 secs. calls: 1013574 avg: 69.7 usec/call for build-interaction
Time: 2946.7653 secs. calls: 1023015 avg: 2880.5 usec/call for find-protein-form
Time: 0.1185459 secs. calls: 561 avg: 211.3 usec/call for find-name
Time: 603.00910 secs. calls: 506787 avg: 1189.9 usec/call for find-pubmed-id
Zero calls to find-memberln
Zero calls to add-go-info
Zero calls to find-parent
Time: 5.8014628 secs. calls: 9003 avg: 644.4 usec/call for locate-node
Zero calls to add-loc
To do all 681 (for gene interaction, nparents=0) prior to pull reqs #2470 and #2471:
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Genes"
#:parents 0)
Time: 243.10967 secs. calls: 681 avg: 356989.2 usec/call for match-gene-interactors
Time: 11622.276 secs. calls: 681 avg: 17066485.2 usec/call for find-output-interactors
Time: 840.91115 secs. calls: 506787 avg: 1659.3 usec/call for generate-result
Time: 27.270739 secs. calls: 506787 avg: 53.8 usec/call for build-interaction
Zero calls to find-protein-form
Time: 11.672809 secs. calls: 47096 avg: 247.9 usec/call for find-name
Time: 589.23931 secs. calls: 506787 avg: 1162.7 usec/call for find-pubmed-id
Time: 150.11183 secs. calls: 9441 avg: 15900.0 usec/call for find-go-term
Time: 143.25006 secs. calls: 9441 avg: 15173.2 usec/call for find-memberln
Time: 99.247953 secs. calls: 165392 avg: 600.1 usec/call for add-go-info
All 681 (genes, nparents=1) took 1.5x longer than above, and the profile is different:
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Genes"
#:parents 1)
Time: 5726.3430 secs. calls: 681 avg: 8408726.8 usec/call for match-gene-interactors
Time: 11767.411 secs. calls: 681 avg: 17279605.1 usec/call for find-output-interactors
Time: 6330.9327 secs. calls: 506787 avg: 12492.3 usec/call for generate-result
Time: 28.161058 secs. calls: 506787 avg: 55.6 usec/call for build-interaction
Zero calls to find-protein-form
Time: 11.677343 secs. calls: 47096 avg: 247.9 usec/call for find-name
Time: 598.22882 secs. calls: 506787 avg: 1180.4 usec/call for find-pubmed-id
Time: 5625.8787 secs. calls: 9441 avg: 595898.6 usec/call for find-go-term
Time: 146.04930 secs. calls: 9441 avg: 15469.7 usec/call for find-memberln
Time: 247.59138 secs. calls: 406306 avg: 609.4 usec/call for add-go-info
Time: 5445.4115 secs. calls: 165392 avg: 32924.3 usec/call for find-parent
Time: 30.441913 secs. calls: 9441 avg: 3224.4 usec/call for locate-node
Zero calls to add-loc
After pull req #2471, the biogrid gene annotation is nearly 10x faster -- 1204 seconds vs 11887 seconds. Once again, the total time spent in guile GC is starting to take a large bite out of overall performance. All 681 genes is now:
(biogrid-interaction-annotation gene-list "my-biogrid-results"
#:namespace "biological_process molecular_function cellular_component"
#:interaction "Genes"
#:parents 0)
Time: 140.83417 secs. calls: 9441 avg: 14917.3 usec/call for find-go-term
Time: 235.42098 secs. calls: 681 avg: 345698.9 usec/call for match-gene-interactors
Time: 943.35577 secs. calls: 681 avg: 1385250.8 usec/call for find-output-interactors
Time: 834.82299 secs. calls: 506787 avg: 1647.3 usec/call for generate-result
Time: 27.197969 secs. calls: 506787 avg: 53.7 usec/call for build-interaction
Zero calls to find-protein-form
Time: 11.202908 secs. calls: 47096 avg: 237.9 usec/call for find-name
Time: 600.92915 secs. calls: 506787 avg: 1185.8 usec/call for find-pubmed-id
Time: 133.93277 secs. calls: 9441 avg: 14186.3 usec/call for find-memberln
Time: 94.814262 secs. calls: 165392 avg: 573.3 usec/call for add-go-info
Zero calls to find-parent
Time: 27.621622 secs. calls: 9441 avg: 2925.7 usec/call for locate-node
After pull req #110, which caches the pubmed-id lookup, it drops to 802 seconds:
Time: 139.56440 secs. calls: 9441 avg: 14782.8 usec/call for find-go-term
Time: 215.86620 secs. calls: 681 avg: 316984.1 usec/call for match-gene-interactors
Time: 561.00654 secs. calls: 681 avg: 823798.2 usec/call for find-output-interactors
Time: 418.98079 secs. calls: 582626 avg: 719.1 usec/call for generate-result
Time: 31.595877 secs. calls: 582626 avg: 54.2 usec/call for build-interaction
Zero calls to find-protein-form
Time: 10.863595 secs. calls: 47096 avg: 230.7 usec/call for find-name
Time: 182.52820 secs. calls: 582626 avg: 313.3 usec/call for find-pubmed-id
Time: 132.88098 secs. calls: 9441 avg: 14074.9 usec/call for find-memberln
Time: 93.780047 secs. calls: 165392 avg: 567.0 usec/call for add-go-info
Zero calls to find-parent
Time: 28.674607 secs. calls: 9441 avg: 3037.2 usec/call for locate-node
Summary wrap-up: per the collection of patches below, I'm getting 6x speedups on the pathway workload (gene-pathway-annotation) and what looks like a 100x speedup on the biogrid workload (biogrid-interaction-annotation). Don't delay, apply today.
My current collection of hacks is here: https://github.com/linas/annotation-scheme/tree/hackery -- this is the code that was used to obtain all of the results reported above. You can use it to verify the reported results. You can use git diff to see how it's been changed from master, i.e. as a reference to compare against. All of the various experiments were run by loading the file called load-files.scm first, then test-harness.scm second, and selecting one of the tests.
One experiment not yet performed: replace the for-each with n-par-for-each 2 or n-par-for-each 3 (or n-par-map 2) in the following places, and see if a speedup is possible (per-location results are noted next to each link; a sketch of the change is below):
https://github.com/MOZI-AI/annotation-scheme/blob/7a79385bcc59c4bae13c0843d4cb9710f2a5a3cd/annotation/gene-pathway.scm#L38 -- gene-go annotation is 1.55x faster for 2x CPU usage.
https://github.com/MOZI-AI/annotation-scheme/blob/7a79385bcc59c4bae13c0843d4cb9710f2a5a3cd/annotation/gene-go.scm#L33 -- gene-go annotation is 1.2x faster for 2x CPU usage. Booo.
https://github.com/MOZI-AI/annotation-scheme/blob/7a79385bcc59c4bae13c0843d4cb9710f2a5a3cd/annotation/biogrid.scm#L34 -- biogrid-protein was 1.7x faster for 2x cpu usage.
As mentioned before, a speedup might not happen: guile parallelizes very poorly, I don't know why, and it requires either Ludovic or Andy Wingo or someone of that class to figure it out.
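A minimal sketch of that change, assuming the standard (ice-9 threads) API (n-par-for-each takes the thread count as its first argument and is otherwise a drop-in for for-each); process-gene here is a stand-in for whatever the loop body does at those call sites:

```scheme
(use-modules (ice-9 threads))

(define (process-gene gene)
  ;; stand-in for the real per-gene work
  (format #t "annotating ~A\n" gene))

(define gene-nodes (list "MRPL27" "HSP90B1" "PPIB"))

;; before:
(for-each process-gene gene-nodes)

;; after, limited to 2 worker threads:
(n-par-for-each 2 process-gene gene-nodes)
```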
Closing; all suggested changes have been made.
OK, this is not a bug report, but rather a status update (and a request that someone carry out the work mentioned below).
I got the "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS" error when running gene-pathway-annotation. I am currently retrying with a version of bdwgc with ./configure --enable-large-config set, to see if that helps. "Too many heap sections" is due to severe memory fragmentation. In particular, the error showed up while guile was using "only" 6GB RAM, of which probably 98% is in the atomspace, and not guile itself. (Right now, on a re-run, the process is using 4.1GB RAM, and guile itself has only 26MBytes in the heap, which seems modest enough to me.) So RAM consumption per se does not seem to be an issue, neither for the atomspace nor for guile.
Correction: just a few minutes ago, guile went into GC overdrive. It's now trying to GC every 10 seconds or so, and it's using almost all CPU cores to do the GC. This happened after about 5 hours runtime, when everything was steady and stable. top/ps aux shows RAM usage suddenly exploded to 10GB RSS and 19GB virt. Currently (gc-stats) shows 4GB in the guile heap, and 0.5GB of the guile heap free. No clue what/why/where in the code would cause such a monstrous demand on the heap. Update: this is now fixed in #105. I still don't understand the root-cause of the explosion, but the fix does fix it.
~~I'm fairly certain that the excessive usage of deeply nested GroundedSchemaNodes is the root cause of the memory fragmentation. (Update: now I'm not so sure)~~ I am now certain that this is not the issue, at all. ~~I'm going to ponder if there is some really clever way to avoid this. However,~~ It might be worthwhile to pre-emptively redesign the code to be flatter, and less recursive. That means replacing the BindLink/GroundedSchema combination with GetLink and srfi-1 map. Just use GetLink to get lists of things, and use map to apply whatever processing is needed to the results. ~~I'm guessing this will solve this problem, and maybe even improve runtime. Not sure.~~ This is an interesting alternative design to explore, and it might (or might not) make things smaller, faster. I'm not sure.
Here's one problem with GroundedSchema: every time that it is called, the guile subsystem has to be re-entered again, and there is some non-trivial number of microseconds (maybe 50?) needed to do this. It also creates a half-dozen C stack frames, and so you risk over-flowing the C stack, if not careful. Many years ago, the atomspace code was tuned to minimize both of these effects, and so maybe they are not as severe as I remember them being, but still -- it's a potential trouble area. Or not. Maybe all this is OK as is. See also next comment.
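To make the suggested restructuring concrete, here is a very rough sketch using toy data; the atom names and the processing step are made up for illustration and are not taken from the annotation code:

```scheme
(use-modules (opencog) (opencog exec))

;; toy data: two "genes" in a pathway
(Member (Concept "gene-A") (Concept "some-pathway"))
(Member (Concept "gene-B") (Concept "some-pathway"))

;; Instead of a BindLink whose rewrite term calls back into guile through a
;; (GroundedSchema "scm: ...") for every match, first fetch the matches with
;; a plain GetLink...
(define matches
  (cog-outgoing-set
    (cog-execute!
      (Get (Variable "$g")
           (Member (Variable "$g") (Concept "some-pathway"))))))

;; ...and then post-process them with ordinary scheme map.
(define (process-gene g) (format #t "found ~A\n" g))
(map process-gene matches)
```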