MOZI-AI / annotation-scheme

Human Gene annotation service backend
GNU General Public License v3.0
3 stars 4 forks source link

find-pathway-name should be written out #88

Closed linas closed 4 years ago

linas commented 4 years ago

This is a performance optimization: find-pathway-name performs a very simple search, using a GetLink. it would almost surely be much faster if it was written out, long-hand. Update: this previous statement might be completely false. I have not measured. The proposal below and in #97 might be a bad idea. I will try to measure, and get a better feel for what is going on. In the mean time, I'm leaving this issue open.

Why might a hand-written version be faster? Because of three things:

Flipside: doing it "by hand", in scheme, requires multiple entries in and out of the guile interpreter, each of which has significant overhead, that might outweigh the losses mentioned above.

To add injury to insult: the SetLink is never removed, so it now clutters up the atomspace forever, slowing down future searches. Also, the GetLink is never removed, which also clutters up the atomspace. Both of these can be avoided by using a temp atomspace (I can provide code for that, it's in the learning repo, but maybe should be moved to the main repo). But, for such simple searches, its best to just avoid using GetLink, and reserve that for complicated searches.

linas commented 4 years ago

Similar remark for find-name

linas commented 4 years ago

So I count eight of these so far:

linas commented 4 years ago

OK, so I created an implementation for find-name in pull req #97 that does the lookup directly. With some minor modifications, it could be used to handle all three cases.

I notice that all three of these are doing very basic EvaluationLink searches, of the form

   (Evaluation
      (Predicate "foo")
      (List
         (Concept "bar")
         (Concept "baz")))

and that, given foo and bar they want to find baz (or maybe given baz, they want bar). Given the austere simplicity of this search, and the fact that it seems to be done a lot(?) it would be best to write some C++ code that performed this lookup directly. I do not yet have any sense of where the bottlenecks are, so I cannot tell if this improves the performance situation. But, from casual perusal of the code, it does seem like these functions might get called a lot.

linas commented 4 years ago

Anyway, I don't really know what the performance impact of this may or may not be. It is probably worthwhile to do this both ways, and measure. Ideally, I'm thinking that a benchmark for this would be a good idea. See the https://github.com/opencog/benchmark repo. I'll see what I can do.

linas commented 4 years ago

I'm now trying to understand BindLink/GetLink performance. Measuring the C++ code, while runninggene-go-annotation I see this:

Things are very different for gene-pathway-annotation

For biogrid-interaction-annotation I get:

linas commented 4 years ago

Using the find-name implementation as in #97, I'm measuring (using the previous scheme-based instrumentation code)

Time: 0.8138226 secs. calls: 9801 avg:     83.0 usec/call for find-name

Using the original find-name (based on GetLink) I get:

Time: 3.2294861 secs. calls: 16459 avg:    196.2 usec/call for find-name

so that's kind-of disappointing: the GetLink version is 2.5x slower than the hand-written version. I was expecting parity, because I was expecting the GetLink ctor to run at 40 usec, and the GetLink exec to run at 60 usec. Why its more than double is unclear.

.. Bizarre .. I measured the time to run the GetLink directly, and it is 70 usec (end-to-end, from where cog-execute! enters into c++, from guile, until it leaves c++ and returns to guile). The GetLink ctor takes about 30 usec to run ... so where did the other 196-70-30=96 usec go? Right now, I am guessing that the 96 usecs is lost in guile+atomspace, preparing the half-dozen arguments needed to create the GetLink. This seems .. a bit high. And certainly undesirable. Yuck.

linas commented 4 years ago

Based on further exploration of find-name, it appears that it is being called over and over again with the same arguments. Never mind. This statement is incorrect, and was due to bad instrumentation on my part.

linas commented 4 years ago

I'm going to close this as "bad idea"; I think I know where the slowdowns came from, and how to avoid them with syntax-case.

At any rate, twiddling this does not affect overall performance, this is not where the bottlenecks are.