geneontology / minerva

BSD 3-Clause "New" or "Revised" License

Under some circumstances, the Arachne "reasoning" subsystem fails #117

Closed: kltm closed this issue 7 years ago

kltm commented 7 years ago

Yesterday, Minerva was found in a state where all functionality that did not require Arachne seemed fine (saving, individual addition, etc.), while all functionality that required Arachne (reasoner responses, GPAD output, etc.) would time out.

Inspecting the system, all cores seemed to be pegged at around 50%; this continued for several (tens of?) hours.

It is unknown if:

A restart of Minerva has returned it to a usable state, with no found issues.

At this point, nothing more can be done. However, I wanted to get this reported sooner rather than later, as I suspect it will crop up again.

@balhoff

kltm commented 7 years ago

Okay, I think we have a repeatable instance of the failure, one that might cause a larger cumulative failure:

Let's look at: http://noctua.berkeleybop.org/editor/graph/gomodel:581e072c00000295. While I'm looking at the current version of this model, for repeatability I believe the specific revision is: https://github.com/geneontology/noctua-models/blob/3e58a39944848c93d204fe0b64423e1bcb411013/models/57f1b14b00000173.ttl

Calculating the GPAD for this model does not seem to return, or returns only long after timeouts have occurred. The server also seems to keep grinding away at the problem continuously. I wonder whether, if this happened enough times, we would end up in the same state we were in yesterday.

@malcolmfisher103

kltm commented 7 years ago

Now, possibly, the above model is not the direct cause of the issue, i.e. it may have been triggered earlier by some other input, and now nobody can make use of the reasoner. However, it looks like we have eliminated both the cosmic-ray (random one-off) explanation and the directly cumulative one.

kltm commented 7 years ago

Also pinging @cmungall to see if he might have some insight before the workshop tomorrow.

balhoff commented 7 years ago

I see this in top:

21094 swdev 20 0 42.508g 0.032t 21396 S 1.7 11.6 531:37.88 java

The memory for the heap is maxed out, so right now it's stuck in garbage collection. I don't know why memory usage would be so high. I suspect this is more connected to many people running the GPAD query code than to many people using Arachne as such; Arachne is part of the GPAD code path, but its reasoning part is really pretty trivial, and I expect much lower memory usage overall compared to ELK. In the GPAD request, though, some big models may turn out to be too much for the Jena SPARQL engine, especially if queried repeatedly.
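
For anyone else poking at the live process: a quick way to confirm the stuck-in-garbage-collection diagnosis is jstat -gcutil <pid> from the shell, or, from inside the JVM, something like the following standalone sketch using the standard java.lang.management API (just an illustration, not anything in Minerva; it reports on whatever JVM it runs in):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class GcCheck {
        public static void main(String[] args) {
            // Heap usage: "used" sitting at or near "max" means the
            // collector has no headroom left to reclaim.
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap used: %d MB / max: %d MB%n",
                    heap.getUsed() >> 20, heap.getMax() >> 20);

            // Cumulative GC stats: collection time climbing rapidly while the
            // application makes no progress is the classic GC-thrash signature.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }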

kltm commented 7 years ago

Should have checked that... Interesting that it seemed to keep operating for some models even after failing for others. I'll try upping the memory (let's hear it for large machines) and do some testing.
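
For the record: the heap ceiling is whatever -Xmx was set to at startup, so after editing the start script it's worth double-checking what the running JVM actually picked up. A throwaway sketch like this (nothing Minerva-specific) does it:

    public class HeapCap {
        public static void main(String[] args) {
            // maxMemory() reports the -Xmx ceiling; totalMemory() is how much
            // of it the JVM has currently allocated.
            Runtime rt = Runtime.getRuntime();
            System.out.printf("max heap: %d MB, currently allocated: %d MB%n",
                    rt.maxMemory() >> 20, rt.totalMemory() >> 20);
        }
    }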

balhoff commented 7 years ago

Good plan for now, but hopefully we can find out exactly what is taking so much memory. Something's not right.

kltm commented 7 years ago

Still not okay, but interesting to test nonetheless.

So whatever is going on there, it does not seem to be directly a memory-related issue.

balhoff commented 7 years ago

This is caused by a combination of the pathological bigness and shallowness of NEO and the generated rules that Arachne uses to mark indirect vs. direct types. I made a tiny change to the rule generator and brought it into Minerva: 140bbc9abd3b7fc7d6fa52275d351ef5099ad682

I think this will keep things under control for now (@kltm please update Minerva!), but I want to spend some more time later looking at what's going on in Arachne before making further changes.
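
To make the direct-vs-indirect distinction concrete (this is only a toy illustration of the idea, not Arachne's rule format or Minerva code): an individual's direct type is, roughly, the most specific class it is asserted to have, and its indirect types are everything above that in the subclass hierarchy. Materializing that ancestor closure for every individual is what blows up when the ontology is as big and as flat as NEO.

    import java.util.*;

    public class TypeMarking {
        // Toy subclass graph: class -> its direct superclasses.
        static final Map<String, List<String>> SUPERS = Map.of(
                "kinase_activity", List.of("catalytic_activity"),
                "catalytic_activity", List.of("molecular_function"),
                "molecular_function", List.of());

        // All ancestors of a class, i.e. the indirect types an individual
        // inherits from one asserted (direct) type.
        static Set<String> ancestors(String cls) {
            Set<String> out = new LinkedHashSet<>();
            Deque<String> todo = new ArrayDeque<>(SUPERS.getOrDefault(cls, List.of()));
            while (!todo.isEmpty()) {
                String next = todo.pop();
                if (out.add(next)) todo.addAll(SUPERS.getOrDefault(next, List.of()));
            }
            return out;
        }

        public static void main(String[] args) {
            String asserted = "kinase_activity"; // the individual's direct type
            System.out.println("direct:   " + asserted);
            System.out.println("indirect: " + ancestors(asserted));
        }
    }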

kltm commented 7 years ago

Testing now. So far, it looks like a winner! The model now produces a GPAD in reasonable time, with only a very temporary spike in resource usage.

kltm commented 7 years ago

Thank you for the lightning-fast turnaround here; today is workshop day and we were worried that users would be accidentally melting the server all day!