Okay, I think we have a repeatable instance of the failure, one that might cause a larger cumulative failure:
Let's look at: http://noctua.berkeleybop.org/editor/graph/gomodel:581e072c00000295 And while I'm looking at the current version of this model, for repeatability, I believe it's: https://github.com/geneontology/noctua-models/blob/3e58a39944848c93d204fe0b64423e1bcb411013/models/57f1b14b00000173.ttl
Calculating the GPAD on this does not seem to return, or it returns only long after the timeouts have occurred. It also seems to keep grinding away at the problem continuously. I wonder whether, if this happened often enough, we would end up in the same state we were in yesterday.
@malcolmfisher103
Now, possibly, the above model is not the direct cause of the issue; i.e. it may have been triggered earlier by some other input and now nobody can make use of the reasoner. However, it looks like we have eliminated both a cosmic-ray fluke and a purely cumulative failure as explanations.
Also pinging @cmungall to see if he might have some insight before the workshop tomorrow.
I see this in top:
21094 swdev 20 0 42.508g 0.032t 21396 S 1.7 11.6 531:37.88 java
The memory for the heap is maxed out, so right now it's stuck in garbage collection. I don't know why memory would be so high. I suspect this is more connected to many people running the GPAD query code than to many people using Arachne. Arachne is part of the GPAD code, but the reasoning part is really pretty trivial, and I would expect much lower memory usage overall compared to ELK. In the GPAD request, though, some big models may turn out to be too much for the Jena SPARQL engine, especially if queried repeatedly.
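For illustration only (this is not the actual Minerva code path): with Jena 3.x, a SELECT over a local copy of a big model can at least be bounded with a per-query timeout, so a pathological model fails fast instead of grinding indefinitely. The file name and the query below are placeholders.

```java
import java.util.concurrent.TimeUnit;

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.RDFDataMgr;

public class BoundedGpadQuery {
    public static void main(String[] args) {
        // Hypothetical local copy of the problematic model file.
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, "57f1b14b00000173.ttl");

        // Placeholder query; the real GPAD extraction queries are more involved.
        Query query = QueryFactory.create("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }");

        try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
            // Fail fast instead of grinding: abort the query after 30 seconds.
            qexec.setTimeout(30, TimeUnit.SECONDS);
            ResultSetFormatter.out(System.out, qexec.execSelect(), query);
        }
    }
}
```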
Should have checked that... Interesting that it seemed to keep operating for some models even after failing for others. I'll try upping the memory (let's hear it for large machines) and do some testing.
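For the testing, besides watching top, `jstat -gcutil <pid> 5000` shows GC activity directly. As a rough in-process alternative, the standard JVM management beans report the same thing; the sketch below is a generic check (it would have to run inside the server JVM or be adapted to a remote JMX connection), not anything Minerva-specific.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapPressureCheck {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long gcMillis = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcMillis += gc.getCollectionTime();
            }
            // If "used" stays pinned near "max" and the cumulative GC time climbs steadily,
            // the process is spending its cycles collecting garbage rather than doing work.
            System.out.printf("heap used=%dM max=%dM cumulative GC=%ds%n",
                heap.getUsed() >> 20, heap.getMax() >> 20, gcMillis / 1000);
            Thread.sleep(5000);
        }
    }
}
```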
Good plan for now, but hopefully we can find out exactly what is taking so much memory. Something's not right.
Still not okay, but interesting to test nonetheless.
So whatever is going on there, it seems not to directly be a memory-related issue.
This is caused by a combination of the pathological bigness + shallowness of NEO, and generated rules that Arachne uses to mark indirect vs. direct types. I made a tiny change to the rule generator and brought it into Minerva: 140bbc9abd3b7fc7d6fa52275d351ef5099ad682
I think this will keep things under control for now (@kltm please update Minerva!) but I want to spend some more time later looking at what's going on in Arachne prior to making the change.
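To make the NEO point above concrete: this is a toy illustration, not Arachne's actual rule syntax or the change in that commit. If one rule is generated per class/superclass pair so that asserted types can be distinguished from inferred, indirect ones, then the rule set scales with the size of the ontology rather than with the model being reasoned over, and NEO has an enormous number of very shallow classes. The class names and the `ex:indirectType` property below are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IndirectTypeRuleSketch {
    public static void main(String[] args) {
        // Toy subclass axioms standing in for a tiny slice of a NEO-like ontology:
        // each gene-product class sits directly under a handful of broad groupings.
        Map<String, List<String>> superclasses = Map.of(
            "neo:GeneProduct_1", List.of("neo:ProteinCodingGene", "neo:InformationBiomacromolecule"),
            "neo:GeneProduct_2", List.of("neo:ProteinCodingGene", "neo:InformationBiomacromolecule"));

        // One generated rule per (class, superclass) pair: anything typed with the
        // subclass also gets the superclass, marked as indirect. With hundreds of
        // thousands of classes, the rule set gets very large even though the
        // hierarchy itself is shallow.
        List<String> rules = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : superclasses.entrySet()) {
            for (String sup : e.getValue()) {
                rules.add(String.format(
                    "[ (?x rdf:type %s) -> (?x ex:indirectType %s) ]", e.getKey(), sup));
            }
        }
        rules.forEach(System.out::println);
        System.out.println(rules.size() + " rules generated");
    }
}
```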
Testing now. So far, it looks like a winner! The model now produces a GPAD in reasonable time and only has a very temporary spike in resources needed.
Thank you for the lightning fast turnaround here--today is workshop day and we were worried that users would be accidentally melting the server all day!
Yesterday, Minerva was found in such a state that, while all functionality that did not require Arachne seemed fine (saving, individual addition, etc.), all functionality that required Arachne (reasoner responses, GPAD output, etc.) would timeout.
Inspecting the system, all cores seemed to be pegged around 50%; this continued for several (tens of?) hours.
It is unknown if:
A restart of Minerva has returned it to a usable state, with no issues found.
At this point, nothing more can be done. However, I wanted to get this reported sooner rather than later, as I suspect it will crop up again.
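One diagnostic idea for the next time the cores sit pegged for hours: besides a thread dump with jstack, per-thread CPU totals from the standard ThreadMXBean show which threads are actually doing the work. The sketch below is a generic JVM check (run in-process or over JMX), not anything specific to Minerva.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BusyThreadReport {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported()) {
            System.err.println("per-thread CPU timing not supported on this JVM");
            return;
        }
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            long cpuNanos = threads.getThreadCpuTime(id);
            if (info != null && cpuNanos > 0) {
                // Threads with large, still-growing CPU totals are the ones worth a jstack look.
                System.out.printf("%-40s %8d ms%n", info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}
```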
@balhoff