geneontology / noctua

Graph-based modeling environment for biology, including prototype editor and services
http://noctua.geneontology.org/
BSD 3-Clause "New" or "Revised" License

Some entities are not available in the graph editor #810

Closed ukemi closed 1 year ago

ukemi commented 1 year ago

Using the 'add individual' functionality of the graph editor, it appears that some entities are not available. For example, try to autocomplete on 'gluconeogenesis' or the mouse (EMAPA) liver.

kltm commented 1 year ago

Working single example: gluconeogenesis Exists in NEO: http://noctua-amigo.berkeleybop.org/amigo/term/GO:0006094

ukemi commented 1 year ago

Maybe I should add this additional bit of weirdness: if I use one of the other entry portals, like 'add annotation', I can get these to autocomplete.

kltm commented 1 year ago

Query is: http://noctua-golr.berkeleybop.org/select?defType=edismax&qt=standard&indent=on&wt=json&rows=10&start=0&fl=*,score&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&fq=document_category:%22ontology_class%22&fq=regulates_closure:%22CHEBI:33695%22%20OR%20regulates_closure:%22GO:0032991%22&facet.field=source&facet.field=idspace&facet.field=subset&facet.field=is_obsolete&q=gluconeogenesis*&=&qf=annotation_class^3&qf=annotation_class_label_searchable^5.5&qf=description_searchable^1&qf=synonym_searchable^1&qf=alternate_id^1
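For reference, the query above can be rebuilt from its parts. This is a sketch using only the Python standard library; the host, filters, and boosts are copied from the URL above (trimmed to the relevant parameters):

```python
from urllib.parse import urlencode

# The two fq clauses are the filters under suspicion: hits must fall in
# the regulates closure of CHEBI:33695 OR of GO:0032991
# (protein-containing complex).
params = {
    "defType": "edismax",
    "wt": "json",
    "rows": 10,
    "fq": [
        'document_category:"ontology_class"',
        'regulates_closure:"CHEBI:33695" OR regulates_closure:"GO:0032991"',
    ],
    "q": "gluconeogenesis*",
    "qf": [
        "annotation_class^3",
        "annotation_class_label_searchable^5.5",
        "synonym_searchable^1",
    ],
}
# doseq=True repeats the fq and qf keys, as in the original URL.
query = "http://noctua-golr.berkeleybop.org/select?" + urlencode(params, doseq=True)
print(query)
```

Dropping the second fq entry from the dict and re-running is a quick way to test whether the closure filter is what excludes the term.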

kltm commented 1 year ago

So, I suspect, it's getting filtered out by the CHEBI:33695 / GO:0032991 filter?

ukemi commented 1 year ago

Hmmmm? Not sure how to interpret this query. Since I am able to add anything 'within reason', shouldn't it only be limited by a valid ID? It always used to work for everything.

kltm commented 1 year ago

Basically, it's saying: return anything that matches the string, as long as it's in the regulates closure of CHEBI:33695 OR the regulates closure of GO:0032991. Playing around with the query, the issue seems to be that the term is not in the regulates closure of "GO:0032991".

ukemi commented 1 year ago

So has the query changed, or has the ontology changed? It's not clear to me why a continuant would be in the regulates closure. But I can get UBERON terms to autocomplete, so they must be there?

kltm commented 1 year ago

There has been no change in the software for this, and the filters have been there for years (?), so it wouldn't be at that end. As well, gluconeogenesis is in the noctua-amigo instance just fine, so no worries there. If you were expecting the term to be available, it sounds like it might come down to an ontology change of some kind. We can maybe loop in Jim if you want to explore that a little?

ukemi commented 1 year ago

Sounds like it has to be on the ontology end then. Maybe we need to loop him in.

kltm commented 1 year ago

@balhoff We were wondering if you might have any thoughts on this thread?

kltm commented 1 year ago

Although, why would one expect a biological process to be in the closure of protein-containing complex or a ChEBI entity (http://noctua-amigo.berkeleybop.org/amigo/term/GO:0032991)?

ukemi commented 1 year ago

I looked at that and was puzzled as well. That's why I asked if the query had changed. Uberon entities certainly don't fit, but they are showing up.

balhoff commented 1 year ago

That query is not at all what I always assumed was being searched. It is weird that you can autocomplete things like Uberon liver—it doesn't regulate anything.

ukemi commented 1 year ago

Yep. ???

kltm commented 1 year ago

Talked to @balhoff; I think we're good on the Uberon liver. @kltm to trace back the filters on the "add individual" entry.

kltm commented 1 year ago

Okay, walking through what's going on more carefully, @cmungall 's instincts were correct and there is something else going on here. (The filters were a red herring--I likely pulled that from the wrong entry while trying to debug.) As an initial check, I'm rerunning the NEO load to see if there was an issue with the index generation, which seems most likely.

@vanaukenk The long story is that the "uber noodle" (the add individual free-for-all input) has historically worked off of a different document type than the other entries--basically a general index that takes the information from /anything/ and any field that is fed into the system. For some reason, a mere 364000 entities got loaded, instead of the expected 1975019. That round number sounds an awful lot like the push break in the loader. No matter what, I want to rerun and track down how a load did not complete and did not throw an error; this is a system issue that needs to be traced.
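The round number is consistent with a batched load stopping at a commit boundary. A minimal sketch of the idea (hypothetical; this is not the owltools loader code, and the batch size is taken from the "Processed 1000 ... and committing" log lines further down):

```python
BATCH_SIZE = 1000  # matches the per-commit batch size seen in the logs

def load_all(docs, commit, fail_at=None):
    """Batched loader sketch: if a commit fails and the error is
    swallowed upstream, the index ends at a round batch boundary
    (e.g. 364000 of 1975019), exactly as observed."""
    loaded = 0
    for start in range(0, len(docs), BATCH_SIZE):
        if fail_at is not None and loaded >= fail_at:
            break  # simulated broken connection: loader stops silently
        commit(docs[start:start + BATCH_SIZE])
        loaded += BATCH_SIZE
    return loaded

# A 5000-doc load that silently dies after 3000 docs reports a clean,
# round count -- the same signature as the 364000 figure above.
print(load_all(["d"] * 5000, commit=lambda b: None, fail_at=3000))
```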

To answer a second question: why use this "general" field instead of the usual "ontology_class" field? I believe the original reasoning was that it is a better field for overall searching as it creates a special search packet with all sorts of things like identifier snippets (NS:123 and '123') which are not in some of the more structured search documents. We could switch over to the "ontology_class" field for this search, but search ability would slightly degrade.
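As a rough illustration of the "search packet" idea (the helper and snippet rules here are hypothetical; the real load's fields differ):

```python
def search_snippets(curie: str, label: str) -> list[str]:
    """Build the kinds of identifier snippets described above: the
    full CURIE, the bare local id (e.g. '0006094'), the namespace
    prefix, and the label. Hypothetical helper for illustration."""
    prefix, _, local = curie.partition(":")
    return [curie, local, prefix, label]

print(search_snippets("GO:0006094", "gluconeogenesis"))
```

Because the general document carries the bare local id as well as the full CURIE, a user can autocomplete on either form; the structured "ontology_class" documents don't carry those extra snippets, which is the search-ability trade-off mentioned above.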

Either way, the first step is to track down the apparent loader issue a little more and replace the current index with a more functional one.

kltm commented 1 year ago

General ontology load self-reports that it loaded everything:

[2022-12-07T22:07:03.597Z] 2022-12-07 22:07:03,419 INFO (OntologyGeneralSolrDocumentLoader:55) Doing clean-up (final) commit at 1974266 general ontology documents...

Optimization completed, and Solr reported 3949301 documents in total, which is what we'd expect (basically 2x). It seems like the index was built properly... The machine itself has no disk space issues.

I'm continuing the rerun to make sure we're starting clean.

Also, looking at my notes, last week we had two NEO build failures and a "hiccup" when I successfully ran and deployed on Sunday. My current guess is that the issue is at a "devops" level, rather than a construction level (although that does not explain the round number).

kltm commented 1 year ago

In this most recent run, 256000 entities got loaded. I'm not sure why the load seems to drop off partway through, but I'm pretty sure this is where the problem lies. Noting that these runs are slightly shorter than ones that may have been "better" a few weeks ago. Also noting that we have no rollback mechanism (https://github.com/geneontology/neo/issues/21).

kltm commented 1 year ago

Using Luke and examining the index directly, amigo is reporting correctly: only 256000 general terms were added, out of nearly 2m. Rebuilding again and diving into the log looking for trouble.
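Instead of opening the index in Luke, a facet-only query can report per-category document counts directly. A sketch (assumes the public noctua-golr endpoint above; standard Solr faceting parameters):

```python
from urllib.parse import urlencode

# rows=0 returns no documents, only the facet counts, so a truncated
# general load shows up as a low count for its document_category.
params = {
    "q": "*:*",
    "rows": 0,
    "wt": "json",
    "facet": "true",
    "facet.field": "document_category",
}
url = "http://noctua-golr.berkeleybop.org/select?" + urlencode(params)
print(url)
```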

kltm commented 1 year ago

Error point found:

[2022-12-09T22:40:55.129Z] 2022-12-09 22:40:54,831 INFO  (OntologyGeneralSolrDocumentLoader:48) Processed 1000 general ontology docs at 257000 and committing...
[2022-12-09T22:42:02.631Z] Exception in thread "main" org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Broken pipe (Write failed)
[2022-12-09T22:42:02.631Z]      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
[2022-12-09T22:42:02.631Z]      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)

The issue seems to be that owltools does not sufficiently crash out from this error for Jenkins to pick up. I'm not sure if this is a new problem or if this has been quietly happening in the background for some time.
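The fix pattern is for the loader to propagate any commit failure as a nonzero exit status, which is what CI tools like Jenkins key off. A sketch of the shape (hypothetical stand-in functions, not the owltools code):

```python
import sys

def commit_batch(batch):
    """Hypothetical stand-in for the Solr commit that raised the
    broken-pipe SocketException in the log above."""
    raise ConnectionError("Broken pipe (Write failed)")

def run_load(batches):
    """Return a process exit status: 0 only if every batch commits.
    Returning/exiting nonzero is what lets Jenkins see the failure
    instead of treating a truncated load as a success."""
    for batch in batches:
        try:
            commit_batch(batch)
        except ConnectionError as exc:
            print(f"FATAL: load aborted: {exc}", file=sys.stderr)
            return 1
    return 0

print(run_load([["doc-1"]]))
```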

kltm commented 1 year ago

@ukemi @vanaukenk I've done a couple more loads and finally got one that completed: 1974291 entities as expected. With that, assuming things are now working as expected at your end, I'd vote to close this ticket, with anything else (e.g. widget filter redo) going into a new ticket. I've created a new ticket in data QC to try and make sure this becomes automatically checked (https://github.com/geneontology/pipeline/issues/309); will be doing a manual check until that's cleared.
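Until the automated pipeline check lands, the manual check amounts to comparing the loaded count against the expected count with a small tolerance. A sketch using the counts from this thread (the function name and tolerance are illustrative):

```python
def load_is_complete(loaded: int, expected: int, tolerance: float = 0.01) -> bool:
    """Return True only when the loaded document count is within
    tolerance of the expected count, flagging truncated index loads."""
    return loaded >= expected * (1 - tolerance)

print(load_is_complete(256000, 1975019))   # the truncated load seen above
print(load_is_complete(1974291, 1975019))  # the completed load
```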

pgaudet commented 1 year ago

This looks like it's the same issue as https://github.com/geneontology/neo/issues/111

ukemi commented 1 year ago

The EMAPA terms are back.

kltm commented 1 year ago

@pgaudet I believe it's likely a different issue.