geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Add the NEO into the main pipeline #35

Open kltm opened 6 years ago

kltm commented 6 years ago

The general idea would be to eliminate as much mechanism as possible as far as deployment and maintenance of multiple pipelines and servers. To this end, I've proposed that NEO (the neo.owl owltools ontology load, sorry @cmungall) gets folded into the main solr load and index. This would simply be:

A separate issue, not dealt with here, would be adding the creation of neo.owl itself. As we are just pulling from a URL, this can be separated.

Another, weaker, formulation would be to drop the NEO index separately, but within the new pipeline framework and runs.

kltm commented 6 years ago

Well, starting and exploring this a little bit, it will not pan out in a "merged" index--we would clobber on general, which is used (for example) by the ubernoodle for NEO. Instead, at least for now, we'll look at making another index on the pipeline and switch over to deployment like the other indices we have now.

cmungall commented 6 years ago

switch to solr6 and use a separate core?

kltm commented 6 years ago

The idea is to simplify our current setup, reducing the number of deployed servers and/or the number of distinct pipelines. As Solr 6.x (higher now) is orthogonal, splitting it out separately would be at least a temporary bump up in the above.

kltm commented 5 years ago

From an earlier experiment, the overlay is problematic. We'll work towards the weaker form to make progress on things like #73 and https://github.com/geneontology/neo/issues/38#issuecomment-451241964

kltm commented 5 years ago

Until we have a fix for the NEO job automation, it will be a manual step.

kltm commented 5 years ago

From @hdrabkin:

I had created 6 new PRO IDs and they became available in our MGI GO EI on Friday. That means they are in the mgi.gpi (I verified), which I expected would then make them available in Noctua today, but they are not there: PR:000050039 PR:000050038 PR:000050037 PR:000050036 PR:000050035 PR:000050034

kltm commented 5 years ago

Also see https://github.com/geneontology/neo/issues/38#issuecomment-451241964

hdrabkin commented 5 years ago

So does this mean these ids will be available soon?

kltm commented 5 years ago

A manual load is finishing now and a spot check seems positive -- try them now?

@cmungall I think there may be something up with owltools and the NEO load. It seems to slow down towards the end of the ontology document loading (not for general docs), eventually giving out. I'll try to get a more nuanced view at some point, but it may be best to look at this as a use case for a new python loader after the go-cams.

kltm commented 5 years ago

Actually, I'm not sure we use anything but the "general" doc in the index... That would greatly speed-up and simplify things.

hdrabkin commented 5 years ago

Hi @cmungall and @kltm Just checked this morning and the Pro ids are all available now. Thanks.

kltm commented 5 years ago

@cmungall We'll need to discuss 1) how we want to migrate the neo build to a new pipeline (whether main or not) and 2) what actual deployment looks like for the ontology

kltm commented 5 years ago

This will need to be tested a bit more, but it looks like the additional resources and updates on our new pipeline can make short work of the NEO products build: http://skyhook.berkeleybop.org/issue-35-neo-test/products/solr/ This can be used to juggle updates in and out more safely in the interim.

kltm commented 5 years ago

From @cmungall : the PURLs are from the given S3 bucket, not Jenkins, so we just clobber them out. He has also agreed with the plan of a second pipeline to support NEO as a separate product from the main pipeline, with the chance to revisit later.

kltm commented 5 years ago

Need more mem for Java:

/obo/BFO_0000040> "BFO:0000040"^^xsd:string) AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/CHEBI_23367> "CHEBI:23367"^^xsd:string) }
18:02:38 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
18:02:38    at com.carrotsearch.hppcrt.sets.ObjectHashSet$EntryIterator.<init>(ObjectHashSet.java:734)
18:02:38    at com.carrotsearch.hppcrt.sets.ObjectHashSet$1.create(ObjectHashSet.java:784)
18:02:38    at com.carrotsearch.hppcrt.sets.ObjectHashSet$1.create(ObjectHashSet.java:779)
18:02:38    at com.carrotsearch.hppcrt.ObjectPool.<init>(ObjectPool.java:74)
18:02:38    at com.carrotsearch.hppcrt.IteratorPool.<init>(IteratorPool.java:51)
18:02:38    at com.carrotsearch.hppcrt.sets.ObjectHashSet.<init>(ObjectHashSet.java:778)
18:02:38    at com.carrotsearch.hppcrt.sets.ObjectHashSet.<init>(ObjectHashSet.java:157)
18:02:38    at uk.ac.manchester.cs.owl.owlapi.HPPCSet.<init>(MapPointer.java:444)
18:02:38    at uk.ac.manchester.cs.owl.owlapi.MapPointer.putInternal(MapPointer.java:324)
18:02:38    at uk.ac.manchester.cs.owl.owlapi.MapPointer.init(MapPointer.java:151)
18:02:38    at uk.ac.manchester.cs.owl.owlapi.MapPointer.getValues(MapPointer.java:190)
18:02:38    at uk.ac.manchester.cs.owl.owlapi.OWLImmutableOntologyImpl.getAxioms(OWLImmutableOntologyImpl.java:1325)
cmungall commented 5 years ago

We could easily split NEO into multiple separate files to be read. It seems like the current approach won't scale if we add SwissProt.


hdrabkin commented 5 years ago

Hi Seth, we have another new ID in our GPI that needs to get into Noctua: PR:A0A1W6AWH1

kltm commented 5 years ago

@hdrabkin I believe that this is a different issue. Yours should be cleared up on the completion of https://github.com/geneontology/noctua/issues/612

kltm commented 4 years ago

Previously discussed with @cmungall: we would spin out this branch into a new top-level pipeline. After starting work on that, I do not believe it's viable compared to formalizing it as a new branch in the current pipeline: it would either be a very fiddly piece of code that had to play carefully so as not to accidentally clobber skyhook locations, or it would require a small rewrite of how skyhook works. While neither of these is insurmountable, given the small and likely temporary nature of this pipeline, I think formalizing the current branch into something slightly more permanent is the fastest and safest way forward.

kltm commented 4 years ago

Discussed with @goodb on how to make this a workable transition:

With the completion of this, we can now either build the GOlr index for go-lego in the main pipeline, or do it elsewhere. Deployment would still be once a week or so, so it may be fine to keep the degenerate pipeline-neo branch separate.

kltm commented 4 years ago

Talking to @goodb and @cmungall today on the software call, we believe that rewriting would not be that hard and would likely help with management.

kltm commented 4 years ago

@goodb Continuing conversation from geneontology/neo#53.

I think the core issue here is that we have separate processes building and using NEO--one in the old legacy pipeline and one on a branch of the new pipeline. I believe that they are out of sync. Some of this is outlined here (https://github.com/geneontology/pipeline/issues/35), but I think that the best step towards the final goal at this point is to eliminate the legacy pipeline and publication and add it to the new pipeline branch, adding publication and running it automatically. I'm having trouble pinpointing exactly where the current sync setup is failing, but reducing the moving parts would do nothing but help as far as I can tell.

By adding this to the new experimental pipeline (essentially https://github.com/geneontology/pipeline/blob/issue-35-neo-test/Jenkinsfile#L214), that should ensure that the correct file is in the correct place. Unfortunately, we'll still have the lag of tracking the last snapshot build of go-lego (which we need to solve by having uniform catalog files available to all tooling, another issue). I'm hoping to bypass this for now by just doing the imports myself.

kltm commented 4 years ago

Talking to @goodb , these are some steps we'll be taking toward completing this issue, as well as hopefully solving (or better understanding) issues around geneontology/neo#51 geneontology/neo#52 geneontology/neo#53 etc.:

(To add the above to the current "main" pipelines, we'll have to solve the various catalog issues.)

kltm commented 4 years ago

The current build using the new pipeline seems to have all of the entities listed here, so likely an improvement: https://github.com/geneontology/neo/issues/54#issuecomment-598520303

That said, there's a drop in entities: 1264812 vs 1090142. This is mostly due to the loss of a large number of CHEBI and UBERON terms. @goodb , is this expected? I believe it was, but I'm not finding a hard reference and want to confirm. CHEBI: 134308 -> 22931 UBERON: 15133 -> 4668

goodb commented 4 years ago

@kltm yes, this would generally make sense.

BTW, I have started a collection of NEO/GO_LEGO queries to use to test the completeness of the integrated ontology collection sometimes known as go-lego. Right now they are in a unit test in the branch of minerva where I am working on the go-lego_as_blazegraph concept. We should talk about how to get these, or something like them, into a test that can be run after this step in the pipeline.

kltm commented 4 years ago

@goodb Excellent! If you can give me a command line I can try and bolt it in. I'll need two things:

The first is more important, as I can make the journal that I'm producing available to you for testing in the interim.

goodb commented 4 years ago

@kltm okay, not a problem. I have code that does (1), and I'm sure some fiddling can turn the unit tests into something like (2), but I'm a little worried that the Minerva command line is turning into OWLTools as I keep sticking things like this into it.

If we want to pare down bloat in Minerva, we could instead use the Blazegraph data loader directly for loading purposes: https://blazegraph.com/database/apidocs/com/bigdata/rdf/store/DataLoader.html#main(java.lang.String[])

And then write something smaller that encapsulates the ontology tests - basically just some simple sparql queries.

Or I could just keep going with everything under the Minerva roof. @balhoff any thoughts?

I wonder if there is any part of this that might intersect with Robot?

goodb commented 4 years ago

@kltm I think for 1) the best option is probably to use the native bg methods. This will do it with the current release candidate (now compatible with modern java versions). Replace your_directory_with_ontologies with, you know.. :

curl -L -o blazegraph.jar https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_2_1_6_RC/blazegraph.jar
curl -L -o blazegraph.properties https://raw.githubusercontent.com/geneontology/minerva/master/minerva-core/src/main/resources/org/geneontology/minerva/blazegraph.properties
java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader -defaultGraph http://example.org blazegraph.properties   your_directory_with_ontologies

That will end up producing a blazegraph.jnl file in your working directory. There is a parameter in that blazegraph.properties file that can be used to specify a different location:
com.bigdata.journal.AbstractJournal.file=blazegraph.jnl
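A small sketch of that override, with the properties file stubbed locally and an illustrative /tmp path (neither is from the pipeline):

```shell
# Stub the properties file for illustration; in practice this is the
# blazegraph.properties fetched above.
printf 'com.bigdata.journal.AbstractJournal.file=blazegraph.jnl\n' > blazegraph.properties

# Point the journal at an explicit path before running DataLoader.
sed -i 's|^com.bigdata.journal.AbstractJournal.file=.*|com.bigdata.journal.AbstractJournal.file=/tmp/neo-blazegraph.jnl|' blazegraph.properties

grep '^com.bigdata.journal' blazegraph.properties
```

With that in place, the DataLoader invocation above would write the journal to the pre-defined path rather than the working directory.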

balhoff commented 4 years ago

@goodb just a quick pointer: I have a wrapper for blazegraph data loader: https://github.com/balhoff/blazegraph-runner

goodb commented 4 years ago

Thanks @balhoff, I hadn't really looked into that before. I think you might want to update to a newer blazegraph release? https://github.com/balhoff/blazegraph-runner/blob/master/build.sbt appears to be set to 2.1.4. 2.1.5 was released a year ago, and 2.1.6 seems to be working (though not officially released).

@kltm pretty sure we could do everything required for this step with a script that ran Blazegraph-runner to build the journal and then again to run through a set of sparql queries stored as files. Just need something to read the results of the queries and make a call about pass/fail. Do you have any opinions about how this is set up? e.g. should this be in its own repo somewhere or should we attach it to another project?

@balhoff this looks like it is set up for optional Arachne reasoning. I suppose we could also tune it up to run tbox reasoning using elk or whelk but my understanding is that the input here (merged-go-lego) should already contain all of the required inferences. Right?

balhoff commented 4 years ago

I think you might want to update to a newer blazegraph release?

I believe I am blocked by this issue https://github.com/blazegraph/database/issues/155

I suppose we could also tune it up to run tbox reasoning using elk or whelk but my understanding is that the input here (merged-go-lego) should already contain all of the required inferences. Right?

Yes, the published go-lego has the hierarchy computed, so should work as input to the instance reasoner.

kltm commented 4 years ago

@goodb Re: script or framework for SPARQL file pass/fail checks.

This is starting to sound like a lot of other QC we do in the pipeline. In fact, SPARQL checks like this were essentially the initial use case and scenario for the GORULEs (which have since then evolved). Ideally, at this point, we base what we do on the work done for SPARTA (tagging @dougli1sqrd to give context and maybe a super brief overview) or have a replacement that would take over from where SPARTA is.

kltm commented 4 years ago

@goodb I've tried a mess of variations around com.bigdata.journal.AbstractJournal.file=blazegraph.jnl, but have not found any that produce the desired effect. I believe that I've found a temporary workaround to keep the ball rolling (testing now), but it would be good to make sure we can pre-define all file locations.

balhoff commented 4 years ago

I've tried a mess of variations around com.bigdata.journal.AbstractJournal.file=blazegraph.jnl, but have not found any that produce the desired effect.

Not sure what problems you're having, but blazegraph-runner accepts a journal location as a command-line parameter.

goodb commented 4 years ago

@balhoff sorry I must be doing something dumb.. Having trouble with Blazegraph-runner. I've downloaded the latest release and am trying this command: ./bin/blazegraph-runner --journal mynewjournal.jnl load /Users/benjamingood/gocam_ontology/lego/

I get the error Unknown command 'mynewjournal.jnl', expected one of construct, dump, load, reason, select, update

If I rearrange the parameters such that the load command comes first, as that error message seems to expect, I get a runtime exception:

./bin/blazegraph-runner load --journal blazegraph.jnl /Users/benjamingood/gocam_ontology/lego/
Caused by: java.io.FileNotFoundException: /Users/benjamingood/blazegraph/blazegraph-runner-1.5 (Is a directory)

@kltm you must be using this elsewhere in the pipeline where rdf.geneontology.org is populated. No?

balhoff commented 4 years ago

@goodb couple things:

goodb commented 4 years ago

Doh! just like the examples slightly farther down on the readme... thanks.

@kltm I see no reason not to use this. Download a release from https://github.com/balhoff/blazegraph-runner/releases and run the loader like this: ./bin/blazegraph-runner --journal=mynewjournal.jnl --informat=rdfxml --use-ontology-graph load /Users/benjamingood/gocam_ontology/lego/go-lego-merged-3-5-2020.owl

Awaiting comment from @dougli1sqrd regarding relation of this work to SPARTA.

dougli1sqrd commented 4 years ago

Sparta lives here: https://github.com/geneontology/go-site/tree/master/graphstore/rule-runner

There's a pretty good readme there, but essentially you can point sparta to a sparql endpoint and rules markdown and it will find any sparql defined as an implementation in the rules, run it against the endpoint, and produce a result file.

This could be easily modified to just look at sparql queries themselves. Otherwise this looks pretty close to the use case that is described above. Sparta assumes that the queries are finding errors, so if any results return, then that's considered an error.

It could be nice to wrap the queries in markdown, which this can read.

If you look here: https://raw.githubusercontent.com/geneontology/go-site/master/metadata/rules/gorule-0000007.md

You can see that there's sparql defined in the YAML part of the file, and then there's a description below in markdown. We've called this "yamldown", and we have a tool for that too: https://github.com/dougli1sqrd/yamldown.

This just loads the yamldown into its dictionary (YAML) parts and its text (markdown) parts. This library is used by sparta above.

goodb commented 4 years ago

@kltm just trying to think what the sanest way to add a simple test here would be. SPARTA seems like it would require some reworking, and I'm not sure it's worth it. We would need to launch a SPARQL server for it to talk to, and modify it, as the queries here should return results, not emptiness, if things are good. Not big things really, but my impression is that it would be easier to do it differently.

Looking at https://github.com/geneontology/pipeline/blob/issue-35-neo-test/Jenkinsfile I think it would be a good idea to shift it to blazegraph-runner, e.g.:

curl -L -o blazegraph-runner-1.5.tgz https://github.com/balhoff/blazegraph-runner/releases/download/v1.5/blazegraph-runner-1.5.tgz
tar -xvf blazegraph-runner-1.5.tgz
blazegraph-runner-1.5/bin/blazegraph-runner --journal=/tmp/blazegraph.jnl --informat=rdfxml --use-ontology-graph load /tmp/go-lego.owl

Now it is ready to run SPARQL checks from a provided file, e.g.:

./bin/blazegraph-runner --journal=/tmp/blazegraph.jnl --outformat=tsv select my-sparql-test.rq my-sparql-out.tsv

I imagine the logic for the test could be handled and stored directly in the groovy for the pipeline run? (e.g. If my-sparql-out.tsv contains x007 then all good, else bad).
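A minimal sketch of that pass/fail gate as a standalone shell step (the file name and expected IRI are illustrative, and the runner's TSV is stubbed here rather than generated):

```shell
# Stub the output; in the pipeline this would come from blazegraph-runner's select.
printf '?super\nhttp://purl.obolibrary.org/obo/GO_0008150\n' > my-sparql-out.tsv

# Succeed only if the expected IRI appears in the results; otherwise fail the stage.
if grep -q 'GO_0008150' my-sparql-out.tsv; then
    echo "ontology sanity check: PASS"
else
    echo "ontology sanity check: FAIL" >&2
    false
fi
```

A non-zero exit from a step like this is enough to bring the Jenkins stage down, whether it lives in the Groovy or in a separate script.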

That is what my disturbed mind arrives at for the solution with the least code to write and the least moving parts. Thoughts?

kltm commented 4 years ago

@goodb I'm of two minds about this.

What we're essentially adding to the pipeline is the first "ontology sanity check"--it will likely have its own stage after the two ontologies are built. The consumers of the checks are most likely people like you, maybe some other ontology folks. Given that, it doesn't necessarily have to be pretty and user-friendly; it just has to stop dead if there are problems and bring the pipeline down with it. So whatever works for you--it's not like we're making much use of SPARTA at the moment anyway.

On the other hand, I'm uncomfortable with doing something that does not reuse current patterns unless there is a good reason. We have GO rules, a checker, a centralized place for descriptions and metadata, a logging format, etc. While it is "extra work", it reduces the number of snowflakes and makes things hang together a bit more. One thing I'd like to avoid is starting to embed actual testing in the Groovy. While technically possible, I want to be able to run things on the command line and get 1) a report and 2) a binary good/bad answer. This should be portable to my CLI or Travis or anywhere else.

I think there are a few ways forward here. 1) This could be taken care of in blazegraph-runner, extending it to do what we want. 2) A new checker script, as simple as desired. 3) Bend SPARTA to do what we want. As well, while it's nice to just have SPARQL files that run, it would be good to have documentation about what they check, why, etc.; I'd like to avoid depending on a specific person knowing where things are and why.

goodb commented 4 years ago

@dougli1sqrd @balhoff here are the sparql queries and expected results I'd like to start as tests. I'm sure there are many ways to do this, but this should give an idea of the goal and some things to work with. Hope Robot can help here.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://purl.obolibrary.org/obo/ECO_0000314> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/ECO_0000000

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://purl.obolibrary.org/obo/WBbt_0005753> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/CL_0000003
?super should contain http://purl.obolibrary.org/obo/CARO_0000000

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://purl.obolibrary.org/obo/GO_0000776> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/GO_0110165
?super should contain http://purl.obolibrary.org/obo/GO_0005575

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://purl.obolibrary.org/obo/GO_0022607> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/GO_0008150

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://purl.obolibrary.org/obo/GO_0060090> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/GO_0003674

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://identifiers.org/uniprot/Q13253> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/PR_000000001
?super should contain http://purl.obolibrary.org/obo/CHEBI_36080
?super should contain http://purl.obolibrary.org/obo/CHEBI_33695
?super should contain http://purl.obolibrary.org/obo/CHEBI_24431

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://identifiers.org/zfin/ZDB-GENE-010410-3> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/CHEBI_33695
?super should contain http://purl.obolibrary.org/obo/CHEBI_24431

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super WHERE { <http://identifiers.org/wormbase/WBGene00000275> rdfs:subClassOf* ?super . }

?super should contain http://purl.obolibrary.org/obo/CHEBI_33695
?super should contain http://purl.obolibrary.org/obo/CHEBI_24431
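The expectations above could be mechanized as query/expected-result pairs; a sketch (all file names are illustrative, and the runner's TSV output is stubbed here rather than generated):

```shell
mkdir -p checks
# For each query foo.rq, keep its required IRIs in foo.expected; the
# blazegraph-runner select output for that query would land in foo.tsv.
printf 'http://purl.obolibrary.org/obo/ECO_0000000\n' > checks/eco.expected
printf '?super\nhttp://purl.obolibrary.org/obo/ECO_0000000\n' > checks/eco.tsv

status=0
for expected in checks/*.expected; do
    out="${expected%.expected}.tsv"
    while IFS= read -r iri; do
        # Each required IRI must appear somewhere in the query's results.
        grep -qF "$iri" "$out" || { echo "MISSING: $iri in $out" >&2; status=1; }
    done < "$expected"
done
[ "$status" -eq 0 ] && echo "all expected superclasses present"
```

Keeping the expected IRIs in plain files next to the queries would make the checks runnable on a CLI, in Travis, or in the pipeline alike.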

kltm commented 4 years ago

@dougli1sqrd ^ The next step is for the two of us to work out (with the help of @balhoff) how to satisfactorily wire this into the pipeline, as well as to try variations (e.g. TDB for Jena) to see how performant they are.

goodb commented 4 years ago

@kltm probably not a huge deal, but it does seem slightly better to find a way to run the tests on the same SPARQL server implementation as we will be using downstream. E.g., if we are using Blazegraph and a particular setup for the named graphs within it, it would be nice to run the tests on that rather than on TDB or directly on OWL. Chances of a problem in that respect do seem very low.. but still a consideration.

kltm commented 4 years ago

@goodb I totally agree; we should do both as we have need and capacity.

dougli1sqrd commented 4 years ago

@goodb @kltm where should these queries live? When Robot is run, it'll want to point at a directory of queries. Should we clone into a repository that has them, or also download them from a location from skyhook (and we would also have to place them there)?

balhoff commented 4 years ago

I thought we had ended up thinking this could be added to the ontology makefile, as a validation step that requires extensions/go-lego.owl. If that's true you can put them in the normal sparql folder there.

kltm commented 4 years ago

@goodb @dougli1sqrd If, for the moment, the queries are all NEO-related, they can live in neo. If for the overall ontology set, then go-ontology.

goodb commented 4 years ago

The queries are not just NEO; they span, e.g., cell, anatomy, and evidence as well.

goodb commented 4 years ago

@kltm when you can, could you give me a stable URL where code can expect to find a go-lego blazegraph journal? I can always build one locally from the go-lego.owl file, but I'd rather use the pipeline product when it's ready. This will help ensure everything is in sync and conveniently take the journal build step away from the Travis tests.

kltm commented 4 years ago

@goodb Until we have bolted it into the main pipeline, there is not a completely stable URL. The current location (http://skyhook.berkeleybop.org/issue-35-neo-test/products/blazegraph/) is good except for a few hours on Friday afternoons, or if a build fails. We could also make a temporary drop point until it goes into the main pipelines. You should then make a ticket, tied to the closure of this ticket, to undo that.