Closed cmungall closed 4 years ago
@cmungall what URLs do you recommend for ingesting RDF, causal tab, MI tab? Can't seem to find this on the GO site
cc @goodb @balhoff
@justaddcoffee I usually work with the GO-CAMs by cloning the noctua-models repo. There is also a download here (beta GO-CAM site): https://geneontology.cloud/home
I have a prototype ingest pipeline for Translator here: https://github.com/TranslatorIIPrototypes/cam-pipeline/blob/master/Makefile
It isn't really polished; that was to support our Translator proposal.
Some general design patterns for going from granular U shapes to direct edges (this may be too abstract at this stage):
https://docs.google.com/document/d/1kkwirzGCkdZdHBLDKbuXusDhfoC1Sl6WynzzwnFMLSI/edit
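To make the "U shape to direct edge" idea concrete, here is a minimal sketch. It assumes a U shape means: gene A enables activity 1, activity 1 is causally upstream of activity 2, and activity 2 is enabled by gene B, which collapses to a direct gene A to gene B edge. The predicate labels and identifiers below are hypothetical stand-ins (real GO-CAMs use RO IRIs), and this is not the mapping described in the linked doc.

```python
from collections import defaultdict

# Assumed, illustrative predicate labels; real GO-CAMs use RO IRIs.
ENABLED_BY = "enabled_by"
CAUSAL = "causally_upstream_of"


def collapse_u_shapes(triples):
    """Return direct (gene, predicate, gene) edges implied by U shapes.

    Assumed U shape:
      gene_a <-enabled_by- activity_1 -CAUSAL-> activity_2 -enabled_by-> gene_b
    """
    # Map each activity node to the gene products that enable it.
    enablers = defaultdict(set)
    for s, p, o in triples:
        if p == ENABLED_BY:
            enablers[s].add(o)

    # For every causal activity-to-activity edge, connect the enablers directly.
    direct = set()
    for s, p, o in triples:
        if p == CAUSAL:
            for gene_a in enablers.get(s, ()):
                for gene_b in enablers.get(o, ()):
                    direct.add((gene_a, p, gene_b))
    return sorted(direct)


# Toy model with made-up identifiers.
triples = [
    ("activity:1", "enabled_by", "gene:A"),
    ("activity:2", "enabled_by", "gene:B"),
    ("activity:1", "causally_upstream_of", "activity:2"),
    ("activity:2", "part_of", "process:X"),
]
print(collapse_u_shapes(triples))
```

Edges that are not part of a U shape (the `part_of` triple here) are simply ignored; a real transform would need a decision on what to do with them.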
@lpalbou @dustine32 we have causaltab->gocam, do we have the reverse?
@cmungall Sorry I don't know of anything coded yet for gocam->causaltab. https://github.com/geneontology/signor2gocam might eventually do that though I'm guessing you'll lose data like modification site.
As a first go at this, we could save this Noctua model: http://noctua.geneontology.org/editor/graph/gomodel:5e72450500004019, commit it to this repo, then ingest from there.
Do you all recommend we ingest OWL or GPAD?
@justaddcoffee I'm curious to know more about what you want to do with the go-cams. Sorry I haven't had time to follow the various threads.. What queries do you want to support that can leverage them? If I had a better sense of that I could probably help (and would like to).
I would strongly advise the use of the OWL files as the starting point for your ingest pipeline. In the worst case you can feed them into a single minerva command line command and get to the GPAD representation that way. Best case you can use them in their full expressivity to better meet your needs. I think it's worth having a look at @balhoff's makefile above as a starting point. I suspect you might want to do some SPARQL querying over the RDF stores generated by that makefile to produce a structure that will suit your purpose. That path (OWL/RDF->Reasoner->triplestore->query->triplestore||graphstore||rdbms||textfile||...) seems to me a good way to go about this. I would LOVE to see people start using native go-cams (not GPAD) as the starting point for using the knowledge in them. If you did, perhaps that could be the starting point for others to work from.
great, thanks @goodb !
I think our goal here is a richer representation of SARS-CoV-2 gene annotations to improve performance of ML - i.e., so that random walks we generate for ML traverse the nodes that best represent what these genes are really up to. Right now we have GO annotations and human interaction partners, but we can probably do a lot better than this with GO-CAM models.
Do you have time to chat for a minute today?
Thanks @balhoff ! It seems like maybe as a first pass I could do my own little transform from the OWL as Ben suggested, or maybe the TTL from here? https://geneontology.cloud/browse
Just to avoid confusion—these are one and the same.
@justaddcoffee keep in mind that the .cloud site only links to models from the 'master' branch so you will not have any human reactome models nor any of the work in progress imports from mouse and worm. To access these, you can clone the dev branch from noctua-models. I think it would be interesting to see if the work that @dustine32 did on converting the annotation extensions into GO-CAM edges would be useful for you. And I think there is quite a bit of knowledge you could get from the reactome models.
And yes, ttl is just one of many serializations of RDF which is one of the serializations of OWL...
Happy to help on this. @justaddcoffee - maybe we can set up some time to meet and discuss details on this?
Sure, need to have @goodb too. Might be easiest though to use this ticket to communicate asynchronously, or else we will descend into doodle poll hell...
@justaddcoffee @callahantiff what is the task at hand right now?
I think what we need is to identify a downloadable file for GO-CAM models that we can use to ingest, transform and put in our knowledge graph.
@justaddcoffee I usually work with the GO-CAMs by cloning the noctua-models repo. I have a prototype ingest pipeline for Translator here: https://github.com/TranslatorIIPrototypes/cam-pipeline/blob/master/Makefile
I think @balhoff says ^here that the way to do this would be to clone the noctua-models repo - I'm not clear, but I think we can then produce a file we can ingest.
With the master and dev branches of the noctua-models repo cloned, the first 16 lines of his makefile will produce a blazegraph journal with all the models you need in it. Potential next steps (unordered):
Okay, @callahantiff, does that sound doable? I'd say if @callahantiff can do that and produce a file, we can put this somewhere downloadable (e.g. berkeleybop.io) and ingest from there.
Eventually I guess we'd want a way to update this file automatically without these steps, but I think this would work just to get the GO-CAM models into kg-covid-19.
@justaddcoffee one question that will come up will be the graph structure to convert things to. One option we've discussed before was just to use the columns from GPAD. In that case there is already a minerva-cli command to do it. The sparql / java code for that conversion might be a good starting point in any case.
@cmungall has said on a few occasions that he has a pretty specific mapping to graph in mind for go-cams but I haven't seen it written down anywhere yet.
The steps above are not very compute heavy, I suspect starting the same process over again from updated files would be the easiest path to automation.
If there was a bottleneck we could probably make something that would work on one ttl file at a time and support parallel processing.
Sorry for being slow to respond. @justaddcoffee this sounds reasonable to me. I have a deadline Tuesday afternoon. Would it be OK to check-in with you Wednesday morning before I get started?
Sure @callahantiff, glad to meet on Wednesday whenever it's convenient
I've put @callahantiff and Bill's RDF/XML of the GO-CAM models up on a separate github repo: https://github.com/justaddcoffee/go-cam-models-kg-covid-19/raw/master/lifted-go-cams-20200619.xml.gz We can ingest from here, and also @callahantiff can push to this repo if this RDF/XML needs to be updated
Thanks @justaddcoffee, this is great!
For posterity, here is a copy of the SPARQL screenshot and updated counts for the current file mentioned above.
The RDF/XML should contain the lifted models for all production GO-CAMs and for the human Reactome GO-CAMs (downloaded on June 19th). A screenshot of the query is also attached.
Looks like there are 4,587 U-patterns identified from 1,057 unique models.
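For anyone who wants a quick check that the gzipped RDF/XML parses before ingesting it, here is a rough sketch using only the Python standard library. It does not reproduce the U-pattern counts above; it just streams the file and counts the direct children of the root element by tag. The function name is made up for illustration, and the sample document is a toy stand-in, not the structure of the real file.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_top_level_elements(path_or_fileobj):
    """Count direct children of the root element, by tag, in gzipped XML.

    Accepts a file path or a binary file object (gzip.open handles both).
    """
    counts = Counter()
    depth = 0
    with gzip.open(path_or_fileobj, "rb") as handle:
        for event, elem in ET.iterparse(handle, events=("start", "end")):
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 1:  # just closed a direct child of the root
                    counts[elem.tag] += 1
                    elem.clear()  # drop its content to keep memory flat
    return counts


# Toy RDF/XML standing in for the real download.
sample = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:NamedIndividual rdf:about="http://example.org/i1"/>
  <owl:NamedIndividual rdf:about="http://example.org/i2">
    <rdf:type rdf:resource="http://example.org/SomeClass"/>
  </owl:NamedIndividual>
  <owl:Ontology rdf:about="http://example.org/model1"/>
</rdf:RDF>"""

counts = count_top_level_elements(io.BytesIO(gzip.compress(sample)))
for tag, n in counts.most_common():
    print(n, tag)
```

For the real file, download it first and pass the local path instead of the in-memory sample. For anything beyond a parse check, a proper RDF library or the triplestore route discussed earlier is the better tool.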
GO-CAM data are live in the latest KG-COVID-19 build!
Awesome! Did everything port OK?
Yes, seems to have, but some exploratory querying might be in order
@callahantiff Yes, I hope so. Would be great to have your feedback on the final transformed files.
@justaddcoffee What was the URL to access the transformed nodes.tsv and edges.tsv from a build on Jenkins?
They should be at:
http://kg-hub.berkeleybop.io/transformed/[ingest name]/[nodes|edges].tsv
@justaddcoffee Thanks! 👍
@callahantiff, FYI http://kg-hub.berkeleybop.io/transformed/GOCAMs/GOCAMs_nodes.tsv http://kg-hub.berkeleybop.io/transformed/GOCAMs/GOCAMs_edges.tsv
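For the exploratory querying mentioned earlier, a small sketch along these lines might be a starting point. The URL helper follows the pattern of the two GOCAMs links above (which carry an ingest-name prefix on the file name). The column name `edge_label` and the sample rows are assumptions for illustration; check the header of the actual edges file before relying on them.

```python
import csv
import io
from collections import Counter

BASE = "http://kg-hub.berkeleybop.io/transformed"


def transformed_url(ingest, kind):
    """Build a URL following the pattern of the GOCAMs links above."""
    if kind not in ("nodes", "edges"):
        raise ValueError("kind must be 'nodes' or 'edges'")
    return f"{BASE}/{ingest}/{ingest}_{kind}.tsv"


def count_by_column(tsv_text, column):
    """Count rows of a tab-separated table by the value in one column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up rows; the header names are assumed, not taken from the real file.
sample_edges = (
    "subject\tedge_label\tobject\n"
    "gene:A\tcausally_upstream_of\tgene:B\n"
    "gene:B\tcausally_upstream_of\tgene:C\n"
    "gene:A\tenables\tactivity:1\n"
)

print(transformed_url("GOCAMs", "edges"))
print(count_by_column(sample_edges, "edge_label").most_common())
```

To run this against the real build, fetch the text at `transformed_url("GOCAMs", "edges")` and pass it to `count_by_column` with whichever predicate column the file actually uses.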
Did it help improve any predictions?
No data on this yet Ben - we'll keep you posted!
Looking at the files they seem great to me. I really like viewing them in this representation. Thanks for doing this!
If Reactome imports ended up being used here, this is relevant and worth an update: https://github.com/geneontology/pathways2GO/issues/104
See also #132 #104
could be brought in different ways; experiment with how this affects connectivity, ML inference