Knowledge-Graph-Hub / kg-covid-19

An instance of KG Hub to produce a knowledge graph for COVID-19 response.
https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki
BSD 3-Clause "New" or "Revised" License

Ingest SARS-CoV-2 GO-CAMs #153

Closed cmungall closed 4 years ago

cmungall commented 4 years ago

See also #132 #104

These could be brought in in different ways; we should experiment with how each approach affects connectivity and ML inference.

justaddcoffee commented 4 years ago

@cmungall what URLs do you recommend for ingesting RDF, causal tab, MI tab? Can't seem to find this on the GO site

cmungall commented 4 years ago

cc @goodb @balhoff

balhoff commented 4 years ago

@justaddcoffee I usually work with the GO-CAMs by cloning the noctua-models repo. There is also a download here (beta GO-CAM site): https://geneontology.cloud/home

I have a prototype ingest pipeline for Translator here: https://github.com/TranslatorIIPrototypes/cam-pipeline/blob/master/Makefile

It isn't really polished; that was to support our Translator proposal.

cmungall commented 4 years ago

Some general design patterns for going from granular U shapes to direct edges (this may be too abstract at this stage):

https://docs.google.com/document/d/1kkwirzGCkdZdHBLDKbuXusDhfoC1Sl6WynzzwnFMLSI/edit
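A rough sketch of one way to read the "U shape to direct edge" idea, in Python. It assumes the U is: gene product A enables activity 1, activity 1 is causally linked to activity 2, and activity 2 is enabled by gene product B, so the collapsed edge is A to B with the causal predicate. The predicate strings, node IDs, and the `collapse_u_shapes` function are all hypothetical and are not taken from the linked document.

```python
from collections import defaultdict

ENABLED_BY = "enabled_by"  # assumed label for the activity -> gene product link


def collapse_u_shapes(triples, causal_predicates):
    """Lift activity -causal-> activity edges to gene product -causal-> gene product edges."""
    # Map each activity to the gene products that enable it.
    enablers = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == ENABLED_BY:
            enablers[subject].add(obj)

    # For every causal edge between two activities, connect their enablers directly.
    direct_edges = set()
    for subject, predicate, obj in triples:
        if predicate in causal_predicates:
            for upstream in enablers.get(subject, ()):
                for downstream in enablers.get(obj, ()):
                    direct_edges.add((upstream, predicate, downstream))
    return sorted(direct_edges)


# Made-up example data, for illustration only.
triples = [
    ("activity:1", "enabled_by", "gene:A"),
    ("activity:2", "enabled_by", "gene:B"),
    ("activity:1", "directly_positively_regulates", "activity:2"),
]
direct = collapse_u_shapes(triples, {"directly_positively_regulates"})
print(direct)
```

Activities without an enabler simply produce no direct edge here; whether that is the right behaviour depends on the mapping that is eventually chosen.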

cmungall commented 4 years ago

@lpalbou @dustine32 we have causaltab->gocam, do we have the reverse?

dustine32 commented 4 years ago

@cmungall Sorry I don't know of anything coded yet for gocam->causaltab. https://github.com/geneontology/signor2gocam might eventually do that though I'm guessing you'll lose data like modification site.

justaddcoffee commented 4 years ago

As a first go at this, we could save this Noctua model (http://noctua.geneontology.org/editor/graph/gomodel:5e72450500004019), commit it to this repo, then ingest from there.

Do you all recommend we ingest OWL or GPAD?

goodb commented 4 years ago

@justaddcoffee I'm curious to know more about what you want to do with the go-cams. Sorry I haven't had time to follow the various threads. What queries do you want to support that can leverage them? If I had a better sense of that I could probably help (and would like to).

I would strongly advise the use of the OWL files as the starting point for your ingest pipeline. In the worst case you can feed them into a single minerva command-line command and get to the GPAD representation that way. Best case you can use them in their full expressivity to better meet your needs. I think it's worth having a look at @balhoff's makefile above as a starting point. I suspect you might want to do some SPARQL querying over the RDF stores generated by that makefile to produce a structure that will suit your purpose. That path (OWL/RDF->Reasoner->triplestore->query->triplestore||graphstore||rdbms||textfile||...) seems to me a good way to go about this. I would LOVE to see people start using native go-cams (not GPAD) as the starting point for using the knowledge in them. If you did, perhaps that could be the starting point for others to work from.

justaddcoffee commented 4 years ago

great, thanks @goodb !

What queries do you want to support that can leverage them? If I had a better sense of that I could probably help (and would like to).

I think our goal here is a richer representation of SARS-CoV-2 gene annotations to improve performance of ML - i.e., so that random walks we generate for ML traverse the nodes that best represent what these genes are really up to. Right now we have GO annotations and human interaction partners, but we can probably do a lot better than this with GO-CAM models.
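To make the random-walk point concrete, here is a minimal sketch of an unbiased walk over an edge list. It is not the walk generator used for KG-COVID-19; the node IDs, the undirected treatment of edges, and the function names are assumptions for illustration.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency list from (subject, object) pairs."""
    adjacency = defaultdict(list)
    for subject, obj in edges:
        adjacency[subject].append(obj)
        adjacency[obj].append(subject)
    return adjacency


def random_walk(adjacency, start, length, rng):
    """Return a walk of up to `length` nodes, stopping early at a dead end."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency.get(walk[-1])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


# Made-up edges: a gene with a GO annotation and an interaction partner.
edges = [("gene:A", "GO:1"), ("gene:A", "gene:B"), ("gene:B", "GO:2")]
adj = build_adjacency(edges)
walk = random_walk(adj, "gene:A", 5, random.Random(0))
print(walk)
```

The more edges a gene node has to nodes that describe its actual activity, the more often walks starting from it pass through those nodes, which is the motivation for adding GO-CAM edges.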

Do you have time to chat for a minute today?

justaddcoffee commented 4 years ago

@justaddcoffee I usually work with the GO-CAMs by cloning the noctua-models repo. There is also a download here (beta GO-CAM site): https://geneontology.cloud/home

I have a prototype ingest pipeline for Translator here: https://github.com/TranslatorIIPrototypes/cam-pipeline/blob/master/Makefile

Thanks @balhoff ! It seems like maybe as a first pass I could do my own little transform from the OWL as Ben suggested, or maybe the TTL from here? https://geneontology.cloud/browse

balhoff commented 4 years ago

from the OWL as Ben suggested, or maybe the TTL from here

Just to avoid confusion—these are one and the same.

goodb commented 4 years ago

@justaddcoffee keep in mind that the .cloud site only links to models from the 'master' branch so you will not have any human Reactome models nor any of the work-in-progress imports from mouse and worm. To access these, you can clone the dev branch from noctua-models. I think it would be interesting to see if the work that @dustine32 did on converting the annotation extensions into GO-CAM edges would be useful for you. And I think there is quite a bit of knowledge you could get from the Reactome models.

And yes, ttl is just one of many serializations of RDF which is one of the serializations of OWL...

callahantiff commented 4 years ago

Happy to help on this. @justaddcoffee - maybe we can set up some time to meet and discuss details on this?

justaddcoffee commented 4 years ago

Sure, need to have @goodb too. Might be easiest though to use this ticket to communicate asynchronously, or else we will descend into doodle poll hell...

goodb commented 4 years ago

@justaddcoffee @callahantiff what is the task at hand right now?

justaddcoffee commented 4 years ago

I think what we need is to identify a downloadable file for GO-CAM models that we can use to ingest, transform and put in our knowledge graph.

@justaddcoffee I usually work with the GO-CAMs by cloning the noctua-models repo. I have a prototype ingest pipeline for Translator here: https://github.com/TranslatorIIPrototypes/cam-pipeline/blob/master/Makefile

I think @balhoff says ^here that the way to do this would be to clone the noctua-models repo - I'm not clear, but I think we can then produce a file we can ingest.

goodb commented 4 years ago

With the master and dev branches of the noctua-models repo cloned, the first 16 lines of his makefile will produce a blazegraph journal with all the models you need in it. Potential next steps (unordered):

justaddcoffee commented 4 years ago

Okay, @callahantiff, does that sound doable? I'd say if @callahantiff can do that and produce a file, we can put this somewhere downloadable (e.g. berkeleybop.io) and ingest from there.

Eventually I guess we'd want a way to update this file automatically without these steps, but I think this would work just to get the GO-CAM models into kg-covid-19

goodb commented 4 years ago

@justaddcoffee one question that will come up will be the graph structure to convert things to. One option we've discussed before was just to use the columns from GPAD. In that case there is already a minerva-cli command to do it. The sparql / java code for that conversion might be a good starting point in any case.

@cmungall has said on a few occasions that he has a pretty specific mapping to graph in mind for go-cams but I haven't seen it written down anywhere yet.

Eventually I guess we'd want a way to update this file automatically without these steps, but I think this would work just to get the GO-CAM models into kg-covid-19

The steps above are not very compute heavy, I suspect starting the same process over again from updated files would be the easiest path to automation.

If there was a bottleneck we could probably make something that would work on one ttl file at a time and support parallel processing.
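A minimal sketch of the one-file-at-a-time idea, assuming the models sit in a directory of `.ttl` files. The per-file step here (`summarize_model`) is a placeholder that only counts non-comment lines; a real transform would parse the RDF. Threads are used for brevity, and CPU-bound work would more likely want `ProcessPoolExecutor`.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def summarize_model(path):
    """Placeholder per-model step: count non-blank, non-comment lines."""
    text = Path(path).read_text(encoding="utf-8")
    count = sum(
        1
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
    return Path(path).name, count


def process_models(model_dir, workers=4):
    """Run the per-model step over every .ttl file in model_dir concurrently."""
    paths = sorted(Path(model_dir).glob("*.ttl"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(summarize_model, paths))


# Demo on two tiny made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.ttl").write_text(
        "# comment\n<s> <p> <o> .\n<s> <p> <o2> .\n", encoding="utf-8"
    )
    Path(tmp, "b.ttl").write_text("<x> <y> <z> .\n", encoding="utf-8")
    counts = process_models(tmp)
print(counts)
```

Because each file is handled independently, rerunning on updated files only needs the same call again.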

callahantiff commented 4 years ago

Okay, @callahantiff, does that sound doable? I'd say if @callahantiff can do that and produce a file, we can put this somewhere downloadable (e.g. berkeleybop.io) and ingest from there.

Eventually I guess we'd want a way to update this file automatically without these steps, but I think this would work just to get the GO-CAM models into kg-covid-19

Sorry for being slow to respond. @justaddcoffee this sounds reasonable to me. I have a deadline Tuesday afternoon. Would it be OK to check-in with you Wednesday morning before I get started?

justaddcoffee commented 4 years ago

Sure @callahantiff, glad to meet on Wednesday whenever it's convenient

goodb commented 4 years ago

noting this. https://docs.google.com/document/d/1kkwirzGCkdZdHBLDKbuXusDhfoC1Sl6WynzzwnFMLSI/edit

justaddcoffee commented 4 years ago

I've put @callahantiff and Bill's RDF/XML of the GO-CAM models up on a separate github repo: https://github.com/justaddcoffee/go-cam-models-kg-covid-19/raw/master/lifted-go-cams-20200619.xml.gz We can ingest from here, and also @callahantiff can push to this repo if this RDF/XML needs to be updated
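For a quick look inside a gzipped RDF/XML file like this one without an RDF library, something along these lines should work with the Python standard library alone. It only collects `rdf:about` IRIs and is not a substitute for a real RDF parser; the sample document and the function name are made up, and nothing here reflects the actual contents of the lifted GO-CAM file.

```python
import gzip
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def subjects_in_rdfxml(gz_bytes):
    """Return the set of rdf:about IRIs found in a gzipped RDF/XML document."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    about = f"{{{RDF_NS}}}about"
    return {el.attrib[about] for el in root.iter() if about in el.attrib}


# Tiny made-up RDF/XML document standing in for the downloaded file.
sample = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <rdf:Description rdf:about="http://example.org/a">
    <ex:linksTo rdf:resource="http://example.org/b"/>
  </rdf:Description>
</rdf:RDF>"""
subjects = subjects_in_rdfxml(gzip.compress(sample))
print(subjects)
```

For the real file, the bytes could be read from a local download and passed to the same function.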

callahantiff commented 4 years ago

I've put @callahantiff and Bill's RDF/XML of the GO-CAM models up on a separate github repo: https://github.com/justaddcoffee/go-cam-models-kg-covid-19/raw/master/lifted-go-cams-20200619.xml.gz We can ingest from here, and also @callahantiff can push to this repo if this RDF/XML needs to be updated

Thanks @justaddcoffee, this is great!

callahantiff commented 4 years ago

For posterity, here is a copy of the SPARQL screenshot and updated counts for the current file mentioned above.

The RDF/XML should contain the lifted models for all production GO-CAMs and for the human Reactome GO-CAMs (downloaded on June 19th). A screenshot of the query is also attached.

Looks like there are 4,587 U-patterns identified from 1,057 unique models.

[Screenshot of the SPARQL query, 2020-07-02]

justaddcoffee commented 4 years ago

GO-CAM data are live in the latest KG-COVID-19 build!

callahantiff commented 4 years ago

Awesome! Did everything port OK?

justaddcoffee commented 4 years ago

Yes, seems to have, but some exploratory querying might be in order

deepakunni3 commented 4 years ago

@callahantiff Yes, I hope so. Would be great to have your feedback on the final transformed files.

@justaddcoffee What was the URL to access the transformed nodes.tsv and edges.tsv from a build on Jenkins?

justaddcoffee commented 4 years ago

@justaddcoffee What was the URL to access the transformed nodes.tsv and edges.tsv from a build on Jenkins?

They should be at: http://kg-hub.berkeleybop.io/transformed/[ingest name]/[nodes|edges].tsv

deepakunni3 commented 4 years ago

@justaddcoffee Thanks! 👍

@callahantiff, FYI http://kg-hub.berkeleybop.io/transformed/GOCAMs/GOCAMs_nodes.tsv http://kg-hub.berkeleybop.io/transformed/GOCAMs/GOCAMs_edges.tsv
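A small sketch for building these URLs and reading the TSVs. The `{ingest}/{ingest}_{kind}.tsv` naming is inferred from the two GOCAMs URLs above and may not hold for other ingests; the column names in the sample are invented, so check the real file header before relying on them.

```python
import csv
import io

BASE_URL = "http://kg-hub.berkeleybop.io/transformed"


def transformed_url(ingest, kind):
    """Build the URL of a transformed file; kind must be 'nodes' or 'edges'."""
    if kind not in ("nodes", "edges"):
        raise ValueError(f"kind must be 'nodes' or 'edges', got {kind!r}")
    return f"{BASE_URL}/{ingest}/{ingest}_{kind}.tsv"


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


edges_url = transformed_url("GOCAMs", "edges")

# Made-up sample content; the real columns may differ.
sample = "subject\tpredicate\tobject\ngene:A\trelated_to\tgene:B\n"
rows = read_tsv(sample)
print(edges_url, rows)
```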

goodb commented 4 years ago

Did it help improve any predictions ?

justaddcoffee commented 4 years ago

No data on this yet Ben - we'll keep you posted!

callahantiff commented 4 years ago

http://kg-hub.berkeleybop.io/transformed/GOCAMs/GOCAMs_edges.tsv

Looking at the files they seem great to me. I really like viewing them in this representation. Thanks for doing this!

goodb commented 4 years ago

If Reactome imports ended up being used here, this is relevant and worth an update: https://github.com/geneontology/pathways2GO/issues/104