Open dustine32 opened 2 years ago
From @nmarkari, add another column for organism. Hopefully NCBITaxon ID will do.
Still working on this table query. @balhoff just explained that I can isolate all gene product classes in a model by looking for the category tags:
<geneID> <vocab:category> <https://w3id.org/biolink/vocab/GeneProduct>
<geneID> <vocab:category> <https://w3id.org/biolink/vocab/MacromolecularMachine>
However, these are not loaded into the RDF endpoints right now because NEO is not currently loaded. In the short-term, I will use a local blazegraph-runner instance with NEO and the GO-CAM blazegraph-production.jnl to run my query and generate the table.
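For reference, the category pattern above could be written as a standalone query along these lines. This is only a minimal sketch, not the actual gocams2genes.rq: the file name gene_products_sketch.rq is made up, the `vocab:` prefix expansion is inferred from the class IRIs above, and it assumes either category tag is enough to count as a gene product.

```shell
# Write a hypothetical query file (NOT the gocams2genes.rq from the gist).
# Assumption: vocab: expands to the biolink vocab namespace seen in the class IRIs.
cat > gene_products_sketch.rq <<'EOF'
PREFIX vocab: <https://w3id.org/biolink/vocab/>

SELECT DISTINCT ?gene
WHERE {
  # Accept either of the two category tags mentioned above.
  VALUES ?category { vocab:GeneProduct vocab:MacromolecularMachine }
  ?gene vocab:category ?category .
}
EOF

# Show what was written.
cat gene_products_sketch.rq
```

It would only return rows against a journal that has NEO loaded, and could be run with the same blazegraph-runner select invocation used in the SOP below.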
@nmarkari Attached is the TSV I promised too long ago: gocams2genes_20220726.txt
Due to the current limitation with the RDF endpoints mentioned above (missing vocab:category tags), here's the SOP for generating this GO-CAM -> gene product TSV using blazegraph-runner, blazegraph-production.jnl, and NEO:
# First download blazegraph-runner
wget https://github.com/balhoff/blazegraph-runner/releases/download/v1.6.5/blazegraph-runner-1.6.5.tgz
tar -zxf blazegraph-runner-1.6.5.tgz
# Fetch the NEO ontology and the production journal
wget http://skyhook.berkeleybop.org/issue-35-neo-test/ontology/neo.owl
wget http://current.geneontology.org/products/blazegraph/blazegraph-production.jnl.gz
gunzip blazegraph-production.jnl.gz
# Load in NEO
mv blazegraph-production.jnl blazegraph-production-neo.jnl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-production-neo.jnl --informat=rdfxml neo.owl
# Query
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner select --journal=blazegraph-production-neo.jnl --outformat=tsv gocams2genes.rq gocams2genes.tsv
The gocams2genes.rq query is saved in a public gist for now.
Of course, since this is the production journal being queried, it won't have the Reactome models because they're still modelstate=development. To access those, just swap in blazegraph-internal.jnl. We also still need a query specific for Reactome or all metabolic pathways.
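Swapping journals only changes the --journal argument of the select step. A small sketch of that, which just prints the command rather than running it; the helper name select_cmd and the file name blazegraph-internal-neo.jnl are hypothetical, and NEO would need to be loaded into whichever journal is used:

```shell
# Print (do not run) the select command for a given journal file.
# The flags are the ones already used in the SOP; only the journal varies.
select_cmd() {
  journal="$1"
  printf 'JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner select --journal=%s --outformat=tsv gocams2genes.rq gocams2genes.tsv\n' "$journal"
}

select_cmd blazegraph-production-neo.jnl
select_cmd blazegraph-internal-neo.jnl   # hypothetical name for the internal journal with NEO loaded
```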
@nmarkari Don't worry about having to run the whole SOP above for now. I can just provide the output file to you. Let me know if you spot anything off about the attached gocams2genes_20220726.txt file (e.g. models missing, genes missing).
@nmarkari I updated the query to aggregate multiple taxa into the same line, which prevents multiple lines per model. Attached new results file: gocams2genes_20220728.txt
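The actual change was made inside the query in the gist, which isn't reproduced here. To illustrate the effect, the same collapsing can be sketched as a post-processing step with awk. The two-column layout (model ID, taxon ID), the file names, and the rows themselves are made up for the example:

```shell
# Made-up rows for illustration only: model ID <TAB> taxon ID.
printf 'gomodel:ex1\tNCBITaxon:9606\n'   > rows.tsv
printf 'gomodel:ex1\tNCBITaxon:10090\n' >> rows.tsv
printf 'gomodel:ex2\tNCBITaxon:9606\n'  >> rows.tsv

# Collapse to one line per model, joining taxa with commas.
# Models are printed in order of first appearance; repeated taxa are not deduplicated.
awk -F'\t' -v OFS='\t' '
  !($1 in taxa) { order[++n] = $1; taxa[$1] = $2; next }
  { taxa[$1] = taxa[$1] "," $2 }
  END { for (i = 1; i <= n; i++) print order[i], taxa[order[i]] }
' rows.tsv > aggregated.tsv

cat aggregated.tsv
```

Doing the aggregation in the query itself is usually preferable, since it keeps the TSV consistent without an extra step.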
@nmarkari I started work on a separate query for metabolic pathway GO-CAMs here. It looks for models containing:
REACTO:molecular_event
[some class] is not currently constrained to any closure (e.g. "small molecule"), nor does it need to be the same instance (just the same class).
My working example is http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-997272.
There is currently no label column for the models' genes (and confusingly I put the gene IDs in the ?controllers column) because there's more work needed to dig out the UniProt IDs from reacto.owl. To be clear, I used the blazegraph-internal.jnl file along with loading neo.owl and reacto.owl:
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-internal-neo-reacto.jnl --informat=rdfxml neo.owl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-internal-neo-reacto.jnl --informat=rdfxml reacto.owl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner select --journal=blazegraph-internal-neo-reacto.jnl --outformat=tsv metabolic_gocams2genes.rq metabolic_gocams2genes.tsv
Attached is the results list: metabolic_gocams2genes_20220728.txt. This fetches 1100 models: 1083 Reactome and 17 manually curated.
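A breakdown like this can be checked from the results file with a quick awk pass. This is a rough sketch only: it assumes the model ID is in the first column and that Reactome-derived models can be recognized by "R-HSA-" in their ID (as in the working example). The sample file and its last two rows are made up, and a header row, if present, would need to be skipped first.

```shell
# Sample input: first row is the working example from this thread, the rest are made up.
printf 'gomodel:R-HSA-997272\n'      > models_sample.tsv
printf 'gomodel:R-HSA-0000000\n'    >> models_sample.tsv
printf 'gomodel:0123456789abcdef\n' >> models_sample.tsv
printf 'gomodel:R-HSA-997272\n'     >> models_sample.tsv   # repeated model, counted once

# Count distinct models, split by whether the ID looks Reactome-derived.
awk -F'\t' '
  seen[$1]++ { next }                      # skip model IDs already counted
  $1 ~ /R-HSA-/ { reactome++; next }
  { other++ }
  END { printf "total=%d reactome=%d other=%d\n", reactome + other, reactome, other }
' models_sample.tsv > model_counts.txt

cat model_counts.txt
```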
Working with @nmarkari, a basic table of GO-CAM to gene data should be generated as input to the enrichment test.
Some detail from @nmarkari:
This table can be produced via a query to one of the GO SPARQL endpoints.
Tagging @vanaukenk @cmungall @balhoff @kltm