Produce TSV of GO-CAM gene data for enrichment test

dustine32 commented 2 years ago

Working with @nmarkari, we need to generate a basic table of GO-CAM-to-gene data as input to an enrichment test.

Some detail from @nmarkari:

I'd like a table that, for all causal models in production state or from reactome2go, has the following columns: GO-CAM ID, GO-CAM title, gene product identifier, gene product name. I'm using the word "gene" loosely here to mean the object of the "enabled by" relation, which I understand could include protein complexes or gene products. Essentially, I'd want to add a column to this table to include the gene names, and I don't need the name of the node in the GO-CAM (prod1). I am not sure whether a gene can have more than one identifier; if so, I am hoping to resolve those cases by using some sort of internal ID or a gene name.

This table can be produced via a query to one of the GO SPARQL endpoints.

Tagging @vanaukenk @cmungall @balhoff @kltm

dustine32 commented 2 years ago

Per @nmarkari, add another column for organism. Hopefully an NCBITaxon ID will do.
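
As a rough illustration of how the requested table could feed an enrichment test, the sketch below groups rows into one gene set per model. The column names (`model_id`, `model_title`, `gene_id`, `gene_name`, `taxon`) and the sample identifiers are assumptions made for this example, not the header or contents of any actual output file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; column names and identifiers are illustrative only.
SAMPLE_TSV = (
    "model_id\tmodel_title\tgene_id\tgene_name\ttaxon\n"
    "gomodel:example1\tExample model A\tEX:gene1\tGENE1\tNCBITaxon:9606\n"
    "gomodel:example1\tExample model A\tEX:gene2\tGENE2\tNCBITaxon:9606\n"
    "gomodel:example2\tExample model B\tEX:gene1\tGENE1\tNCBITaxon:9606\n"
)


def gene_sets_by_model(tsv_text):
    """Map each model ID to the set of gene product IDs listed for it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    gene_sets = defaultdict(set)
    for row in reader:
        gene_sets[row["model_id"]].add(row["gene_id"])
    return dict(gene_sets)


gene_sets = gene_sets_by_model(SAMPLE_TSV)
```

Keying on the identifier column rather than the name column sidesteps name collisions, but it does not by itself resolve the case of one gene carrying several identifiers.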

dustine32 commented 1 year ago

Still working on this table query. @balhoff just explained that I can isolate all gene product classes in a model by looking for the category tags:

<geneID> <vocab:category> <https://w3id.org/biolink/vocab/GeneProduct>
<geneID> <vocab:category> <https://w3id.org/biolink/vocab/MacromolecularMachine>

However, these category triples are not available from the RDF endpoints right now because NEO is not currently loaded there. In the short term, I will use a local blazegraph-runner instance with NEO and the GO-CAM blazegraph-production.jnl to run my query and generate the table.
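
To make the category filter concrete, here is a minimal sketch over plain (subject, predicate, object) tuples rather than a real triple store. The expansion of `vocab:category` to `https://w3id.org/biolink/vocab/category` is an assumption, and the sample IRIs are hypothetical.

```python
BIOLINK = "https://w3id.org/biolink/vocab/"
# Assumed expansion of the vocab:category predicate; not confirmed by the thread.
CATEGORY = BIOLINK + "category"
GENE_PRODUCT_CATEGORIES = {
    BIOLINK + "GeneProduct",
    BIOLINK + "MacromolecularMachine",
}


def gene_product_classes(triples):
    """Return subjects tagged with one of the gene product categories."""
    return {
        subject
        for subject, predicate, obj in triples
        if predicate == CATEGORY and obj in GENE_PRODUCT_CATEGORIES
    }


# Hypothetical triples for illustration only.
example_triples = [
    ("http://example.org/geneA", CATEGORY, BIOLINK + "GeneProduct"),
    ("http://example.org/complexB", CATEGORY, BIOLINK + "MacromolecularMachine"),
    ("http://example.org/termC", CATEGORY, BIOLINK + "SomethingElse"),
    ("http://example.org/geneA", "http://example.org/otherPredicate", "x"),
]
selected = gene_product_classes(example_triples)
```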

dustine32 commented 1 year ago

@nmarkari Attached is the TSV I promised too long ago: gocams2genes_20220726.txt

Due to the current limitation with the RDF endpoints mentioned above (missing vocab:category tags), here's the SOP for generating this GO-CAM -> gene product TSV using blazegraph-runner, blazegraph-production.jnl, and NEO:

# First download and unpack blazegraph-runner
wget https://github.com/balhoff/blazegraph-runner/releases/download/v1.6.5/blazegraph-runner-1.6.5.tgz
tar -zxf blazegraph-runner-1.6.5.tgz
# The commands below assume ./bin/blazegraph-runner resolves from the current directory; adjust the path if needed

# Fetch the NEO ontology and the production journal
wget http://skyhook.berkeleybop.org/issue-35-neo-test/ontology/neo.owl
wget http://current.geneontology.org/products/blazegraph/blazegraph-production.jnl.gz
gunzip blazegraph-production.jnl.gz

# Rename the journal, then load NEO into it
mv blazegraph-production.jnl blazegraph-production-neo.jnl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-production-neo.jnl --informat=rdfxml neo.owl

# Run the query and write the results as TSV
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner select --journal=blazegraph-production-neo.jnl --outformat=tsv gocams2genes.rq gocams2genes.tsv

The gocams2genes.rq query is saved in a public gist for now.

Of course, since this queries the production journal, the results won't include the Reactome models, which are still modelstate=development. To access those, just swap in blazegraph-internal.jnl. We also still need a query specific to Reactome or to all metabolic pathways.

@nmarkari Don't worry about having to run the whole SOP above for now. I can just provide the output file to you. Let me know if you spot anything off about the attached gocams2genes_20220726.txt file (e.g. models missing, genes missing).

dustine32 commented 1 year ago

@nmarkari I updated the query to aggregate multiple taxa into the same line, which prevents multiple lines per model. New results file attached: gocams2genes_20220728.txt
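
The aggregation itself happens in the SPARQL query, which is not reproduced here. As a sketch of the same idea applied to already-exported rows, the function below collapses rows that differ only in one column and joins that column's values with a separator. The `|` separator and the column names are assumptions for illustration.

```python
def aggregate_column(rows, column, sep="|"):
    """Merge rows that are identical except for `column`, joining its values."""
    grouped = {}
    for row in rows:
        # Everything except the aggregated column identifies the group.
        key = tuple((k, v) for k, v in row.items() if k != column)
        values = grouped.setdefault(key, [])
        if row[column] not in values:  # keep first-seen order, drop repeats
            values.append(row[column])
    merged_rows = []
    for key, values in grouped.items():
        merged = dict(key)
        merged[column] = sep.join(values)
        merged_rows.append(merged)
    return merged_rows


# Hypothetical rows: one model reported once per taxon.
example_rows = [
    {"model_id": "gomodel:example1", "taxon": "NCBITaxon:9606"},
    {"model_id": "gomodel:example1", "taxon": "NCBITaxon:10090"},
    {"model_id": "gomodel:example2", "taxon": "NCBITaxon:9606"},
]
collapsed = aggregate_column(example_rows, "taxon")
```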

dustine32 commented 1 year ago

@nmarkari I started work on a separate query for metabolic pathway GO-CAMs here. It looks for models containing:

Two activities, each either an MF descendant or a REACTO:molecular_event
A causal relation connecting the two activities (e.g. "directly provides input for")
Activity1 -has_output-> [some class] <-has_input- Activity2
The [some class] is not currently constrained to any closure (e.g. "small molecule"), and the two activities need not reference the same instance, only the same class
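
The pattern above can be sketched over a small in-memory structure. This is not the SPARQL query itself: it assumes the activities have already been filtered to MF descendants or molecular events, that each causal edge points from Activity1 to Activity2, and that inputs and outputs are recorded as class identifiers. All names and identifiers are hypothetical.

```python
def find_metabolic_pairs(activities, causal_edges):
    """Find causally linked activity pairs that share an output/input class.

    activities: dict of activity ID -> {"inputs": set, "outputs": set},
                where the sets hold class identifiers (not instances).
    causal_edges: iterable of (upstream ID, relation label, downstream ID).
    """
    matches = []
    for upstream, _relation, downstream in causal_edges:
        # Activity1 -has_output-> [class] <-has_input- Activity2
        shared = activities[upstream]["outputs"] & activities[downstream]["inputs"]
        if shared:
            matches.append((upstream, downstream, sorted(shared)))
    return matches


# Hypothetical model fragment for illustration only.
example_activities = {
    "act1": {"inputs": set(), "outputs": {"EX:moleculeA"}},
    "act2": {"inputs": {"EX:moleculeA"}, "outputs": {"EX:moleculeB"}},
    "act3": {"inputs": {"EX:moleculeC"}, "outputs": set()},
}
example_edges = [
    ("act1", "directly provides input for", "act2"),
    ("act1", "directly provides input for", "act3"),
]
pairs = find_metabolic_pairs(example_activities, example_edges)
```

Matching on class identifiers rather than instance identifiers mirrors the "same class, not necessarily the same instance" condition.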

My working example is http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-997272.

There is currently no label column for the models' genes (and, confusingly, I put the gene IDs in the ?controllers column) because more work is needed to dig the UniProt IDs out of reacto.owl. To be clear, I used blazegraph-internal.jnl with neo.owl and reacto.owl loaded into it:

# Assumes blazegraph-internal.jnl was first renamed (or copied) to blazegraph-internal-neo-reacto.jnl, as in the SOP above
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-internal-neo-reacto.jnl --informat=rdfxml neo.owl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-internal-neo-reacto.jnl --informat=rdfxml reacto.owl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner select --journal=blazegraph-internal-neo-reacto.jnl --outformat=tsv metabolic_gocams2genes.rq metabolic_gocams2genes.tsv

Attached is the results list: metabolic_gocams2genes_20220728.txt. The query fetches 1100 models: 1083 Reactome and 17 manually curated.

geneontology / go-site

Produce TSV of GO-CAM gene data for enrichment test #1884