Produce TSV of GO-CAM gene data for enrichment test #1884

dustine32 commented 2 years ago

Working with @nmarkari, a basic table of GO-CAM to gene data should be generated as an input to enrichment test.

Some detail from @nmarkari:

I'd like a table that, for all causal models in production state or from reactome2go, has the following columns: GO-CAM ID, GO-CAM title, gene product identifier, gene product name. I'm using the word "gene" loosely here to mean the object of the "enabled by" relation, which I understand could include protein complexes or gene products. Essentially, I'd want to add a column to this table to include the gene names, and I don't need the name of the node in the GO-CAM (prod1).

[image]

I am not sure if a gene could have more than one identifier; if so, I am hoping to resolve any such cases by using some sort of internal ID or a gene name.

This table can be produced via a query to one of the GO SPARQL endpoints.

Tagging @vanaukenk @cmungall @balhoff @kltm

dustine32 commented 2 years ago

From @nmarkari, add another column for organism. Hopefully NCBITaxon ID will do.

dustine32 commented 1 year ago

Still working on this table query. @balhoff just explained that I can isolate all gene product classes in a model by looking for the category tags:

<geneID> <vocab:category> <>
<geneID> <vocab:category> <>

However, these are not loaded into the RDF endpoints right now because NEO is not currently loaded. In the short-term, I will use a local blazegraph-runner instance with NEO and the GO-CAM blazegraph-production.jnl to run my query and generate the table.

dustine32 commented 1 year ago

@nmarkari Attached is the TSV I promised too long ago: gocams2genes_20220726.txt

Due to the current limitation with the RDF endpoints mentioned above (missing vocab:category tags), here's the SOP for generating this GO-CAM -> gene product TSV using blazegraph-runner, blazegraph-production.jnl, and NEO:

# First download the blazegraph-runner release tarball, then unpack it
tar -zxf blazegraph-runner-1.6.5.tgz
# Fetch the production journal (blazegraph-production.jnl.gz), then decompress it
gunzip blazegraph-production.jnl.gz
# Rename the journal, then load NEO (neo.owl) into it
mv blazegraph-production.jnl blazegraph-production-neo.jnl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-production-neo.jnl --informat=rdfxml neo.owl
# Run the query and write the results as TSV
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner select --journal=blazegraph-production-neo.jnl --outformat=tsv gocams2genes.rq gocams2genes.tsv

The gocams2genes.rq query is saved in a public gist for now.

Of course, since this is the production journal being queried, it won't have the Reactome models because they're still modelstate=development. To access those, just swap in blazegraph-internal.jnl. We also still need a query specific to Reactome or to all metabolic pathways.

@nmarkari Don't worry about having to run the whole SOP above for now. I can just provide the output file to you. Let me know if you spot anything off about the attached gocams2genes_20220726.txt file (e.g. models missing, genes missing).
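
A quick sanity check on the output can help catch the kinds of problems mentioned above (models or genes missing). The sketch below is a minimal example in Python. It assumes the TSV has a header row and columns named `model_id`, `gene_id`, and `gene_label`; these names are hypothetical, and the real headers depend on the variables selected in gocams2genes.rq. The identifiers in the toy input are made up.

```python
import csv
import io

def summarize(tsv_text, model_col="model_id", gene_col="gene_id", label_col="gene_label"):
    """Count distinct models and genes, and list gene IDs that lack a label.

    Column names are assumptions; pass the real header names if they differ.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    models, genes, unlabeled = set(), set(), set()
    for row in reader:
        models.add(row[model_col])
        genes.add(row[gene_col])
        # Treat a missing or whitespace-only label as "unlabeled".
        if not (row.get(label_col) or "").strip():
            unlabeled.add(row[gene_col])
    return {"models": len(models), "genes": len(genes), "unlabeled": sorted(unlabeled)}

# Toy input with made-up identifiers, for illustration only.
example = (
    "model_id\tgene_id\tgene_label\n"
    "gomodel:0001\tEX:gene1\tgeneA\n"
    "gomodel:0001\tEX:gene2\t\n"
    "gomodel:0002\tEX:gene1\tgeneA\n"
)
summary = summarize(example)
print(summary)
```

To run it against a real results file, read the file contents and pass them in, for example `summarize(open("gocams2genes.tsv").read())`.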

dustine32 commented 1 year ago

@nmarkari I updated the query to aggregate multiple taxa into the same line, which prevents multiple lines per model. Attached new results file: gocams2genes_20220728.txt
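
This kind of aggregation is normally done in the query itself (SPARQL 1.1 provides `GROUP_CONCAT` for it), but the effect can be illustrated on the output rows. The following is a minimal Python sketch, not the actual query logic: it assumes rows of (model, gene, taxon), groups on (model, gene), and joins taxa with `|`. The grouping key, the separator, and the identifiers are all illustrative assumptions.

```python
def collapse_taxa(rows, sep="|"):
    """Merge rows that differ only in taxon into one row per (model, gene)."""
    grouped = {}  # dicts preserve insertion order in Python 3.7+
    for model, gene, taxon in rows:
        taxa = grouped.setdefault((model, gene), [])
        if taxon not in taxa:  # skip duplicate taxa for the same key
            taxa.append(taxon)
    return [(model, gene, sep.join(taxa)) for (model, gene), taxa in grouped.items()]

# Toy rows with made-up model and gene identifiers.
rows = [
    ("gomodel:0001", "EX:gene1", "NCBITaxon:9606"),
    ("gomodel:0001", "EX:gene1", "NCBITaxon:10090"),
    ("gomodel:0002", "EX:gene2", "NCBITaxon:9606"),
]
collapsed = collapse_taxa(rows)
for row in collapsed:
    print("\t".join(row))
```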

dustine32 commented 1 year ago

@nmarkari I started work on a separate query for metabolic pathway GO-CAMs here. It looks for models containing:

  1. Two activities, each either an MF descendant or a REACTO:molecular_event
  2. These two activities are connected via a causal relation (e.g. "directly provides input for")
  3. Activity1 -has_output-> [some class] <-has_input- Activity2
  4. The [some class] is not currently constrained to any closure (e.g. "small molecule") nor does it need to be the same instance (just the same class)
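
The four criteria above can be sketched as a check over a toy edge list. This is only an illustration in Python, not the SPARQL used in the actual query. The data layout (`types`, `edges`), the use of plain strings for relations, and every identifier are hypothetical. It also assumes that criterion 1 has already been applied, so `activities` contains only nodes typed as an MF descendant or REACTO:molecular_event.

```python
from itertools import permutations

def metabolic_pairs(activities, types, edges, causal_relations):
    """Return (a1, a2, cls) triples where a1 is causally connected to a2 and
    some output of a1 has the same class as some input of a2.
    The output and input may be different instances of that class."""
    def classes(activity, relation):
        # Classes of all nodes reached from `activity` via `relation`.
        return {types[o] for s, p, o in edges if s == activity and p == relation}

    hits = []
    for a1, a2 in permutations(sorted(activities), 2):
        causal = any(s == a1 and o == a2 and p in causal_relations for s, p, o in edges)
        if not causal:
            continue
        shared = classes(a1, "has_output") & classes(a2, "has_input")
        for cls in sorted(shared):
            hits.append((a1, a2, cls))
    return hits

# Toy model: chem1 and chem2 are distinct instances of the same (made-up) class.
types = {"chem1": "CHEBI:X", "chem2": "CHEBI:X", "chem3": "CHEBI:Y"}
edges = [
    ("act1", "directly provides input for", "act2"),
    ("act1", "has_output", "chem1"),
    ("act2", "has_input", "chem2"),
    ("act2", "has_output", "chem3"),
]
pairs = metabolic_pairs({"act1", "act2"}, types, edges, {"directly provides input for"})
print(pairs)
```

Matching on class rather than instance mirrors criterion 4: `chem1` and `chem2` are different nodes, yet the pair still qualifies because both are typed with the same class.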

My working example is

There is currently no label column for the models' genes (and, confusingly, I put the gene IDs in the ?controllers column) because more work is needed to dig the UniProt IDs out of reacto.owl. To be clear, I used the blazegraph-internal.jnl file along with loading neo.owl and reacto.owl:

JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-internal-neo-reacto.jnl --informat=rdfxml neo.owl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner load --journal=blazegraph-internal-neo-reacto.jnl --informat=rdfxml reacto.owl
JAVA_OPTS=-Xmx12G ./bin/blazegraph-runner select --journal=blazegraph-internal-neo-reacto.jnl --outformat=tsv metabolic_gocams2genes.rq metabolic_gocams2genes.tsv

Attached is the results list: metabolic_gocams2genes_20220728.txt

This fetches 1100 models: 1083 Reactome and 17 manually curated.