StanfordBioinformatics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Ubam nodes were not added to the database #27

Closed pbilling closed 2 years ago

pbilling commented 2 years ago

The problem

// Cypher query to categorize FastqToUbam jobs without Ubam outputs
MATCH (job:Job:FastqToUbam)-[:STATUS]->(d:Dstat)
WHERE NOT (job)-[:GENERATED]->(:Ubam)
RETURN COUNT(DISTINCT job), d.status

Result:

COUNT(DISTINCT job)    d.status
625                    "FAILURE"
733                    "RUNNING"
19673                  "SUCCESS"

We can see three different cases here. For "FAILURE", we don't expect an output. For "SUCCESS", we do expect an output. For "RUNNING", we might expect an output: those jobs are no longer running, but their end result was never recorded in the database, so we don't know whether they succeeded or failed.

Right now, I am going to focus on finding and adding Ubams to the database for successful jobs, since those are the majority of cases.

pbilling commented 2 years ago

Troubleshooting plan

From the query, I know there is a FastqToUbam job node for these Ubams and that it succeeded, so I can expect that there is a Ubam object in cloud storage and I just need to register it in the database.

Because each (:FastqToUbam) job node has an "output_UBAM" property containing the cloud storage path to the object, I'm going to try the following steps:

  1. Find FastqToUbam jobs that don't have a Ubam.
  2. Get the value of the "output_UBAM" property of the job nodes.
  3. Use a gsutil command to update the metadata of each missing Ubam object, thus triggering Trellis to add the Ubam node to the database.
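
The plan assumes that every output path actually points to an object in cloud storage. A minimal sketch of how that assumption could be checked before step 3 is below. The function name `check_ubam_paths` and its three file arguments are hypothetical and not part of Trellis; the only external call is `gsutil -q stat`, which exits with status 0 when the object exists.

```shell
#!/bin/bash
# Sketch (hypothetical helper, not part of Trellis): sort candidate paths into
# "found" and "missing" files before updating any object metadata.
# Usage: check_ubam_paths INPUT_FILE FOUND_FILE MISSING_FILE
check_ubam_paths() {
    local input="$1" found="$2" missing="$3"
    : > "$found"      # truncate or create the output files
    : > "$missing"
    # The "|| [ -n ... ]" clause handles a last line without a trailing newline.
    while IFS= read -r uri || [ -n "$uri" ]; do
        [ -z "$uri" ] && continue    # skip blank lines
        # Redirect stdin so gsutil cannot consume the rest of the input file.
        if gsutil -q stat "$uri" < /dev/null; then
            printf '%s\n' "$uri" >> "$found"
        else
            printf '%s\n' "$uri" >> "$missing"
        fi
    done < "$input"
}
```

For example, `check_ubam_paths missing-ubams.csv found.txt not-found.txt` would leave only the paths worth passing on to the metadata update in found.txt.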
pbilling commented 2 years ago

1. & 2. Get cloud storage path of missing Ubams

// Cypher query
MATCH (job:Job:FastqToUbam)-[:STATUS]->(d:Dstat)
WHERE NOT (job)-[:GENERATED]->(:Ubam)
AND (d.status = "SUCCESS" OR d.status = "RUNNING")
RETURN job.output_UBAM
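
The metadata script further down expects one bare gs:// path per line, so the query results need some cleanup first. The sketch below assumes the results were exported as a CSV file with a header row and double-quoted values; that export format, and the helper name `clean_uri_export`, are assumptions and not something stated in this issue. It also drops duplicates, since the query can return the same path more than once if a job matches several status nodes.

```shell
#!/bin/bash
# Sketch (hypothetical helper): turn an exported query result into a clean
# list of storage paths, one per line, in their original order.
# Usage: clean_uri_export EXPORT_FILE > missing-ubams.csv
clean_uri_export() {
    # 1. Remove carriage returns left by CRLF line endings.
    # 2. Strip one leading and one trailing double quote from each line.
    # 3. Keep only lines starting with gs:// (drops the header row and null
    #    values) and print each distinct path once.
    tr -d '\r' < "$1" \
        | sed -e 's/^"//' -e 's/"$//' \
        | awk '/^gs:\/\// && !seen[$0]++'
}
```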
pbilling commented 2 years ago

Update Ubam object metadata (n=5)

Instead of trying to add all ~20,000 Ubams to the database at once, I am going to start with a small subset (n=5) and see what happens.

Bash script to update object metadata:

#!/bin/bash
# Update object metadata in Google Cloud Storage.
# Input is a text file with one Google Storage path per line (e.g. "gs://bucket/path").

while IFS= read -r line; do
    gsutil setmeta -h 'x-goog-meta-trellis-ver:1.2.3' "$line"
done < "$1"

Console commands:

head -n5 missing-ubams.csv > missing-ubams-n5.csv
./update-object-metadata.sh missing-ubams-n5.csv
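
If the first five objects go through cleanly, the remaining paths could be processed in progressively larger batches rather than all at once. A minimal sketch of a batch selector follows; `next_batch` is a hypothetical helper name, and the batch sizes in the usage note are arbitrary.

```shell
#!/bin/bash
# Sketch (hypothetical helper): print COUNT lines of FILE, starting at line
# START (1-indexed), so each batch picks up where the previous one stopped.
# Usage: next_batch FILE START COUNT
next_batch() {
    local file="$1" start="$2" count="$3"
    local end=$((start + count - 1))
    # sed prints only the requested range; a range that runs past the end of
    # the file simply stops at the last line.
    sed -n "${start},${end}p" "$file"
}
```

For example, `next_batch missing-ubams.csv 6 100 > missing-ubams-batch2.csv` would select the 100 paths after the first five, and that file could then be passed to update-object-metadata.sh.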
pbilling commented 2 years ago

After running the script, I noticed that the MERGE queries used to add the nodes to the database were timing out (limit = 90 seconds). Right now, nodes are merged on a composite index, :Blob(bucket, path), and I think composite indexes are known to be less performant than single-property indexes. If I run the same query using the :Ubam(uri) index instead, it completes in 1.3 seconds.

I'm going to deploy a hotfix that updates the query to use the :Ubam(uri) index instead.

pbilling commented 2 years ago

From the Logs Explorer for the db-query function, I can see that the MERGE queries seem to have run correctly. I can also verify this by querying for the Ubams in the database.

[Image: trellis-time-to-run-merge-query]

pbilling commented 2 years ago

I'm also going to drop the index on :Blob(path, bucket) and replace it with an index on :Blob(uri).

// Cypher query
DROP INDEX ON :Blob(path, bucket)
// Cypher query
CREATE INDEX ON :Blob(uri)
pbilling commented 2 years ago

https://github.com/StanfordBioinformatics/trellis-mvp-functions/commit/8c28c3401543256e1d5273f224add3d0155fa659