StanfordBioinformatics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Query to relate JobRequest to Job failing #28

Open pbilling opened 2 years ago

pbilling commented 2 years ago

I am seeing that the query to relate (:JobRequest)-[:TRIGGERED]->(:Job) routinely fails on the first try and gets requeued. Looking at the query, it seems inefficient and in need of refactoring.

Query from "RelateJobToJobRequest" trigger with an example trellisTaskId:

1 MATCH (b:Blob)-[:WAS_USED_BY]->(j:Job { trellisTaskId: "220128-172700-948-23484cbb" }), 
2    (b)-[:WAS_USED_BY]->(jr:JobRequest {name: j.name}) 
3 MATCH (b2:Blob)-[:WAS_USED_BY]->(jr) 
4 WHERE NOT (jr)-[:TRIGGERED]->(:Job) 
5 WITH j, jr, COLLECT(DISTINCT b) AS jobInputs, COLLECT(DISTINCT b2) AS requestInputs 
6 WITH j, jr, jobInputs, requestInputs, [b in jobInputs WHERE NOT b in requestInputs] AS mismatches, [b in requestInputs WHERE NOT b in jobInputs] AS mismatches2 
7 WHERE size(mismatches) = size(mismatches2) = 0 MERGE (jr)-[:TRIGGERED]->(j)

I've added line numbers to the query to make it easier to talk about. It looks like my goal in writing this query was to match job requests to jobs by identifying those with the same set of input nodes and then connecting them. Using the identity of input sets to map requests to jobs seems like a needlessly complex way to implement this, and it also led to an ugly, difficult-to-decipher Cypher query (sorry).

Things I hate about this query in order from most to least egregious:

  1. Back-to-back WITH clauses on lines 5 and 6.
  2. Two list-comprehension expressions inside the WITH clause on line 6.
  3. Arithmetic on line 7. Math isn't always bad, but in general it seems to slow down queries.

I should have looked at this query, seen how convoluted it was, and realized I needed to rethink my approach. But sometimes I get so caught up in figuring out how to do something that I forget to ask: should I be doing this?

pbilling commented 2 years ago

The simplest solution seems to be to generate an ID for each JobRequest and then pass it to the job-launching function. That value can then be used for mapping, rather than requiring set-comparison logic (which inputs are shared?) to map requests to jobs.
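
A rough sketch of what that could look like, assuming a hypothetical requestId property that is generated when the JobRequest node is created and copied onto the Job by the launcher (neither the property nor the propagation exists yet; the names are illustrative):

```cypher
// Hypothetical: requestId is minted at JobRequest creation, passed to the
// job-launching function, and stored on the resulting Job node.
MATCH (jr:JobRequest {requestId: $requestId})
MATCH (j:Job {requestId: $requestId})
MERGE (jr)-[:TRIGGERED]->(j)
```

This would replace the whole input-set comparison with a single keyed lookup on each side.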

pbilling commented 2 years ago

JobRequest nodes already have an "eventId" property that references the unique Pub/Sub message that delivered the query. I could create an index on that property and use it for mapping.
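
A hedged sketch of that approach, assuming the job-launching function is updated to copy the request's eventId onto the Job node it creates (that propagation does not exist yet):

```cypher
// Index the existing eventId property (Neo4j 4.x index syntax).
CREATE INDEX jobrequest_eventid IF NOT EXISTS
FOR (jr:JobRequest) ON (jr.eventId);

// The trigger query then reduces to a keyed match on both sides.
MATCH (jr:JobRequest {eventId: $eventId})
MATCH (j:Job {eventId: $eventId})
MERGE (jr)-[:TRIGGERED]->(j)
```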