StanfordBioinformatics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Relationship queries have to be routinely retried #16

Open pbilling opened 4 years ago

pbilling commented 4 years ago

Relationship queries currently follow this pattern:

MATCH (j:Job {trellisTaskId: 123}),
      (n:Blob {trellisTaskId: 123, id: 123})
WHERE NOT EXISTS(j.duplicate)
   OR NOT j.duplicate=True
MERGE (j)-[:OUTPUT]->(n)

The problem is that this pattern requires the job and output nodes to be added to the database synchronously, with the job node already present when the relationship query runs. In cases where the output is added before the job, the current solution is to wait a few seconds and then retry the relationship query up to n times.

This is bad design because 1) it violates the asynchronous nature of the system and 2) even after multiple retries there are cases where the conditions for adding the relationship (i.e. job node is present) are still not met. Additionally, the retry queries increase the load on the database.
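The wait-and-retry behavior described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual Trellis code: `add_relationship_with_retries`, `run_query`, and `make_fake_query` are hypothetical names, and `run_query` stands in for whatever call executes the relationship query and reports whether the MATCH found both nodes.

```python
import time


def add_relationship_with_retries(run_query, max_retries=3,
                                  delay_seconds=5.0, sleep=time.sleep):
    """Retry a relationship query until it succeeds or retries run out.

    `run_query` is a hypothetical callable that returns True when the
    MATCH found both the job and the output node.
    Returns a (succeeded, attempts) tuple.
    """
    attempts = 0
    for attempt in range(1 + max_retries):
        attempts += 1
        if run_query():
            return True, attempts
        # Wait before the next try, but not after the final one.
        if attempt < max_retries:
            sleep(delay_seconds)
    return False, attempts


def make_fake_query(job_arrives_on_attempt):
    """Simulate a job node that only exists from the Nth attempt onward."""
    state = {"calls": 0}

    def run_query():
        state["calls"] += 1
        return state["calls"] >= job_arrives_on_attempt

    return run_query


# The job node shows up in time: two wasted queries before success.
ok, attempts = add_relationship_with_retries(
    make_fake_query(3), max_retries=3, delay_seconds=0)

# The job node shows up too late: every attempt is wasted and the
# relationship is never added.
late_ok, late_attempts = add_relationship_with_retries(
    make_fake_query(10), max_retries=3, delay_seconds=0)
```

Both failure modes from the issue are visible here: each retry is an extra query against the database, and a job node that arrives after the last retry leaves the output unconnected.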

The solution should be straightforward: instead of matching the job node, merge it. For example:

MATCH (n:Blob {trellisTaskId: 123, id: 123})
MERGE (j {trellisTaskId: 123})-[:OUTPUT]->(n)
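One caveat worth checking before adopting this: in Cypher, MERGE matches or creates the whole pattern it is given, so when the relationship does not exist yet, a single-pattern MERGE creates a new job node even if one with the same `trellisTaskId` is already in the database. A possible variant, sketched below, merges the job node on its own and then merges the relationship between the two bound nodes. It also adds the `:Job` label, which the example above omits. The function name and the parameter names `task_id` and `blob_id` are hypothetical, and the sketch assumes that whichever function writes job nodes also uses MERGE on `trellisTaskId` rather than CREATE.

```python
def build_output_relationship_query(task_id, blob_id):
    """Build a parameterized Cypher query linking a job to its output.

    Returns a (query, parameters) pair that could be passed to a Neo4j
    driver. Nothing is executed here.
    """
    query = "\n".join([
        # Only proceed if the output node is already present.
        "MATCH (n:Blob {trellisTaskId: $task_id, id: $blob_id})",
        # Match the job node if it exists, otherwise create a stub.
        "MERGE (j:Job {trellisTaskId: $task_id})",
        # Both nodes are bound now, so only the relationship is merged.
        "MERGE (j)-[:OUTPUT]->(n)",
    ])
    parameters = {"task_id": task_id, "blob_id": blob_id}
    return query, parameters


query, parameters = build_output_relationship_query(123, 123)
print(query)
```

Concurrent MERGE calls on the same key can still race, so a uniqueness constraint on `trellisTaskId` for `:Job` nodes may be worth considering alongside this change.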

The trade-off is that we lose the MATCH pattern that ensures that duplicate jobs are not related to outputs, but it's not clear how valuable this was in the first place. And now that duplication rates have been reduced, it's even less useful.

pbilling commented 4 years ago

Monitoring dashboard heatmap showing function execution time over time. The executions with a 5 sec runtime represent retried database queries.

[Figure: Function execution times (SUM)]