clingen-data-model / genegraph

Presents an RDF triplestore of gene information using GraphQL APIs
5 stars 0 forks source link

Enable clinvar messages to be ingested from kafka out of order #775

Open theferrit32 opened 1 year ago

theferrit32 commented 1 year ago

Recent changes to the variation-normalization deployment to use nginx load balancing enables more performant parallel request handling.

In the vrs_analysis module, I enabled parallelism that outputs finished inputs out of order, as soon as completed, in order to maximize parallelism while not blowing up memory usage. In the snapshot module, we now take advantage of the nginx load balancing to improve the parallelism of each batch of kafka records, but if one event takes a long time to finish, the snapshot-populate function cannot proceed to processing the next batch of kafka messages. The greedy parallelism only happens within each batch of kafka messages.

One way to speed up the snapshot-populate function therefore would be to enable it to start processing messages in the next batches of kafka messages before it has finished the current one. In order to control RAM usage this will also require outputting finished events out of order by using claypoole/upmap. But we do have some constraints around what has to be processed before other things. At the moment, clinical_assertions need their variations, traits, and trait_sets to exist in the local database beforehand because they get embedded/linked to a specific version during the add-data-for-clinical-assertion transformer function.

There are two possible solutions to enable out of order processing:

  1. do not do any lookups to other objects in the local db during ingest/transform. This will push that logic into the query side, the snapshot function and any other function that wants to get a full clinical_assertion document will need to do additional queries for the individual subcomponents
  2. process out-of-order only within a data type. As in, the order of processing variations doesn't matter, but variations need to all be processed before clinical_assertions. We can use a lazy-seq partitioning on entity_type to do this. Where each partition is processed with claypoole/upmap, but the list of partitions is processed one at a time.

I think (2) is a preferable option because it doesn't require changing any code in the transformers, and this logic ideally could be applied to other spots we ingest messages which can have some disordering but do need some amount of order.

Genegraph unordered parallel ingest