clingen-data-model / clinvar-streams


False negative updates are originating in the clinvar_raw event stream #57

Closed larrybabb closed 1 year ago

larrybabb commented 2 years ago

The DSP clinvar ingest process is producing false negative "update" records. These appear to be caused by the difficulty of comparing the content fields of the various classes parsed out of the clinvar XML data.

The content fields contain serialized JSON strings for any of the original XML nodes that are not explicitly parsed out during the DSP clinvar ingestion parsing and diffing process.

These content fields are not consistently ordered, so two of them can be identical in content but differ in the order of their elements.
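To illustrate the ordering problem, here is a minimal Python sketch that compares two content strings after re-serializing them with sorted keys. The field names (`Citation`, `Type`, and so on) and the helper names are made up for illustration and are not taken from the actual clinvar-raw messages. Note that this only normalizes object key order; if arrays inside the content can also be reordered, they would need separate handling.

```python
import json


def canonical_content(content: str) -> str:
    """Re-serialize a JSON string with sorted keys and fixed separators,
    so that key order and whitespace no longer affect comparison."""
    return json.dumps(json.loads(content), sort_keys=True, separators=(",", ":"))


def same_content(a: str, b: str) -> bool:
    """True when two serialized content fields hold the same JSON value."""
    return canonical_content(a) == canonical_content(b)


# Hypothetical content fields: same data, different key order.
old = '{"Citation": {"ID": "123", "Source": "PubMed"}, "Type": "x"}'
new = '{"Type": "x", "Citation": {"Source": "PubMed", "ID": "123"}}'

print(old == new)              # plain string comparison sees a change
print(same_content(old, new))  # canonical comparison does not
```

A plain string comparison reports these two values as different, which is the kind of spurious "update" described above.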

Per discussions with @tbl3rd, the most reasonable place to control and resolve this is probably the point in the process where we are notified of a new release, pull in the delta files, parse them, and stream the initial messages onto the clinvar-raw topics in the Data Exchange (confluent kafka service).

If it is possible to compare each update record with the last representation of that same class, to confirm that the update is a true update (which may involve inspecting all fields, not just the content), then we can drop the false negatives on the floor and avoid the downstream noise they produce.
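A rough Python sketch of that idea, under several assumptions: each message is a dict with `event_type`, `entity_type`, `id`, `release_date`, and `content` fields (these names are hypothetical, not the real clinvar-raw schema), and the "last representation" is kept in an in-memory dict standing in for whatever state store the real process would use. It fingerprints every field except the ones expected to change on each release, and drops updates whose fingerprint matches the previous one.

```python
import hashlib
import json
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: these fields differ between releases even when nothing
# meaningful changed, so they are left out of the comparison.
IGNORED_FIELDS = {"event_type", "release_date"}


def fingerprint(record: dict) -> str:
    """Hash all compared fields, parsing `content` so its key order is ignored."""
    normalized = {k: v for k, v in record.items() if k not in IGNORED_FIELDS}
    content = normalized.get("content")
    if isinstance(content, str):
        try:
            normalized["content"] = json.loads(content)
        except json.JSONDecodeError:
            pass  # not JSON: compare the raw string instead
    blob = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def drop_unchanged_updates(
    events: Iterable[dict], last_seen: Dict[Tuple[str, str], str]
) -> Iterator[dict]:
    """Yield events, skipping updates identical to the last representation."""
    for event in events:
        key = (event["entity_type"], event["id"])
        if event["event_type"] == "delete":
            last_seen.pop(key, None)
            yield event
            continue
        digest = fingerprint(event)
        if event["event_type"] == "update" and last_seen.get(key) == digest:
            continue  # same record as before: drop it on the floor
        last_seen[key] = digest
        yield event


events = [
    {"event_type": "create", "entity_type": "variation", "id": "1",
     "release_date": "2021-01-01", "content": '{"a": 1, "b": 2}'},
    {"event_type": "update", "entity_type": "variation", "id": "1",
     "release_date": "2021-02-01", "content": '{"b": 2, "a": 1}'},
    {"event_type": "update", "entity_type": "variation", "id": "1",
     "release_date": "2021-03-01", "content": '{"a": 1, "b": 3}'},
]
kept = list(drop_unchanged_updates(events, {}))
print([e["release_date"] for e in kept])
```

In this example the second event only reorders the content keys, so it is dropped, while the third changes a value and is kept. Which fields belong in `IGNORED_FIELDS` would need to be decided against the real message schema.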

I haven't done an exhaustive analysis, but there are indications that the variation class is producing an enormous number of false negative updates across sporadic releases. The bug that prevents content fields from being compared properly in the DSP clinvar ingest processing is likely causing false negatives in other classes as well.

larrybabb commented 2 years ago

@tbl3rd I believe this is the program that picks up the delta files from DSP, parses them in a specific order, and produces the messages onto the clinvar-raw topic for subsequent transformation and deposit into genegraph.

@theferrit32 can confirm and walk you through the design of this project.

larrybabb commented 2 years ago

We should consider whether issue #58 will have an impact on where and how we implement this solution.

larrybabb commented 2 years ago

So @tbl3rd, please push back if I've overstepped, but I assumed the draft pull request "Experiment with ingest ideas" was meant for working on this ticket, so I connected it. If that is cool, then I would like to move this pull request over to the clinvar-streams repo. By default, zenhub picks the genegraph repo. I think there's a transfer/move function in git that allows it to be moved, in case you want to do it.

larrybabb commented 2 years ago

@tbl3rd I also assume you know that you can connect more than one request to a single PR and multiple PRs to a single request. I really prefer it when the devs connect the issues that are addressed in whole by a PR so that zenhub can present them together.