Closed larrybabb closed 1 year ago
@tbl3rd I believe this is the program that picks up the delta files from DSP, parses them in a specific order and produces the messages onto the clinvar-raw topic for subsequent transformation and deposit into genegraph.
@theferrit32 can confirm and walk you through the design of this project.
We should consider whether issue #58 will have an impact on where and how we implement this solution.
So @tbl3rd please push back if I've overstepped. I assumed the draft pull request "Experiment with ingest ideas" was meant for working on this ticket, so I connected it. If that is cool, then I would like to move this pull request over to the clinvar-streams repo. By default, zenhub picks the genegraph repo. I think there's a transfer/move function in git which allows it to be moved, in case you want to do it.
@tbl3rd I also assume you know that you can connect more than one request to a single PR and multiple PRs to a single request. I really prefer it when the devs connect the issues that are addressed in whole by a PR so that zenhub can present them together.
The DSP clinvar ingest process is producing false negative "update" records which seem to be caused by the challenges related to comparing the `content` fields of the various classes being parsed out of the clinvar xml data. The `content` fields contain serialized json strings for any of the original XML nodes that are not explicitly parsed out of the XML during the DSP clinvar ingestion parsing and diffing process. These `content` fields are not consistently organized and thus can be identical in content but not in order.
Per discussions with @tbl3rd, the most reasonable path for controlling and resolving this is possibly at the point of the process where we are notified of a new release, pull in the delta files, and parse and stream the initial messages onto the clinvar-raw topics in the Data Exchange (confluent kafka service).
If it is possible to compare the update records with the last representation of that same class to make sure that the update is a true update (which may involve inspecting all fields not just the content) then we can drop the false negatives on the floor and avoid the downstream noise being produced.
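To make the comparison idea concrete, here is a minimal Python sketch of an order-insensitive check. It is an illustration only, not the clinvar-streams implementation: the function names (`normalize`, `canonical_record`, `is_true_update`), the `ignore_fields` and `ignore_list_order` parameters, and the record shape (a dict with a `content` key holding a JSON string) are all assumptions. It is also an assumption that the ordering differences are in object keys; if arrays are reordered too, `ignore_list_order=True` would be needed, and whether that is safe depends on whether array order is meaningful in the source XML.

```python
import json


def normalize(value, ignore_list_order=False):
    """Recursively normalize parsed JSON so equal data has one representation."""
    if isinstance(value, dict):
        return {k: normalize(v, ignore_list_order) for k, v in value.items()}
    if isinstance(value, list):
        items = [normalize(v, ignore_list_order) for v in value]
        if ignore_list_order:
            # Sort by each element's canonical text so mixed types still sort.
            items.sort(key=lambda v: json.dumps(v, sort_keys=True))
        return items
    return value


def canonical_record(record, ignore_fields=(), ignore_list_order=False):
    """Return a canonical string for a record (hypothetical dict shape)."""
    rec = {k: v for k, v in record.items() if k not in ignore_fields}
    content = rec.get("content")
    if isinstance(content, str):
        try:
            # Assumed: `content` holds a serialized JSON string.
            rec["content"] = json.loads(content)
        except json.JSONDecodeError:
            pass  # Not valid JSON: compare the raw string as-is.
    # sort_keys=True removes differences in object key order.
    return json.dumps(normalize(rec, ignore_list_order), sort_keys=True)


def is_true_update(previous, update, ignore_fields=(), ignore_list_order=False):
    """True only if the update differs from the previous record after canonicalization."""
    return (canonical_record(previous, ignore_fields, ignore_list_order)
            != canonical_record(update, ignore_fields, ignore_list_order))


previous = {"id": "1", "name": "x", "content": '{"a": 1, "b": [1, 2]}'}
reordered = {"id": "1", "name": "x", "content": '{"b": [1, 2], "a": 1}'}
renamed = {"id": "1", "name": "y", "content": '{"a": 1, "b": [1, 2]}'}

print(is_true_update(previous, reordered))  # same data, different key order
print(is_true_update(previous, renamed))    # a non-content field changed
```

An update for which `is_true_update` returns `False` could be dropped before it is produced onto the clinvar-raw topic. Fields that legitimately change on every release would have to be listed in `ignore_fields`, otherwise every record would look like a true update.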
I haven't done exhaustive analysis but there are indications that the variation class is producing an enormous amount of false negative updates across sporadic releases. It is likely the bug that is causing the inability to properly compare `content` fields in the dsp clinvar ingest processing is also causing false negatives in other classes as well.