clingen-data-model / clinvar-streams


Update/Expand Test Data set #55

Closed larrybabb closed 2 years ago

larrybabb commented 2 years ago

The ClinVar test data set used in development should be updated to the current release and expanded to cover additional use cases, specifically CNVs and protein-only variants. Now that we have established the baseline transformation pipeline for variation, we need to ensure we cover a wider range of use cases.

larrybabb commented 2 years ago

@theferrit32 Here’s a new test dataset for ClinVar. It has all the VCEP SCVs for the RUNX1 gene, all the SCVs for the variants that have been curated thus far with the Chrome extension, and all the previously cherry-picked variations you’ve been dealing with. I also added a dozen or so RUNX1 CNVs for good measure. I’m avoiding the PGx star allele genotypes and haplotypes at this point, as they are too distracting and disruptive, but we can discuss if you’d like. This set contains all ClinVar releases up to 5/17/22.

larrybabb commented 2 years ago

I stuck the dataset in the clingen-dev clinvar-streams-dev bucket for now. Feel free to move it.

larrybabb commented 2 years ago

@theferrit32 should this be in "review" or "in progress"? We should agree on what this ticket is about so we can determine whether I'm reviewing that you have updated the testing pipeline you are working with, or you are reviewing that the test data set I gave you is "error-free" and useful enough to begin implementing into the test workflow. Let me know your thoughts on which way this ticket should go. Feel free to modify its status or description so we can track it properly.

theferrit32 commented 2 years ago

@larrybabb this should be in progress; I am working on it now.

theferrit32 commented 2 years ago

The new test data set is mostly loaded. I discovered a bug in my upload function that makes it stop early, uploading only part of the sequence of releases. A fix is in progress, but there is enough data now to get CNV variations downstream into the variation transformer.

theferrit32 commented 2 years ago

There was a bug in this function that made it miss some timestamp directories. https://github.com/clingen-data-model/clinvar-streams/blob/e78e38ae00afcf7c784f1fc397aca83947c57e6e/test/clinvar_raw/generate_local_topic.clj#L78-L92

The new clinvar-raw test stream has been loaded to clinvar-raw-testdata_20220523. There are 147410 messages.