clingen-data-model / clinvar-streams

Fix deduplication of clinvar-raw stream #64

Closed theferrit32 closed 1 year ago

theferrit32 commented 2 years ago

Deduplication works when run against a local file, but not against messages read from a topic. Most likely some additional field is being included in the comparison, or some content is not being parsed consistently.

theferrit32 commented 2 years ago

The problem: when running against a file, I had parsed the full message, but when reading from a topic, the nested :content :content field was left unparsed when comparing for duplicates.

https://github.com/clingen-data-model/clinvar-streams/blob/2458f23b154ec7112e292018cbed97b724dd21a2/src/clinvar_raw/stream.clj#L209
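A minimal sketch of the mismatch described above, not the code in stream.clj. It assumes a message is a map whose nested value at [:content :content] is either already parsed (the file path) or still a string (the topic path). The helper names `parse-nested-content` and `same-record?` are hypothetical, and the demo uses `clojure.edn/read-string` on an EDN string purely as a stand-in for whatever parser the real payload needs.

```clojure
(require '[clojure.edn :as edn])

(defn parse-nested-content
  "If the value at [:content :content] is still a string, parse it with
  parse-fn; otherwise return msg unchanged. (Hypothetical helper.)"
  [parse-fn msg]
  (if (string? (get-in msg [:content :content]))
    (update-in msg [:content :content] parse-fn)
    msg))

(defn same-record?
  "Compares two messages only after both are normalized to the parsed form."
  [parse-fn a b]
  (= (parse-nested-content parse-fn a)
     (parse-nested-content parse-fn b)))

;; Illustrative data only: the same record as seen from a file (parsed)
;; and from a topic (inner content still serialized, keys in another order).
(def from-file
  {:content {:entity_type "variation"
             :content {:id "1" :name "x"}}})

(def from-topic
  {:content {:entity_type "variation"
             :content "{:name \"x\" :id \"1\"}"}})

;; Naive equality treats the two as different records.
(= from-file from-topic)

;; Normalizing first makes them compare equal.
(same-record? edn/read-string from-file from-topic)
```

Comparing parsed maps rather than serialized strings also avoids false differences caused by key order or whitespace in the serialized form.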

theferrit32 commented 2 years ago

Variation deduplication counts for releases 2019-07-01 to 2022-09-13. Note that this includes a duplicate 2022-03-30 release, which causes some noisy updates in the second 2022-03-30 entry and in the release following it.

[
{:msg "Deduplication counts for release", :release-date "2019-07-01", :input-counter 512507, :output-counter 512507, :reduction-ratio 0.0}
{:msg "Deduplication counts for release", :release-date "2019-07-31", :input-counter 57329, :output-counter 4719, :reduction-ratio 0.917685639030857}
{:msg "Deduplication counts for release", :release-date "2019-09-02", :input-counter 532351, :output-counter 514581, :reduction-ratio 0.03338023221521139}
{:msg "Deduplication counts for release", :release-date "2019-10-01", :input-counter 124957, :output-counter 31806, :reduction-ratio 0.7454644397672799}
{:msg "Deduplication counts for release", :release-date "2019-11-05", :input-counter 508543, :output-counter 497597, :reduction-ratio 0.0215242368885227}
{:msg "Deduplication counts for release", :release-date "2019-12-02", :input-counter 33716, :output-counter 1752, :reduction-ratio 0.9480365405148891}
{:msg "Deduplication counts for release", :release-date "2019-12-31", :input-counter 214498, :output-counter 110548, :reduction-ratio 0.4846199032158808}
{:msg "Deduplication counts for release", :release-date "2020-02-03", :input-counter 622633, :output-counter 620914, :reduction-ratio 0.002760855913515667}
{:msg "Deduplication counts for release", :release-date "2020-03-02", :input-counter 64461, :output-counter 7197, :reduction-ratio 0.8883510960115419}
{:msg "Deduplication counts for release", :release-date "2020-03-30", :input-counter 91787, :output-counter 19917, :reduction-ratio 0.7830084870406485}
{:msg "Deduplication counts for release", :release-date "2020-05-06", :input-counter 303872, :output-counter 72581, :reduction-ratio 0.7611461404802022}
{:msg "Deduplication counts for release", :release-date "2020-06-02", :input-counter 206440, :output-counter 100938, :reduction-ratio 0.5110540592908351}
{:msg "Deduplication counts for release", :release-date "2020-06-09", :input-counter 50526, :output-counter 707, :reduction-ratio 0.986007204211693}
{:msg "Deduplication counts for release", :release-date "2020-06-15", :input-counter 26491, :output-counter 1351, :reduction-ratio 0.9490015476954436}
{:msg "Deduplication counts for release", :release-date "2020-06-22", :input-counter 49596, :output-counter 15141, :reduction-ratio 0.6947132833293008}
{:msg "Deduplication counts for release", :release-date "2020-06-29", :input-counter 209150, :output-counter 186900, :reduction-ratio 0.1063829787234043}
{:msg "Deduplication counts for release", :release-date "2020-07-06", :input-counter 26229, :output-counter 2576, :reduction-ratio 0.9017880971443822}
{:msg "Deduplication counts for release", :release-date "2020-07-17", :input-counter 574100, :output-counter 548042, :reduction-ratio 0.04538930499912907}
{:msg "Deduplication counts for release", :release-date "2020-07-20", :input-counter 34711, :output-counter 5939, :reduction-ratio 0.828901500965112}
{:msg "Deduplication counts for release", :release-date "2020-07-28", :input-counter 129666, :output-counter 5315, :reduction-ratio 0.9590100720312187}
{:msg "Deduplication counts for release", :release-date "2020-08-03", :input-counter 17557, :output-counter 864, :reduction-ratio 0.9507888591445008}
{:msg "Deduplication counts for release", :release-date "2020-08-10", :input-counter 205471, :output-counter 39371, :reduction-ratio 0.808386584968195}
{:msg "Deduplication counts for release", :release-date "2020-08-17", :input-counter 68372, :output-counter 5799, :reduction-ratio 0.9151845784824197}
{:msg "Deduplication counts for release", :release-date "2020-08-24", :input-counter 79536, :output-counter 4824, :reduction-ratio 0.9393482196741098}
{:msg "Deduplication counts for release", :release-date "2020-08-30", :input-counter 49726, :output-counter 8739, :reduction-ratio 0.8242569279652496}
{:msg "Deduplication counts for release", :release-date "2020-09-05", :input-counter 27515, :output-counter 660, :reduction-ratio 0.9760130837724877}
{:msg "Deduplication counts for release", :release-date "2020-09-14", :input-counter 12055, :output-counter 833, :reduction-ratio 0.9309000414765657}
{:msg "Deduplication counts for release", :release-date "2020-09-20", :input-counter 24515, :output-counter 655, :reduction-ratio 0.9732816642871711}
{:msg "Deduplication counts for release", :release-date "2020-09-28", :input-counter 28151, :output-counter 1781, :reduction-ratio 0.9367340414194878}
{:msg "Deduplication counts for release", :release-date "2020-10-03", :input-counter 20026, :output-counter 801, :reduction-ratio 0.9600019974033756}
{:msg "Deduplication counts for release", :release-date "2020-10-10", :input-counter 72515, :output-counter 4079, :reduction-ratio 0.9437495690546783}
{:msg "Deduplication counts for release", :release-date "2020-10-20", :input-counter 33600, :output-counter 416, :reduction-ratio 0.9876190476190476}
{:msg "Deduplication counts for release", :release-date "2020-10-26", :input-counter 48789, :output-counter 23748, :reduction-ratio 0.5132509377113694}
{:msg "Deduplication counts for release", :release-date "2020-10-31", :input-counter 51481, :output-counter 3168, :reduction-ratio 0.9384627338241293}
{:msg "Deduplication counts for release", :release-date "2020-11-07", :input-counter 10107, :output-counter 1244, :reduction-ratio 0.876916988225982}
{:msg "Deduplication counts for release", :release-date "2020-11-14", :input-counter 24320, :output-counter 2125, :reduction-ratio 0.9126233552631579}
{:msg "Deduplication counts for release", :release-date "2020-11-22", :input-counter 15482, :output-counter 2177, :reduction-ratio 0.8593850923653275}
{:msg "Deduplication counts for release", :release-date "2020-11-29", :input-counter 38681, :output-counter 3339, :reduction-ratio 0.913678550192601}
{:msg "Deduplication counts for release", :release-date "2020-12-08", :input-counter 55771, :output-counter 2192, :reduction-ratio 0.9606964192860089}
{:msg "Deduplication counts for release", :release-date "2020-12-12", :input-counter 18828, :output-counter 1678, :reduction-ratio 0.9108774166135543}
{:msg "Deduplication counts for release", :release-date "2020-12-26", :input-counter 36318, :output-counter 28347, :reduction-ratio 0.21947794482075}
{:msg "Deduplication counts for release", :release-date "2021-01-02", :input-counter 26858, :output-counter 3777, :reduction-ratio 0.8593715094199121}
{:msg "Deduplication counts for release", :release-date "2021-01-10", :input-counter 66230, :output-counter 698, :reduction-ratio 0.9894609693492376}
{:msg "Deduplication counts for release", :release-date "2021-01-19", :input-counter 183829, :output-counter 6566, :reduction-ratio 0.9642820229669965}
{:msg "Deduplication counts for release", :release-date "2021-01-23", :input-counter 2, :output-counter 2, :reduction-ratio 0.0}
{:msg "Deduplication counts for release", :release-date "2021-01-28", :input-counter 236587, :output-counter 40167, :reduction-ratio 0.8302231314484735}
{:msg "Deduplication counts for release", :release-date "2021-01-31", :input-counter 48189, :output-counter 4654, :reduction-ratio 0.9034219427670215}
{:msg "Deduplication counts for release", :release-date "2021-02-08", :input-counter 96177, :output-counter 10160, :reduction-ratio 0.894361437765786}
{:msg "Deduplication counts for release", :release-date "2021-02-13", :input-counter 31767, :output-counter 1030, :reduction-ratio 0.967576415777379}
{:msg "Deduplication counts for release", :release-date "2021-02-21", :input-counter 44268, :output-counter 7641, :reduction-ratio 0.8273922472214692}
{:msg "Deduplication counts for release", :release-date "2021-03-02", :input-counter 132771, :output-counter 14510, :reduction-ratio 0.8907140866604906}
{:msg "Deduplication counts for release", :release-date "2021-03-08", :input-counter 55752, :output-counter 18653, :reduction-ratio 0.6654290429042904}
{:msg "Deduplication counts for release", :release-date "2021-03-15", :input-counter 107542, :output-counter 32844, :reduction-ratio 0.6945937401201391}
{:msg "Deduplication counts for release", :release-date "2021-03-23", :input-counter 73946, :output-counter 35472, :reduction-ratio 0.5202985962729559}
{:msg "Deduplication counts for release", :release-date "2021-03-28", :input-counter 39970, :output-counter 22053, :reduction-ratio 0.4482611958969227}
{:msg "Deduplication counts for release", :release-date "2021-04-04", :input-counter 37148, :output-counter 179, :reduction-ratio 0.9951814364164961}
{:msg "Deduplication counts for release", :release-date "2021-04-15", :input-counter 231067, :output-counter 17331, :reduction-ratio 0.9249957804446329}
{:msg "Deduplication counts for release", :release-date "2021-04-24", :input-counter 17275, :output-counter 991, :reduction-ratio 0.9426338639652677}
{:msg "Deduplication counts for release", :release-date "2021-05-01", :input-counter 17767, :output-counter 751, :reduction-ratio 0.9577306241909157}
{:msg "Deduplication counts for release", :release-date "2021-05-11", :input-counter 50063, :output-counter 12453, :reduction-ratio 0.7512534206899307}
{:msg "Deduplication counts for release", :release-date "2021-05-17", :input-counter 86961, :output-counter 32376, :reduction-ratio 0.6276951736985545}
{:msg "Deduplication counts for release", :release-date "2021-05-24", :input-counter 83270, :output-counter 43407, :reduction-ratio 0.4787198270685721}
{:msg "Deduplication counts for release", :release-date "2021-05-29", :input-counter 42772, :output-counter 9871, :reduction-ratio 0.769218180117834}
{:msg "Deduplication counts for release", :release-date "2021-06-09", :input-counter 86419, :output-counter 42842, :reduction-ratio 0.5042525370578229}
{:msg "Deduplication counts for release", :release-date "2021-06-16", :input-counter 101279, :output-counter 8944, :reduction-ratio 0.9116894914049309}
{:msg "Deduplication counts for release", :release-date "2021-06-19", :input-counter 19605, :output-counter 2275, :reduction-ratio 0.8839581739352206}
{:msg "Deduplication counts for release", :release-date "2021-06-26", :input-counter 28148, :output-counter 1107, :reduction-ratio 0.9606721614324286}
{:msg "Deduplication counts for release", :release-date "2021-07-07", :input-counter 34991, :output-counter 2062, :reduction-ratio 0.9410705610014004}
{:msg "Deduplication counts for release", :release-date "2021-07-10", :input-counter 29128, :output-counter 1897, :reduction-ratio 0.9348736610821203}
{:msg "Deduplication counts for release", :release-date "2021-07-18", :input-counter 20021, :output-counter 4394, :reduction-ratio 0.7805304430348134}
{:msg "Deduplication counts for release", :release-date "2021-07-24", :input-counter 19758, :output-counter 3855, :reduction-ratio 0.8048891588217431}
{:msg "Deduplication counts for release", :release-date "2021-07-31", :input-counter 14863, :output-counter 1328, :reduction-ratio 0.9106506088945704}
{:msg "Deduplication counts for release", :release-date "2021-08-07", :input-counter 21960, :output-counter 7763, :reduction-ratio 0.6464936247723133}
{:msg "Deduplication counts for release", :release-date "2021-08-14", :input-counter 33581, :output-counter 7345, :reduction-ratio 0.7812751258151931}
{:msg "Deduplication counts for release", :release-date "2021-08-21", :input-counter 41661, :output-counter 13955, :reduction-ratio 0.6650344446844771}
{:msg "Deduplication counts for release", :release-date "2021-08-28", :input-counter 24016, :output-counter 5042, :reduction-ratio 0.7900566289140573}
{:msg "Deduplication counts for release", :release-date "2021-09-08", :input-counter 48345, :output-counter 14837, :reduction-ratio 0.693101665115317}
{:msg "Deduplication counts for release", :release-date "2021-09-12", :input-counter 28862, :output-counter 19666, :reduction-ratio 0.3186196382787056}
{:msg "Deduplication counts for release", :release-date "2021-09-19", :input-counter 47862, :output-counter 31615, :reduction-ratio 0.3394551000793949}
{:msg "Deduplication counts for release", :release-date "2021-09-27", :input-counter 133914, :output-counter 49386, :reduction-ratio 0.6312110757650432}
{:msg "Deduplication counts for release", :release-date "2021-09-29", :input-counter 82702, :output-counter 9428, :reduction-ratio 0.8860003385649682}
{:msg "Deduplication counts for release", :release-date "2021-10-02", :input-counter 30437, :output-counter 4202, :reduction-ratio 0.8619443440549331}
{:msg "Deduplication counts for release", :release-date "2021-10-10", :input-counter 254162, :output-counter 5900, :reduction-ratio 0.9767864590300674}
{:msg "Deduplication counts for release", :release-date "2021-10-16", :input-counter 14017, :output-counter 661, :reduction-ratio 0.9528429763858172}
{:msg "Deduplication counts for release", :release-date "2021-10-25", :input-counter 354014, :output-counter 338227, :reduction-ratio 0.04459428158208432}
{:msg "Deduplication counts for release", :release-date "2021-10-30", :input-counter 155142, :output-counter 2358, :reduction-ratio 0.984801021000116}
{:msg "Deduplication counts for release", :release-date "2021-11-07", :input-counter 80053, :output-counter 18923, :reduction-ratio 0.7636191023446967}
{:msg "Deduplication counts for release", :release-date "2021-11-13", :input-counter 49255, :output-counter 1423, :reduction-ratio 0.9711095320272054}
{:msg "Deduplication counts for release", :release-date "2021-11-21", :input-counter 99785, :output-counter 39956, :reduction-ratio 0.5995790950543669}
{:msg "Deduplication counts for release", :release-date "2021-11-30", :input-counter 684844, :output-counter 581557, :reduction-ratio 0.1508182885445445}
{:msg "Deduplication counts for release", :release-date "2021-12-04", :input-counter 177404, :output-counter 140718, :reduction-ratio 0.2067935334039819}
{:msg "Deduplication counts for release", :release-date "2021-12-12", :input-counter 67024, :output-counter 1672, :reduction-ratio 0.9750537121031272}
{:msg "Deduplication counts for release", :release-date "2021-12-18", :input-counter 37771, :output-counter 1113, :reduction-ratio 0.9705329485584178}
{:msg "Deduplication counts for release", :release-date "2021-12-25", :input-counter 35257, :output-counter 1902, :reduction-ratio 0.9460532660180957}
{:msg "Deduplication counts for release", :release-date "2022-01-04", :input-counter 145212, :output-counter 1905, :reduction-ratio 0.9868812494835137}
{:msg "Deduplication counts for release", :release-date "2022-01-09", :input-counter 247735, :output-counter 3224, :reduction-ratio 0.9869860940117464}
{:msg "Deduplication counts for release", :release-date "2022-01-15", :input-counter 59089, :output-counter 3740, :reduction-ratio 0.9367056474132242}
{:msg "Deduplication counts for release", :release-date "2022-01-22", :input-counter 41893, :output-counter 1166, :reduction-ratio 0.9721671878356766}
{:msg "Deduplication counts for release", :release-date "2022-01-29", :input-counter 64201, :output-counter 6712, :reduction-ratio 0.8954533418482578}
{:msg "Deduplication counts for release", :release-date "2022-02-05", :input-counter 64689, :output-counter 5096, :reduction-ratio 0.921223082749772}
{:msg "Deduplication counts for release", :release-date "2022-02-13", :input-counter 97223, :output-counter 2813, :reduction-ratio 0.9710665171821483}
{:msg "Deduplication counts for release", :release-date "2022-02-23", :input-counter 625527, :output-counter 3450, :reduction-ratio 0.9944846505426624}
{:msg "Deduplication counts for release", :release-date "2022-02-28", :input-counter 46845, :output-counter 4436, :reduction-ratio 0.9053047283594834}
{:msg "Deduplication counts for release", :release-date "2022-03-06", :input-counter 53178, :output-counter 1779, :reduction-ratio 0.9665463161457746}
{:msg "Deduplication counts for release", :release-date "2022-03-13", :input-counter 56937, :output-counter 6044, :reduction-ratio 0.8938475859283067}
{:msg "Deduplication counts for release", :release-date "2022-03-20", :input-counter 83964, :output-counter 5517, :reduction-ratio 0.9342932685436616}
{:msg "Deduplication counts for release", :release-date "2022-03-30", :input-counter 835313, :output-counter 192174, :reduction-ratio 0.7699377359145614}
{:msg "Deduplication counts for release", :release-date "2022-04-03", :input-counter 84396, :output-counter 2601, :reduction-ratio 0.9691810038390445}
{:msg "Deduplication counts for release", :release-date "2022-03-30", :input-counter 283537, :output-counter 4853, :reduction-ratio 0.9828840680405027}
{:msg "Deduplication counts for release", :release-date "2022-04-13", :input-counter 501164, :output-counter 150070, :reduction-ratio 0.7005571030640669}
{:msg "Deduplication counts for release", :release-date "2022-04-16", :input-counter 62640, :output-counter 731, :reduction-ratio 0.9883301404853129}
{:msg "Deduplication counts for release", :release-date "2022-04-25", :input-counter 676427, :output-counter 4470, :reduction-ratio 0.9933917481117696}
{:msg "Deduplication counts for release", :release-date "2022-04-30", :input-counter 43538, :output-counter 3740, :reduction-ratio 0.9140980293077312}
{:msg "Deduplication counts for release", :release-date "2022-05-07", :input-counter 43365, :output-counter 2875, :reduction-ratio 0.933702294477113}
{:msg "Deduplication counts for release", :release-date "2022-05-17", :input-counter 294422, :output-counter 6908, :reduction-ratio 0.9765370794302056}
{:msg "Deduplication counts for release", :release-date "2022-05-25", :input-counter 293160, :output-counter 18964, :reduction-ratio 0.9353117751398554}
{:msg "Deduplication counts for release", :release-date "2022-05-28", :input-counter 60827, :output-counter 5109, :reduction-ratio 0.9160076939516991}
{:msg "Deduplication counts for release", :release-date "2022-06-06", :input-counter 78187, :output-counter 13472, :reduction-ratio 0.8276951411359945}
{:msg "Deduplication counts for release", :release-date "2022-06-11", :input-counter 64008, :output-counter 1426, :reduction-ratio 0.977721534808149}
{:msg "Deduplication counts for release", :release-date "2022-06-19", :input-counter 135783, :output-counter 135783, :reduction-ratio 0.0}
{:msg "Deduplication counts for release", :release-date "2022-06-26", :input-counter 167203, :output-counter 136313, :reduction-ratio 0.1847454890163454}
{:msg "Deduplication counts for release", :release-date "2022-07-02", :input-counter 89367, :output-counter 2269, :reduction-ratio 0.9746103147694339}
{:msg "Deduplication counts for release", :release-date "2022-07-10", :input-counter 81828, :output-counter 3410, :reduction-ratio 0.9583272229554676}
{:msg "Deduplication counts for release", :release-date "2022-07-19", :input-counter 73230, :output-counter 5916, :reduction-ratio 0.9192134371159361}
{:msg "Deduplication counts for release", :release-date "2022-07-24", :input-counter 69067, :output-counter 674, :reduction-ratio 0.9902413598389969}
{:msg "Deduplication counts for release", :release-date "2022-08-01", :input-counter 63153, :output-counter 1574, :reduction-ratio 0.9750764017544693}
{:msg "Deduplication counts for release", :release-date "2022-08-13", :input-counter 395497, :output-counter 2664, :reduction-ratio 0.9932641714096441}
{:msg "Deduplication counts for release", :release-date "2022-08-24", :input-counter 373363, :output-counter 1956, :reduction-ratio 0.99476113058873}
{:msg "Deduplication counts for release", :release-date "2022-08-29", :input-counter 56399, :output-counter 1641, :reduction-ratio 0.970903739428004}
{:msg "Deduplication counts for release", :release-date "2022-09-03", :input-counter 92450, :output-counter 28015, :reduction-ratio 0.6969713358572202}
{:msg "Deduplication counts for release", :release-date "2022-09-10", :input-counter 53380, :output-counter 949, :reduction-ratio 0.9822218059198202}
{:msg "Deduplication counts for release", :release-date "2022-09-19", :input-counter 125302, :output-counter 91609, :reduction-ratio 0.2688943512473863}
]

theferrit32 commented 2 years ago

;; Sum the per-release input and output counters from the log above.
(-> "textPayloads.json.edn"
    slurp
    read-string
    (->> (reduce
          (fn [agg obj]
            ;; Debug output: running totals and the current log entry.
            (println :agg agg :obj obj)
            {:in (+ (:in agg) (:input-counter obj))
             :out (+ (:out agg) (:output-counter obj))})
          {:in 0 :out 0})))
==> {:in 16562369, :out 5945855}

Overall, the message count was reduced by a ratio of 0.641 (about 64%).
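For reference, the ratio is the fraction of input messages that were dropped, (in - out) / in. A small sketch of that arithmetic follows; the function name `reduction-ratio` is hypothetical, and the zero-input guard is an assumption rather than something the logs confirm.

```clojure
(defn reduction-ratio
  "Fraction of input messages removed by deduplication.
  Returns 0.0 for an empty input (assumed behavior)."
  [in out]
  (if (zero? in)
    0.0
    (double (/ (- in out) in))))

;; Totals from the reduce above: roughly 0.641.
(reduction-ratio 16562369 5945855)

;; First release, where nothing was removed.
(reduction-ratio 512507 512507)
```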

theferrit32 commented 2 years ago

SELECT count(*) FROM `clingen-stage.clinvar_2019_07_01_v1_1_0_m2.variation` LIMIT 1000
==> 512505

Adding the start and end release sentinels gives 512507, which matches the input count of the initial 2019-07-01 dataset.

theferrit32 commented 1 year ago

Some messages were removed from the first release, where there should be no duplicates. This needs deeper debugging to determine whether it is okay for those to have been removed. They could potentially be trait_mappings. Alternatively, the compound key for some entity type may be incorrect, so that distinct objects were wrongly identified as duplicates.

{:msg "Deduplication counts for release", :release-date "2019-07-01", :in-count 8614017, :out-count 8613437, :removed-ratio 6.733211694381379E-5}

theferrit32 commented 1 year ago

The duplicates within a release all appear to be trait_mappings. This happens when a submission contains one or more observations and so lists a trait multiple times. Each listing normalizes to the same clinvar trait with the same mapping field values, and also has the same clinical_assertion_id.
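A sketch of how such within-release duplicates collapse under a compound key. The key fields chosen here (:clinical_assertion_id plus the mapping fields) and the helper names are assumptions for illustration, not the project's actual key definition.

```clojure
(defn trait-mapping-key
  "Hypothetical compound key for a trait_mapping record."
  [m]
  (select-keys m [:clinical_assertion_id :trait_type :mapping_type
                  :mapping_ref :mapping_value]))

(defn dedupe-by
  "Keeps the first item for each distinct (key-fn item), preserving order."
  [key-fn coll]
  (first
   (reduce (fn [[out seen] item]
             (let [k (key-fn item)]
               (if (contains? seen k)
                 [out seen]
                 [(conj out item) (conj seen k)])))
           [[] #{}]
           coll)))

;; Illustrative data only: the same trait listed twice for one assertion.
(def mappings
  [{:clinical_assertion_id "SCV1" :trait_type "Disease"
    :mapping_type "Name" :mapping_ref "Preferred" :mapping_value "trait a"}
   {:clinical_assertion_id "SCV1" :trait_type "Disease"
    :mapping_type "Name" :mapping_ref "Preferred" :mapping_value "trait a"}
   {:clinical_assertion_id "SCV2" :trait_type "Disease"
    :mapping_type "Name" :mapping_ref "Preferred" :mapping_value "trait a"}])

;; Two records remain: one per clinical_assertion_id.
(count (dedupe-by trait-mapping-key mappings))
```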

theferrit32 commented 1 year ago

Added per-type counts. With the fix, this is the dedup count log for the 2019-07-01 release. The trait_mapping duplicate count of 13446 in this first release has been validated against the BigQuery table.

{:msg "Deduplication counts for release",
 :release-date "2019-07-01",
 :in-type-counts
 {"clinical_assertion_variation" 832130,
  "clinical_assertion_trait" 992823,
  "gene_association" 1426360,
  "gene" 32137,
  "clinical_assertion" 820724,
  "trait_mapping" 984984,
  "clinical_assertion_trait_set" 874714,
  "rcv_accession" 723122,
  "clinical_assertion_observation" 871375,
  "submission" 5407,
  "trait" 11946,
  "submitter" 1320,
  "variation" 512505,
  "variation_archive" 511904,
  "release_sentinel" 2,
  "trait_set" 12564},
 :out-type-counts
 {"clinical_assertion_variation" 832130,
  "clinical_assertion_trait" 992823,
  "gene_association" 1426360,
  "gene" 32137,
  "clinical_assertion" 820724,
  "trait_mapping" 971538,
  "clinical_assertion_trait_set" 874714,
  "rcv_accession" 723122,
  "clinical_assertion_observation" 871375,
  "submission" 5407,
  "trait" 11946,
  "submitter" 1320,
  "variation" 512505,
  "variation_archive" 511904,
  "release_sentinel" 2,
  "trait_set" 12564},
 :removed-type-counts
 {"clinical_assertion_variation" 0,
  "clinical_assertion_trait" 0,
  "gene_association" 0,
  "gene" 0,
  "clinical_assertion" 0,
  "trait_mapping" 13446,
  "clinical_assertion_trait_set" 0,
  "rcv_accession" 0,
  "clinical_assertion_observation" 0,
  "submission" 0,
  "trait" 0,
  "submitter" 0,
  "variation" 0,
  "variation_archive" 0,
  "release_sentinel" 0,
  "trait_set" 0},
 :in-count 8614017,
 :out-count 8600571,
 :removed-ratio 0.001560944214528483}
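A sketch of how per-type counts like these could be derived, assuming each record carries its type under an :entity_type key (an assumed field name). The function names are hypothetical. The removed counts are the per-type difference between the input and output counts.

```clojure
(defn type-counts
  "Counts records per entity type."
  [records]
  (frequencies (map :entity_type records)))

(defn removed-type-counts
  "Per-type difference between input and output counts. Types absent
  from out-counts are treated as fully removed."
  [in-counts out-counts]
  (merge-with - in-counts (select-keys out-counts (keys in-counts))))

;; Two of the types from the log above.
(removed-type-counts
 {"trait_mapping" 984984 "variation" 512505}
 {"trait_mapping" 971538 "variation" 512505})
;; => {"trait_mapping" 13446, "variation" 0}
```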