CodeForPittsburgh / food-access-map-data

Data for the food access map
MIT License

Overhaul/Remove test_id_duplicates? #216

Closed · maxachis closed this issue 2 years ago

maxachis commented 2 years ago

The unit testing for ID duplicates is problematic, for the reasons below:

  1. Current unit testing relies on a "test dataset" that is over a year out of date and has not gone through the same data cleaning/prep work as our current dataset. So whether a test fails or passes, we don't know if that result generalizes to the current dataset.
  2. Because the deduplication software performs comparisons across the entire dataset, we can't just hand it 10 or so rows and ask "Are any of these duplicates?" A small sample doesn't reproduce the environment of the full dataset.
  3. The deduplication software is machine learning-based -- it relies on training data (labeled duplicates and non-duplicates). So if a test fails, there's no straightforward fix: the remedy is improving the training data, which may not map cleanly to specific test cases.
  4. The current unit testing code for deduplication is pretty wonky. If a test fails, there's no easy way to tell which rows were properly vs. improperly identified as duplicates, which makes it harder to work backward and identify the problem.
  5. Similarly, current unit testing references rows from the test dataset by their ID number rather than by the content of the rows themselves. So if someone mixes up row numbers, or the IDs are off by one, a test will fail for a trivial reason (see the sketch after this list for a content-based lookup instead).
  6. Previously, tests have broken despite no changes being made to the ID deduplication code. This is due to the weirdness of machine learning, not necessarily because we screwed something up. But it will still be logged as a test failure and will slow down pushing things to development.
  7. I previously set up merge logging code that can tell us -- in much more detail -- what rows were identified as duplicates and merged. Combining that with the previously mentioned training may make more sense than doing unit testing.
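
Regarding point 5, a minimal sketch of what content-based lookup could look like (the column names `name` and `address` here are hypothetical, not necessarily what our dataset uses):

```python
import pandas as pd

def find_row(df: pd.DataFrame, name: str, address: str) -> pd.Series:
    """Look up a single test row by its content instead of a hard-coded ID."""
    matches = df[(df["name"] == name) & (df["address"] == address)]
    assert len(matches) == 1, f"expected exactly one match, got {len(matches)}"
    return matches.iloc[0]
```

That way, reordering or renumbering the test dataset wouldn't break the test on its own.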

For all these reasons, I think it may make sense to remove the unit tests for the deduplication part. If we don't want that, we'll need to do some substantial overhauls to make them more comprehensible and less intrusive -- for example, reporting statistics on how many rows were identified as duplicates, or simply checking the Merge Logging output to confirm that rows we expect to be merged are actually being merged.
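
As a rough sketch of that alternative (the merge-log format here is an assumption: one row per merge with hypothetical `kept_id` and `dropped_id` columns), something like this could summarize the merges and spot-check expected pairs:

```python
import pandas as pd

def summarize_merges(log_path: str, expected_pairs: list[tuple[str, str]]) -> None:
    """Print merge statistics and check that expected duplicate pairs were merged."""
    log = pd.read_csv(log_path)
    print(f"{len(log)} rows were merged into {log['kept_id'].nunique()} surviving records")
    merged_pairs = set(zip(log["kept_id"], log["dropped_id"]))
    for kept, dropped in expected_pairs:
        status = "OK" if (kept, dropped) in merged_pairs else "MISSING"
        print(f"{status}: expected {dropped} to be merged into {kept}")
```

This would give us visibility into the dedup behavior without a brittle pass/fail test tied to an outdated dataset.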

hellonewman commented 2 years ago

Larry fixed!