unit testing the dedupe process

hellonewman commented 3 years ago

We want to get the dedupe function as accurate as possible. @maxachis is working on some documentation for the tests. Anyone is welcome to work on testing!

maxachis commented 3 years ago

Here's the current status of the work, at least on my end:

I added Pytest and a few basic tests, along with an automated workflow. As is, this won't do much, but it does show the basic components of the template. Once we have some Python data prep scripts in the repo, it shouldn't be much trouble to begin formal unit testing for them.
I ran into issues using Pytest to run R scripts. The problem wasn't in the script itself, but in the R libraries my script depends on. There's an art to combining R and Python scripts with all their dependencies that I haven't figured out yet.
I'm also looking into a parallel R automation, on the theory that it might be easier to run unit tests for Python scripts with Pytest and unit tests for R scripts with something else.
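
For anyone who hasn't used Pytest before, here is a minimal sketch of the shape a basic test file could take. The `normalize_name` helper is hypothetical and is not a function from this repo; it only stands in for whatever data prep function ends up being tested.

```python
# test_normalize_name.py
# Hypothetical example: normalize_name is NOT part of this repo. It stands in
# for any small, pure data prep function we might want to cover with tests.


def normalize_name(name: str) -> str:
    """Lowercase a site name and collapse runs of whitespace into one space."""
    return " ".join(name.lower().split())


# Pytest collects functions whose names start with "test_" and treats a
# failing assert as a failing test.
def test_normalize_name_collapses_whitespace():
    assert normalize_name("  Giant   Eagle ") == "giant eagle"


def test_normalize_name_is_idempotent():
    once = normalize_name("ALDI  Market")
    assert normalize_name(once) == once
```

Running `pytest` from the folder containing this file should pick up both tests by name.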

We can talk about this stuff more at the meeting tomorrow.

maxachis commented 3 years ago

Another possibility is to convert my R script to a Python script, which would likely work better with Pytest. Since the work we're doing with data merges isn't terribly sophisticated, and Python can quite easily perform those data management actions (up to and including using a pseudo-R library to perform them), I think it's worth contemplating.

hellonewman commented 3 years ago

@maxachis and @cgmoreno to see if there's an R testing framework that could be used instead of PyTest.

maxachis commented 3 years ago

Pull #69 adds the "Testthat" R unit testing framework, along with a GitHub Action to run it.

So we now have a system where Python scripts can be tested with Pytest, and R scripts with Testthat. I also added documentation on how to create and run tests.

One disadvantage with the Testthat automation compared to the Pytest automation is that it takes a bit longer for R to install the necessary packages for the R scripts--a few minutes, leading to an overall automation runtime of ~8 minutes, while pytest currently finishes in under a minute. I wouldn't be surprised if there's a way to speed that up, but I didn't find it in my search.

My next step is likely to modularize my merge_duplicates.R script so that it is more test-friendly. As is, it has hard-coded references to input and output file locations, which makes it difficult to swap in different inputs to test. Modularization should hopefully also make it easier to pipeline.
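
To illustrate the idea, here is a rough sketch of the "pure function plus thin wrapper" shape. It is written in Python rather than R purely for illustration, and it is not the logic of merge_duplicates.R: the `group_id` column, the "first non-empty value wins" rule, and the function names are all assumptions made up for this example.

```python
import csv


def merge_duplicates(rows):
    """Collapse rows that share a group_id into a single row.

    Assumed rule (hypothetical): for each column, keep the first non-empty
    value seen within the group. Takes and returns plain lists of dicts, so
    a test can call it without touching the file system.
    """
    merged = {}
    for row in rows:
        key = row["group_id"]  # hypothetical column marking duplicate groups
        if key not in merged:
            merged[key] = dict(row)
            continue
        for col, value in row.items():
            # Fill a gap only if the kept row has nothing in this column.
            if not merged[key].get(col) and value:
                merged[key][col] = value
    return list(merged.values())


def run(input_path, output_path):
    """Thin wrapper: all file locations are passed in, none are hard-coded."""
    with open(input_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    with open(output_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(merge_duplicates(rows))
```

With this split, a unit test can hand `merge_duplicates` a few in-memory rows, and a pipeline step can call `run` with whatever paths it needs.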

conorotompkins commented 3 years ago

@conorotompkins will see if auto_text_process_name.R can be improved to catch more cases. Use data processed by auto_agg_clean_data.R from "food-data/Cleaned_data_files".

conorotompkins commented 3 years ago

@hellonewman @cgmoreno what level of missingness are we willing to accept for "type"? It is at 20% missing right now.

hellonewman commented 3 years ago

@conorotompkins sorry I only replied in my head! 20% seems reasonable to me.
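
If we want to keep an eye on that number, a small check along these lines could work. This is only a sketch: the helper name is hypothetical, and treating `None`, an empty string, and the literal text `"NA"` as missing is an assumption about how the cleaned data encodes gaps.

```python
def missing_fraction(rows, column):
    """Return the share of rows where `column` is missing (0.0 to 1.0).

    Assumption: None, "" and the literal string "NA" all count as missing.
    """
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, "", "NA"))
    return missing / len(rows)


# Tiny made-up sample, not real project data: one of five rows has no type.
sample = [
    {"name": "a", "type": "supermarket"},
    {"name": "b", "type": "food bank"},
    {"name": "c", "type": ""},
    {"name": "d", "type": "convenience store"},
    {"name": "e", "type": "farmer's market"},
]

# The 20% threshold discussed above, written as a fraction.
MAX_MISSING = 0.20
assert missing_fraction(sample, "type") <= MAX_MISSING
```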

hellonewman commented 3 years ago

4/13: MAX and MATT to discuss whether machine learning can replace the need for unit testing.

maxachis commented 3 years ago

To clarify, the machine learning is optimizing the duplicate identification, not replacing it. So unit testing is still going on, and we can always add more tests.

maxachis commented 3 years ago

To-Do List for Unit Tests:

~Fill out existing stubs for test_id_duplicates.py~ DONE
~Add unit tests for auto_agg_clean_data.R~ NOT NEEDED
Add unit tests for auto_clean_addresses_wrapper.py (and break it up into wrapper and testable "function" files).
Add unit tests for auto_text_process (and break it up into wrapper and testable "function" files).
Look into whether merge_datasets could use additional unit tests.
Figure out if data prep scripts could use unit testing as well, or if that would not be necessary.

We can always add more unit tests, but I think just giving some basic coverage to all the functions in our data flow would be a good idea.

hellonewman commented 2 years ago

Larry fixed!

CodeForPittsburgh / food-access-map-data

unit testing the dedupe process #60