CodeForPittsburgh / food-access-map-data

Data for the food access map
MIT License

unit testing the dedupe process #60

Closed: hellonewman closed 2 years ago

hellonewman commented 3 years ago

We want to get the dedupe function as accurate as possible. @maxachis is working on some documentation for the tests. Anyone is welcome to work on testing!

maxachis commented 3 years ago

Here's the current status of the work, at least on my end:

We can talk about this stuff more at the meeting tomorrow.

maxachis commented 3 years ago

Another possibility is to convert my R script to a Python script, which would likely work better with Pytest. The data merges we're doing aren't terribly sophisticated, and Python can quite easily perform those data management actions (up to and including using a pseudo-R library to perform them), so I think it's worth contemplating.

hellonewman commented 3 years ago

@maxachis and @cgmoreno to see if there's an R testing framework that could be used instead of PyTest.

maxachis commented 3 years ago

Pull #69 adds the "Testthat" R unit testing framework, along with a GitHub action to run it.

So we now have a system where Python scripts can be tested with Pytest, and R scripts with Testthat. I also added documentation on how to create and run tests.

One disadvantage of the Testthat automation compared to the Pytest automation is that R takes a few minutes to install the packages the R scripts need, leading to an overall automation runtime of ~8 minutes, while Pytest currently finishes in under a minute. I wouldn't be surprised if there's a way to speed that up, but I didn't find it in my search.

My next step is likely to modularize my merge_duplicates.R script so that it is more test-friendly. As is, it has hard-coded references to input and output file locations, which makes it difficult to swap in different inputs to test. Modularization should hopefully also make it easier to pipeline.
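To illustrate the kind of modularization described above, here is a minimal Python sketch (Python only because it pairs with Pytest; the same split would apply in R). It is not the actual logic of merge_duplicates.R. The function names `merge_rows` and `merge_file`, the `group_id` column, and the "first non-empty value wins" rule are all assumptions made up for the example. The point is only the shape: the merge logic works on in-memory rows, and file locations are arguments rather than hard-coded.

```python
import csv


def merge_rows(rows, key="group_id"):
    """Collapse rows that share the same key (hypothetical column name),
    keeping the first non-empty value seen for each field."""
    merged = {}
    for row in rows:
        # Copy the first row of each group so the caller's data is not mutated.
        target = merged.setdefault(row[key], dict(row))
        for field, value in row.items():
            if not target.get(field) and value:
                target[field] = value
    return list(merged.values())


def merge_file(input_path, output_path, key="group_id"):
    """Thin I/O wrapper: paths are parameters, so a test can pass temp files."""
    with open(input_path, newline="") as f:
        rows = list(csv.DictReader(f))
    result = merge_rows(rows, key)
    if result:
        with open(output_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(result[0].keys()))
            writer.writeheader()
            writer.writerows(result)
    return result
```

With this split, a unit test can call `merge_rows` on a handful of hand-written rows without touching the real data files, and a pipeline step can call `merge_file` with whatever paths it needs.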

conorotompkins commented 3 years ago

@conorotompkins will see if auto_text_process_name.R can be improved to catch more cases. Use data processed by auto_agg_clean_data.R from "food-data/Cleaned_data_files".

conorotompkins commented 3 years ago

@hellonewman @cgmoreno What level of missingness are we willing to accept for "type"? It is at 20% missing right now.

hellonewman commented 3 years ago

@conorotompkins sorry I only replied in my head! 20% seems reasonable to me.
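If we want to keep an eye on that number, the 20% figure could be turned into an automated check. This is only a sketch: the helper name `missing_fraction` is hypothetical, and treating empty strings and the literal text "NA" as missing is an assumption about how the cleaned files encode missing values.

```python
def missing_fraction(rows, column):
    """Return the share of rows whose value for `column` is absent,
    empty, or the literal string 'NA' (assumed missing-value marker)."""
    if not rows:
        return 0.0
    missing = sum(
        1 for row in rows if (row.get(column) or "").strip() in ("", "NA")
    )
    return missing / len(rows)


# Tiny made-up sample: one of five rows has no "type".
sample = [
    {"name": "A", "type": "grocery"},
    {"name": "B", "type": ""},
    {"name": "C", "type": "market"},
    {"name": "D", "type": "pantry"},
    {"name": "E", "type": "grocery"},
]

# Fails if missingness for "type" rises above the accepted 20%.
assert missing_fraction(sample, "type") <= 0.20
```

A test like this could run against the processed output so a regression in the text processing shows up as a failed check rather than a surprise on the map.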

hellonewman commented 3 years ago

4/13: MAX and MATT to discuss whether machine learning can replace the need for unit testing.

maxachis commented 3 years ago

To clarify, the machine learning is optimizing the duplicate identification, not replacing the unit testing. So unit testing is still going on, and we can always add more tests.

maxachis commented 3 years ago

To-Do List for Unit Tests:

We can always add more unit tests, but I think just giving some basic coverage to all the functions in our data flow would be a good idea.

hellonewman commented 2 years ago

Larry fixed!