Here's the current status of the work, at least on my end:
We can talk about this stuff more at the meeting tomorrow.
Another possibility is to convert my R script to a Python script, which would likely work better with pytest. Since the data merges we're doing aren't terribly sophisticated, and Python can easily perform those data management actions (up to and including using a pseudo-R library to perform them), I think it's worth contemplating.
@maxachis and @cgmoreno to see if there's an R testing framework that could be used instead of pytest.
Pull #69 adds the testthat R unit testing framework, along with a GitHub Action to run it.
So we now have a system where Python scripts can be tested with pytest, and R scripts with testthat. I also added documentation on how to create and run tests.
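For anyone who hasn't written a testthat test before, here's a rough sketch of what a test file can look like. The file path and the example data are placeholders, not necessarily how the repo is laid out:

```r
# tests/testthat/test-merge.R
library(testthat)

test_that("a full outer merge keeps every unique id", {
  a <- data.frame(id = c(1, 2), name = c("Pantry A", "Pantry B"))
  b <- data.frame(id = c(2, 3), name = c("Pantry B", "Pantry C"))

  merged <- merge(a, b, all = TRUE)  # base R full outer join

  expect_equal(sort(unique(merged$id)), c(1, 2, 3))
})
```

Locally these can be run with `testthat::test_dir("tests/testthat")`, or `devtools::test()` if the repo is set up as a package; see the docs added in #69 for the exact commands we use.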
One disadvantage of the testthat automation compared to the pytest automation is that installing the necessary R packages takes a few minutes, which puts the overall workflow runtime at roughly 8 minutes, while the pytest workflow currently finishes in under a minute. There may be a way to speed that up (caching the installed packages between runs, for example), but I didn't find it in my search.
My next step is likely to modularize my merge_duplicates.R script so that it is more test-friendly. As written, it has hard-coded references to input and output file locations, which makes it difficult to swap in different inputs for testing. Modularization should also make it easier to pipeline.
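Roughly what I have in mind (just a sketch, the function signature and column names are hypothetical until the refactor actually lands):

```r
# Hypothetical refactor sketch (not the current merge_duplicates.R).
# The idea is to take the file locations, and the columns that define a
# duplicate, as arguments so tests can pass in small fixture files.
merge_duplicates <- function(input_path, output_path,
                             key_cols = c("name", "address")) {
  dat <- read.csv(input_path, stringsAsFactors = FALSE)

  # Placeholder dedupe: keep the first row for each key combination.
  # The real script would call whatever matching logic we settle on.
  deduped <- dat[!duplicated(dat[, key_cols]), ]

  write.csv(deduped, output_path, row.names = FALSE)
  invisible(deduped)
}
```

With that shape, a test can point the function at a tiny fixture CSV and a `tempfile()` for output instead of the real data files.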
@conorotompkins will see if auto_text_process_name.R can be improved to catch more cases. Use data processed by auto_agg_clean_data.R from "food-data/Cleaned_data_files".
@hellonewman @cgmoreno What level of missingness are we willing to accept for "type"? It is at 20% missing right now.
@conorotompkins sorry I only replied in my head! 20% seems reasonable to me.
4/13: MAX and MATT to discuss whether machine learning can replace the need for unit testing.
To clarify, the machine learning is optimizing the duplicate identification, not replacing the unit testing. Unit testing is still going on, and we can always add more tests.
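Along those lines, a test that the dedupe step collapses obvious duplicates could look something like this (this assumes the hypothetical merge_duplicates() signature from the sketch above, and the fixture data is made up):

```r
library(testthat)

test_that("exact duplicate rows collapse to a single record", {
  input  <- tempfile(fileext = ".csv")
  output <- tempfile(fileext = ".csv")

  # Two identical rows should come out as one.
  write.csv(
    data.frame(name    = c("Pantry A", "Pantry A"),
               address = c("123 Main St", "123 Main St"),
               stringsAsFactors = FALSE),
    input, row.names = FALSE
  )

  merge_duplicates(input, output)

  result <- read.csv(output, stringsAsFactors = FALSE)
  expect_equal(nrow(result), 1)
})
```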
To-Do List for Unit Tests:
We can always add more unit tests, but I think just giving some basic coverage to all the functions in our data flow would be a good idea.
Larry fixed!
We want to get the dedupe function as accurate as possible. @maxachis is working on some documentation for the tests. Anyone is welcome to work on testing!