CodeForPittsburgh / food-access-map-data

Data for the food access map
MIT License

Proposal?: Adding intermediate files or logging information to better audit/debug run.sh #173

Closed maxachis closed 2 years ago

maxachis commented 3 years ago

The run.sh script makes substantial modifications to some rows of data -- type assignment, geocoding, and even merging of some duplicates. The source datasets should (in theory) persist, so we can look at the source data files to see what run.sh is combining. But we have no way of knowing how the scripts in the middle of run.sh -- everything after auto_agg_clean_data.R and before auto_merge_duplicates_wrapper.R -- are changing the data, except by seeing how it looks at the end in merged_datasets.csv.

In other words, we don't have a way to easily see how different scripts are modifying rows of the data. This might be a concern if/when we are trying to debug an issue and want to figure out what is causing a problem.
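To make the idea concrete, here is a minimal sketch of how two intermediate snapshots could be compared if each step wrote one out. It is in Python purely for illustration (the pipeline itself is R and shell), and the `id` key column, the column names, and the sample rows are all hypothetical, not taken from the real datasets.

```python
import csv
import io


def load_rows(csv_text, key="id"):
    """Index the rows of a CSV snapshot by a key column (hypothetical 'id')."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_snapshots(before, after):
    """Return (removed_ids, added_ids, changed) between two snapshots.

    changed maps row id -> {column: (old_value, new_value)}.
    """
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = {}
    for row_id in set(before) & set(after):
        old, new = before[row_id], after[row_id]
        delta = {
            col: (old.get(col), new.get(col))
            for col in sorted(set(old) | set(new))
            if old.get(col) != new.get(col)
        }
        if delta:
            changed[row_id] = delta
    return removed, added, changed


# Made-up snapshots taken before and after a geocoding-like step.
BEFORE = "id,name,latitude\n1,Corner Market,\n2,Fresh Stop,40.44\n"
AFTER = "id,name,latitude\n1,Corner Market,40.45\n2,Fresh Stop,40.44\n"

removed, added, changed = diff_snapshots(load_rows(BEFORE), load_rows(AFTER))
print(removed, added, changed)
```

A comparison like this only works if rows carry a stable identifier across steps, which is itself an assumption about the data.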

The biggest example of this being an issue is "auto_merge_duplicates_wrapper.R". This script combines rows identified as duplicates, replacing them with one merged row that should, in theory, have all the necessary information, resolving conflicts in some fields. In merged_datasets.csv, we only see the final output of this process -- the merged row. The pre-merge rows are gone from the dataset. So if you want to manually check that the merge is functioning in our production environment, your best bet is to run a modified version of the script from a local repository so that you can inspect the dataset pre-merge and post-merge. That takes time, and you may run into the issue of your local data not quite matching the data on GitHub.
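As a rough illustration of what keeping the pre-merge rows could look like, the sketch below merges each duplicate group and records the source rows next to the result. It is Python for illustration only and does not reflect how auto_merge_duplicates_wrapper.R works: the "first non-empty value wins" rule, the column names, and the sample rows are all assumptions.

```python
import json


def merge_group(rows):
    """Merge duplicate rows, keeping the first non-empty value per column.

    This rule is a stand-in for illustration, not the project's actual rule.
    """
    merged = {}
    for row in rows:
        for col, value in row.items():
            if not merged.get(col):  # column still missing or empty so far
                merged[col] = value
    return merged


def merge_with_log(groups):
    """Merge each duplicate group and record the pre-merge rows alongside."""
    merged_rows, log_entries = [], []
    for rows in groups:
        merged = merge_group(rows)
        merged_rows.append(merged)
        if len(rows) > 1:  # single-row groups are not merges, so skip them
            log_entries.append({"sources": rows, "result": merged})
    return merged_rows, log_entries


# Made-up duplicate group: the same store reported by two sources.
groups = [
    [
        {"name": "Corner Market", "address": "1 Main St", "phone": ""},
        {"name": "Corner Mkt", "address": "1 Main St", "phone": "555-0100"},
    ],
    [{"name": "Fresh Stop", "address": "9 Oak Ave", "phone": ""}],
]

merged_rows, log_entries = merge_with_log(groups)
for entry in log_entries:
    print(json.dumps(entry))  # one JSON line per merge; could be appended to a log file
```

Writing one line per merge keeps the main dataset unchanged while still letting someone check afterwards which rows were combined.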

IF we decide this is an issue worth investigating, there are a few avenues I see:

Of course, this is all an "if" -- none of it is necessary, and if we do have to debug or audit some work, there are other ways to do it, albeit ones that are a bit more time-consuming in the moment. And there's the typical caveat that any change to the code runs the risk of something else breaking, although some of the above options (such as the "log" column option) carry more of that risk than others.

maxachis commented 2 years ago

I've created a branch with a merge logging component to help with identifying rows that are merged. Its current output is seen below.

https://github.com/CodeForPittsburgh/food-access-map-data/blob/2022_05_10_Add_Merge_Logging/merge.log

If it looks good, I can go ahead and merge it.

maxachis commented 2 years ago

In addition to the above, I'll add that the merge log helped me identify a few rows that appear to be improperly merged -- typically stores on the same street. I don't want to make a separate issue for that while the merge log I'd use to point it out isn't part of the main branch, but I want to keep it on the radar as something to attend to.

maxachis commented 2 years ago

Merge logging added, so for now we will close this issue!

maxachis commented 2 years ago

Wait, false alarm. It looks like the merge_logging hasn't been added! I've reopened this and submitted a pull request.