Closed by maxachis 2 years ago
Uploaded new source prioritization file for @maxachis to check out.
The unit tests pass, and running the merged_dataset shows the merges occurring. I was originally just going to say "Looks good to me" and then merge the branches, but I think it's better to be safe and look at the rows prior to the merge to make sure they are merging as expected. It'd be convenient -- at least in this unique situation -- if we had a method to better audit this stuff, which I describe in Issue #173. Still, for now I'll need to do some manual verification before I'm totally comfortable with merging.
@maxachis Any chance you'd be able to take a dive into this, or nominate someone to do a final check before we merge?
It's taken me 6-7 months to get to this, but I'm having a look at adding some basic logging to indicate what rows are being merged.
Never let it be said that I am not a punctual person.
I did manage to put together some logging to indicate what rows are being merged!
If it looks acceptable, we can merge the branches, and I think with that we'll be good for closing this issue.
Merge logging has been added, so Max (that's me) needs to pull that merge logging script into the branch for source field prioritization and see that everything is merging as expected.
Merged the master branch into the source field prioritization branch. Currently running the merge dataset GitHub Action to see how it all looks!
False alarm -- merge logging has not yet been added. Waiting for that before I move forward. Once issue #173 closes, I can go ahead with this.
Here's how I've tested:
I look at a case of two rows being merged (for example "PITTSBURGH URBAN GARDENING PROJECT" and the misspelled "PITTSBUGH URBAN GRADEN PROJECT"), look at their source files (cleaned_growpgh.csv and FMNPMarkets.csv, respectively), and pick a value where they conflict (the FMNP flag, set to 0 and 1, respectively).
Then I check the FMNP flag in source_field_prioritization.csv. I note that only FMNPMarkets.csv has a value there, while cleaned_growpgh.csv does not, so FMNPMarkets.csv should take priority and the final value of FMNP should be 1.
However, that's not what happens in this case! The final value is 0. So something's off, and must be investigated!
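For anyone who wants to repeat this check, here is a rough sketch of the reasoning above. It's in Python rather than R purely for brevity, and the layout of source_field_prioritization.csv is an assumption on my part (one row per source, one column per field, a blank cell meaning "no priority", lower number meaning higher priority). The helper names `load_priorities` and `expected_value` are hypothetical and not part of the repo.

```python
import csv
import io

# ASSUMED layout of source_field_prioritization.csv: one row per source file,
# one column per field; a blank cell means the source has no priority there.
PRIORITIZATION_CSV = """source,FMNP
cleaned_growpgh.csv,
FMNPMarkets.csv,1
"""

def load_priorities(text):
    """Return {source: {field: rank or None}} parsed from the CSV text."""
    priorities = {}
    for row in csv.DictReader(io.StringIO(text)):
        source = row.pop("source")
        priorities[source] = {
            field: int(cell) if cell.strip() else None
            for field, cell in row.items()
        }
    return priorities

def expected_value(field, values_by_source, priorities):
    """Pick the value from the source with the best (lowest) rank for this field."""
    ranked = [
        (priorities[src][field], value)
        for src, value in values_by_source.items()
        if priorities.get(src, {}).get(field) is not None
    ]
    if not ranked:
        return None  # no source has a priority for this field
    return min(ranked, key=lambda pair: pair[0])[1]

priorities = load_priorities(PRIORITIZATION_CSV)
# The conflicting FMNP values from the two source files in the example above.
conflict = {"cleaned_growpgh.csv": 0, "FMNPMarkets.csv": 1}
expected = expected_value("FMNP", conflict, priorities)
merged_fmnp = 0  # the value actually observed in the merged output
print(f"expected FMNP={expected}, merged FMNP={merged_fmnp}, match={expected == merged_fmnp}")
```

With the values from this example, the expected FMNP is 1 and the merged value is 0, so the check reports a mismatch.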
Larry fixed!
Currently, we have a file called "data_prep_scripts/source_field_prioritization_sample_data.csv". This file, which is called by auto_merge_duplicates_wrapper.R, was designed to provide an example of how we would prioritize different data sets that give different values for certain attributes of the same location, such as whether SNAP was or was not accepted. Essentially, if two addresses were found to be duplicates of the same location, we would look at the source organization of each row and use a source field prioritization chart to determine which of the two source organizations we treat as having the correct value for an attribute.
The file currently being passed as the argument is still only the sample data, which is to say it doesn't reflect how we might actually prioritize sources. Because we have duplicates, we will likely have conflicts over certain attributes, so we probably want to determine what the actual source field prioritization is (or we can decide we won't use such prioritization at all and just go with the attributes of whichever row comes first in a set of duplicates being merged).
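To make the two options concrete, here is a minimal sketch of how a merge could apply a prioritization chart and fall back to the first row when a field has no prioritization. It's in Python for brevity and does not reflect how auto_merge_duplicates_wrapper.R actually works. The `merge_duplicates` function, the `{field: [sources...]}` structure, and the `org_a`/`org_b` sources are all hypothetical, and the sample values are made up for illustration.

```python
def merge_duplicates(rows, priorities):
    """Merge a set of duplicate rows into a single row.

    rows: list of dicts, each with a "source" key plus attribute fields.
    priorities: {field: [source, ...]}, ordered from most to least trusted
                (a hypothetical structure, not the repo's actual format).
    """
    merged = dict(rows[0])  # default: take every attribute from the first row
    for field in merged:
        if field == "source":
            continue
        # Walk the prioritized sources and take the first one that has a value.
        for source in priorities.get(field, []):
            match = next(
                (r for r in rows if r["source"] == source and r.get(field) is not None),
                None,
            )
            if match is not None:
                merged[field] = match[field]
                break
    return merged

# Illustrative data only: two duplicate rows that disagree on SNAP and FMNP.
rows = [
    {"source": "org_a", "name": "EXAMPLE MARKET", "SNAP": 0, "FMNP": 0},
    {"source": "org_b", "name": "EXAMPLE MARKET", "SNAP": 1, "FMNP": 1},
]
priorities = {"SNAP": ["org_b", "org_a"]}  # no entry for FMNP

merged = merge_duplicates(rows, priorities)
print(merged)
```

Here SNAP comes from org_b because the chart ranks it first, while FMNP has no chart entry and so keeps the first row's value. Passing an empty `priorities` dict gives the "just use the first row" behavior for every field.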