Add Actual Source Field Prioritization?

maxachis commented 3 years ago

Currently, we have a file called "data_prep_scripts/source_field_prioritization_sample_data.csv". This file, which is used by auto_merge_duplicates_wrapper.R, was designed to provide an example of how we would prioritize different data sets that give different values for certain attributes of the same location, such as whether SNAP was or was not accepted. Essentially, if two addresses were found to be duplicates of the same location, we would look at the source organization of each row and use a source field prioritization chart to decide which of the two source organizations has the correct value for an attribute.

The file currently being passed in is still only the sample data, which is to say it does not reflect how we might actually prioritize sources. Because we have duplicates, we will likely have conflicts over certain attributes, so we probably want to determine what the actual source field prioritization is. Alternatively, we can decide not to use such prioritization at all and just go with the attributes of whichever row comes first in a set of duplicates being merged.
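To make the two options concrete, here is a minimal sketch in Python (not the project's R code, and not how auto_merge_duplicates_wrapper.R is actually written). The chart layout, the column names `source_org` and `snap`, and the source names `source_a` and `source_b` are all hypothetical. It starts from the first row in a set of duplicates and, for any attribute listed in the chart, overrides that value with the one from the highest-priority source that has it.

```python
# Hypothetical priority chart: for each attribute, source organizations
# listed from most trusted to least trusted. Names are made up.
PRIORITY = {
    "snap": ["source_a", "source_b"],
}


def merge_duplicates(rows, priority):
    """Merge rows already judged to be duplicates of one location.

    Starts from the first row (the "no prioritization" fallback), then for
    each attribute in the chart takes the value from the highest-priority
    source that actually provides one.
    """
    merged = dict(rows[0])  # fallback: keep the first row's attributes
    for field, ranked_sources in priority.items():
        for source in ranked_sources:
            match = next(
                (r for r in rows
                 if r["source_org"] == source and r.get(field) is not None),
                None,
            )
            if match is not None:
                merged[field] = match[field]
                break  # stop at the first (highest-priority) source found
    return merged


# Two rows for the same address that disagree on SNAP acceptance.
rows = [
    {"source_org": "source_b", "address": "1 Main St", "snap": False},
    {"source_org": "source_a", "address": "1 Main St", "snap": True},
]

with_chart = merge_duplicates(rows, PRIORITY)  # "snap" comes from source_a
first_row_only = merge_duplicates(rows, {})    # everything from the first row
```

With an empty chart the function reduces to "take the first row", so the two options differ only in what the chart contains. Note that in this sketch `source_org` on the merged row still reflects the first row, even when individual attributes came from another source.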

hellonewman commented 3 years ago

Uploaded new source prioritization file for @maxachis to check out.

maxachis commented 3 years ago

The unit tests pass, and running the merged_dataset shows the merges occurring. I was originally just going to say "Looks good to me" and then merge the branches, but I think it's better to be safe and have a look at the rows prior to merge to make sure they are merging as expected. It'd be convenient, at least in this unique situation, if we had a method to better audit this stuff, which I describe in Issue #173. Still, for now I'll need to do some manual verification before I'm totally comfortable with merging.

hellonewman commented 3 years ago

@maxachis Any chance you'd be able to take a dive into this, or nominate someone to do a final check before we merge?