CodeForPittsburgh / food-access-map-data

Data for the food access map
MIT License

Fix Weird Merge Logic #203

Closed maxachis closed 2 years ago

maxachis commented 2 years ago

If you take a look at the Merge.log file, you may note that some rows appear to be improperly merged -- they can be separate entities on the same street, or they can be the same location serving different functions with different types -- for example, a Farmer's Market and a Grow PGH Garden at the same location.

We need to figure out how to clarify this for the deduper.

maxachis commented 2 years ago

A simple way to do this would be to refuse to merge two entities of different types. I'll work on implementing that.

maxachis commented 2 years ago

OR, I simply add a new parameter to the existing merge logic so that type is taken into account as well. I will do that instead.
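A minimal sketch of the idea, assuming records are plain dicts with a `type` field; `should_merge`, the field names, and the threshold are all illustrative, not the project's actual API:

```python
# Hypothetical sketch: fold entity type into the merge decision so that
# two records at the same address but of different types are never merged.

def should_merge(rec_a: dict, rec_b: dict, similarity: float,
                 threshold: float = 0.8) -> bool:
    """Treat two records as duplicates only if they score above the
    similarity threshold AND share the same entity type."""
    if rec_a.get("type") != rec_b.get("type"):
        # e.g. a farmer's market vs. a garden at the same address
        return False
    return similarity >= threshold

# Same address, different types: never merged, regardless of score.
market = {"name": "Farmer's Market", "address": "123 Main St",
          "type": "farmer's market"}
garden = {"name": "Grow PGH Garden", "address": "123 Main St",
          "type": "grow pgh garden"}
```

The alternative in the first comment (an outright refusal) is the degenerate case of this: the type check short-circuits before the similarity score is even consulted.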

maxachis commented 2 years ago

I am waiting on merge logging in #173 to complete before I move forward with this -- it'll be easier to see how this affects merge logic once I can do that.

maxachis commented 2 years ago

With merge logging, I'm pinpointing problematic merges. Fortunately, the nature of the merge system means I can just put this in as a test case of when to NOT consider them duplicates. I just need to update the training.json file to mark these pairs as distinct.
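For illustration, recording a known non-duplicate pair as a "distinct" training example might look like the sketch below. The exact schema of training.json depends on the dedupe library version in use, so `add_distinct_pair` and the match/distinct layout are assumptions, not the project's actual format:

```python
# Hypothetical sketch of appending a "distinct" pair to training data
# shaped like {"match": [...], "distinct": [...]}.
import json

def add_distinct_pair(training: dict, rec_a: dict, rec_b: dict) -> dict:
    """Record rec_a/rec_b as a labeled non-duplicate pair."""
    training.setdefault("distinct", []).append([rec_a, rec_b])
    return training

training = {"match": [], "distinct": []}
market = {"name": "Farmer's Market", "address": "123 Main St"}
garden = {"name": "Grow PGH Garden", "address": "123 Main St"}
add_distinct_pair(training, market, garden)

# Persist back to training.json so the deduper learns the distinction.
blob = json.dumps(training, indent=2)
```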

maxachis commented 2 years ago

I probably also need to update the merge logging tests to something that's not such a pain. As is, relying on an external CSV that's a sample of a now year-old dataset has some problems. If we want to add new test cases based on new data, we'd have to modify that CSV, and during testing you have to switch back and forth between the CSV and the test cases to make sure you're referring to the right rows.

I think it needs to be refactored so the deduper is tested on small datasets created within the test file, rather than relying on a large external test CSV. That might mean the deduper isn't being tested on a perfect simulacrum of the actual dataset, but it does make it easier to modify, add, and remove test cases.
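A sketch of what such an in-file fixture could look like. `run_dedupe` here is a trivial stand-in (exact match on normalized name + address), not the project's real pipeline; in practice the test would import the real entry point instead:

```python
# Sketch: test the deduper on tiny fixtures defined in the test file
# itself, instead of referencing rows of a large external CSV.

def normalize(s: str) -> str:
    """Lowercase and strip whitespace for a crude comparison key."""
    return "".join(s.lower().split())

def run_dedupe(records):
    """Stand-in deduper: cluster records whose normalized
    name and address match exactly."""
    clusters = {}
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["address"]))
        clusters.setdefault(key, []).append(rec["id"])
    return list(clusters.values())

# Fixture lives right next to the assertions, so adding a case
# means adding a row here, not editing an external file.
RECORDS = [
    {"id": 1, "name": "Grow PGH Garden", "address": "123 Main St"},
    {"id": 2, "name": "Grow PGH  Garden", "address": "123 Main St"},  # duplicate of 1
    {"id": 3, "name": "Farmer's Market", "address": "123 Main St"},   # same address, distinct entity
]

def test_same_address_different_entity_stays_separate():
    clusters = run_dedupe(RECORDS)
    merged = next(c for c in clusters if 1 in c)
    assert 2 in merged      # true duplicates collapse
    assert 3 not in merged  # different entity at the same address survives

test_same_address_different_entity_stays_separate()
```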

At any rate, this is low priority, because the duplicate data isn't the most egregious problem we have.

hellonewman commented 2 years ago

Larry fixed!