dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

Add Gazetteer Postgres example #109

Closed jeancochrane closed 4 years ago

jeancochrane commented 4 years ago

Overview

Add a new Gazetteer example to perform static matching backed by a Postgres database with updates.

In addition, containerize the Gazetteer examples.

Notes

A requirement of this example is to support Dedupe 1.x. Once this PR is approved and merged, and confirmed to work for the folks who requested it, I'll port the example to Dedupe 2.x.

Testing instructions

jeancochrane commented 4 years ago

@fgregg This PR overrides the index() and _blockData() methods to use a database backend. I decided to override _blockData() because it seemed like the key piece of both match() and threshold() that needed to be refactored to use the database, but let me know if there are other pieces I missed. Curious to get your high-level take before I wire it up to make sure it runs.
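For readers following along, here is a minimal sketch of the general idea, not the code in this PR. It assumes a hypothetical `DatabaseGazetteer` class that does not subclass anything from dedupe, and it uses the standard library's `sqlite3` as a stand-in for Postgres so that it runs on its own. The method names `index()` and `_blockData()` mirror the ones mentioned above; the table names, the `simple_block_keys` helper, and the sample records are all illustrative assumptions.

```python
import json
import sqlite3


def simple_block_keys(record):
    """Hypothetical blocking rule: first token of the name, plus the zip code."""
    keys = []
    name_tokens = record.get("name", "").lower().split()
    if name_tokens:
        keys.append("name:" + name_tokens[0])
    if record.get("zip"):
        keys.append("zip:" + record["zip"])
    return keys


class DatabaseGazetteer:
    """Illustrative stand-in: keeps the blocking index in a database table
    instead of an in-memory dictionary. Not the dedupe API."""

    def __init__(self, conn, block_keys):
        self.conn = conn
        self.block_keys = block_keys
        conn.execute(
            "CREATE TABLE IF NOT EXISTS canonical "
            "(record_id TEXT PRIMARY KEY, record TEXT)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS blocking_map "
            "(block_key TEXT, record_id TEXT)"
        )
        conn.execute(
            "CREATE INDEX IF NOT EXISTS blocking_map_key_idx "
            "ON blocking_map (block_key)"
        )

    def index(self, data):
        """Write canonical records and their blocking keys to the database."""
        with self.conn:  # commits on success, rolls back on error
            for record_id, record in data.items():
                self.conn.execute(
                    "INSERT OR REPLACE INTO canonical VALUES (?, ?)",
                    (record_id, json.dumps(record)),
                )
                # Drop stale keys so re-indexing an updated record is safe.
                self.conn.execute(
                    "DELETE FROM blocking_map WHERE record_id = ?",
                    (record_id,),
                )
                self.conn.executemany(
                    "INSERT INTO blocking_map VALUES (?, ?)",
                    [(key, record_id) for key in set(self.block_keys(record))],
                )

    def _blockData(self, messy_data):
        """Yield each messy record with the canonical records sharing a key."""
        for messy_id, messy_record in messy_data.items():
            keys = sorted(set(self.block_keys(messy_record)))
            if not keys:
                continue
            placeholders = ", ".join("?" for _ in keys)
            rows = self.conn.execute(
                "SELECT DISTINCT c.record_id, c.record "
                "FROM blocking_map b "
                "JOIN canonical c ON c.record_id = b.record_id "
                f"WHERE b.block_key IN ({placeholders}) "
                "ORDER BY c.record_id",
                keys,
            ).fetchall()
            if rows:
                candidates = [(rid, json.loads(rec)) for rid, rec in rows]
                yield (messy_id, messy_record), candidates


# Usage with made-up records.
conn = sqlite3.connect(":memory:")
gazetteer = DatabaseGazetteer(conn, simple_block_keys)
gazetteer.index({
    "c1": {"name": "Acme Hardware", "zip": "60601"},
    "c2": {"name": "Beta Supply", "zip": "60601"},
    "c3": {"name": "Gamma Foods", "zip": "10001"},
})
blocks = list(
    gazetteer._blockData({"m1": {"name": "ACME Hardware Inc", "zip": "60614"}})
)
```

The point of the sketch is that candidate lookup becomes a join against an indexed table, so the canonical set can be updated with further `index()` calls without rebuilding an in-memory structure.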

fgregg commented 4 years ago

looks very good.

jeancochrane commented 4 years ago

Alright, this should be ready for another look @fgregg. Since we have to support both the 1.x and 2.x APIs in this example directory, I went ahead and containerized the Gazetteer examples.

fgregg commented 4 years ago

Let's just keep this on dedupe 1.10.

We can bring a dedupe 2.0 version into master, and let's not do Docker, since we don't have that with any of the other examples.

jeancochrane commented 4 years ago

@fgregg The issue is that the existing Gazetteer example is dedupe 2.x, but we need to support dedupe 1.x in the new example. However, I'm trying to bootstrap on the existing code sample so that we don't have to duplicate all of the training code in the new example using the 1.x API.

I think our options here are:

fgregg commented 4 years ago

Got it! Thank you for helping me see that context.

I'd recommend duplicating the training code from the existing example in the 1.x example.

jeancochrane commented 4 years ago

Makes sense! I'll go ahead with that then.

jeancochrane commented 4 years ago

@fgregg I stripped out the Docker environment, updated the instructions to explain how to manage environments and run scripts with a local Python installation, and updated the Postgres script to be fully independent. I also added some more explanatory comments to the StaticDatabaseGazetteer class. Let me know how everything looks to you.

How would you like to handle bringing this into master? I don't love having two examples using two different versions of dedupe. The clearest path to me seems to be to bring this into master, share it via a commit ID with the folks who requested it, and then bring in a new PR that updates to 2.0 once we've confirmed that it worked for them. We could also consider having a long-lived branch to share with them, but I don't like the maintenance burden that that implies.