NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0
12 stars 3 forks source link

Add Address Parsing and Comparison with Postal #40

Open jstammers opened 1 month ago

jstammers commented 1 month ago

This PR implements a method parse an address string using postal.

As this is a fairly heavy dependency particularly given the model data required by libpostal, it is included as an optional package.

A comparison function is also implemented that classifies the level of comparison between each field in the address. Currently, this returns a StructColumn, but it may be more convenient to return multiple columns using t.lift(), or define a Comparer to classify the total level of agreement between the two addresses.

Since postal only accepts scalar values for parsing and classifying dupes, it's likely that these functions might scale poorly when dealing with millions of records.

Finally, I have included a function to calculate the Levenshtein ratio between two strings that replicates the implementation here

NickCrews commented 1 month ago

I just updated main underneath you, so you're going to get merge conflicts, sorry! let me know if you want me to do the rebase.

jstammers commented 1 month ago

Thanks for the quick feedback @NickCrews . I've implemented most of the suggestions so far. I've also added a hash_address function that I think could be useful for blocking on potentially similar addresses.

I haven't yet modified compare_addresses. I think this could be improved by first normalizing the address components using either postal.expand.expand_address or postal.expand.expand_address_root where applicable and looking for an intersection on those sets. This feels like it might clash with the existing normalize_address function, so perhaps something like expand_address_fields would be clearer

NickCrews commented 1 month ago

Lemme know when you want me to actually review this.

FYI I'm just thinking about how I want to be careful not putting too much "opinionated" code into the core. Eg someone else might want to do address parsing, cleaning, comparing differently, and I want to leave that option open to them. Possibly in the future some of this opinionated code I may want to move to a separate repo/package, but for now I think it makes sense to keep it on the core here so we can move fast and figure out the APIs. So keep doing what you're doing, but make sure your implementation keeps the door open for other people to add their own address processing stuff. Thank you!

jstammers commented 1 month ago

I agree that it's probably best to move some of this out of the core functionality, particularly if it enforces the use of a specific heavy-dependency such as postal. For now, I'll move a lot of the functionality that doesn't follow the API of the rest of the address module into my own codebase and let you know when what's left is ready for review

jstammers commented 1 month ago

@NickCrews I'm happy for this to be reviewed again whenever it is convenient for you

NickCrews commented 6 days ago

@jstammers sorry it took me so long to get to this. I have a lot of small requests, but the general approach looks great and I think will be a great addition! Thanks for taking the time to address my concerns.

The largest things are

jstammers commented 4 days ago

@NickCrews thanks for the feedback! Let me know what you think of these changes. At some point, I'd like to dig into the address parsing a little further. I've tested it out with a Japanese postal address and it seems like the parsed address isn't consistent with the Japanese address structure.

Following this example, I can see that there are indeed multiple postal components returned by the parser

from postal.parser import parse_address
parse_address("〒150-2345 東京都渋谷区本町2丁目4-7サニーマンション203", language="jp")

[('〒150-2345', 'postcode'),
 ('東', 'state'),
 ('京', 'city'),
 ('都渋谷', 'city_district'),
 ('区', 'city'),
 ('本町2丁目', 'suburb'),
 ('4-7', 'house_number'),
 ('サニーマンション203', 'house')]

Which is very different to the result I get when using the same address converted into english using the Japanese PO guidelines

parse_address("""Sunny Mansion #203
4-7 Hommachi 2-choume
Shibuya-ku, TOKYO 150-2345""")

[('sunny mansion', 'house'),
 ('#203', 'unit'),
 ('4-7', 'house_number'),
 ('hommachi', 'house'),
 ('2-choume', 'suburb'),
 ('shibuya-ku', 'city_district'),
 ('tokyo', 'city'),
 ('150-2345', 'postcode')]

I suspected this is related to the parser model not being sufficiently trained on japanese addresses, but perhaps one to note for future reference.

jstammers commented 3 days ago

Looking at some of the CI runs, I can see that getting postal to build and install properly is going to cause further headaches. I'll look into replicating the build steps for Mac OS X and Windows. If anyone knows of a convenient way to infer the OS so that the corresponding build script for postal could be run, that would be super helpful

NickCrews commented 2 days ago

I think it's fine if we only test postal on one OS, whichever is easiest. Then pytest.importorskip postal: https://docs.pytest.org/en/8.1.x/how-to/skipping.html#skipping-on-a-missing-import-dependency

NickCrews commented 2 days ago

@jstammers @lmores what do you think about renaming hash_address() to fingerprint_address()? Hash has a very specific connotation in CS (one way function that results in an int), and this isn't quite it. The downside is then this doesn't match the name of the postal function.

Also, what about prefixing/postfixinh these function names with postal, so it leaves the door open for us to add other implementations that do the same thing? Ie with the usaddress pypi package? I'm 50/50 on this idea

lmores commented 2 days ago

@NickCrews Here are my two cents:

a) If we do not go for the prefix/postfix, then renaming hash_address() to fingerprint_address() makes sense b) If we introduce "postal" as prefix/postfix I would suggest to use it as a prefix and stick to postal_hash_address() without renaming (though we could rename it in the future if we introduce another "address backend" to have uniform names after the prefix).

I am slightly in favor of using "postal" as prefix and follow (b), but it is not actually a strong preference.

jstammers commented 2 days ago

I'm leaning towards using a postal prefix/postfix to allow for future functions that implement address parsing/fingerprinting

NickCrews commented 2 days ago

Sounds good, let's go with the postal prefix! I actually lean decently towards postal_fingerprint_address instead of postal_hash_address, unless either of you have strong feelings against that, let's do that.

jstammers commented 2 hours ago

@NickCrews it seems that including postal within the dev dependencies is causing the CI step to fail to install the dev dependencies unless postal can be built. Would this be better off as an optional dependency?