Open jstammers opened 1 month ago
I just updated main underneath you, so you're going to get merge conflicts, sorry! let me know if you want me to do the rebase.
Thanks for the quick feedback @NickCrews . I've implemented most of the suggestions so far. I've also added a hash_address
function that I think could be useful for blocking on potentially similar addresses.
I haven't yet modified compare_addresses
. I think this could be improved by first normalizing the address components using either postal.expand.expand_address or postal.expand.expand_address_root where applicable and looking for an intersection on those sets. This feels like it might clash with the existing normalize_address
function, so perhaps something like expand_address_fields
would be clearer
Lemme know when you want me to actually review this.
FYI I'm just thinking about how I want to be careful not putting too much "opinionated" code into the core. Eg someone else might want to do address parsing, cleaning, comparing differently, and I want to leave that option open to them. Possibly in the future some of this opinionated code I may want to move to a separate repo/package, but for now I think it makes sense to keep it on the core here so we can move fast and figure out the APIs. So keep doing what you're doing, but make sure your implementation keeps the door open for other people to add their own address processing stuff. Thank you!
I agree that it's probably best to move some of this out of the core functionality, particularly if it enforces the use of a specific heavy-dependency such as postal
.
For now, I'll move a lot of the functionality that doesn't follow the API of the rest of the address module into my own codebase and let you know when what's left is ready for review
@NickCrews I'm happy for this to be reviewed again whenever it is convenient for you
@jstammers sorry it took me so long to get to this. I have a lot of small requests, but the general approach looks great and I think will be a great addition! Thanks for taking the time to address my concerns.
The largest things are
@NickCrews thanks for the feedback! Let me know what you think of these changes. At some point, I'd like to dig into the address parsing a little further. I've tested it out with a Japanese postal address and it seems like the parsed address isn't consistent with the Japanese address structure.
Following this example, I can see that there are indeed multiple postal components returned by the parser
from postal.parser import parse_address
parse_address("〒150-2345 東京都渋谷区本町2丁目4-7サニーマンション203", language="jp")
[('〒150-2345', 'postcode'),
('東', 'state'),
('京', 'city'),
('都渋谷', 'city_district'),
('区', 'city'),
('本町2丁目', 'suburb'),
('4-7', 'house_number'),
('サニーマンション203', 'house')]
Which is very different to the result I get when using the same address converted into english using the Japanese PO guidelines
parse_address("""Sunny Mansion #203
4-7 Hommachi 2-choume
Shibuya-ku, TOKYO 150-2345""")
[('sunny mansion', 'house'),
('#203', 'unit'),
('4-7', 'house_number'),
('hommachi', 'house'),
('2-choume', 'suburb'),
('shibuya-ku', 'city_district'),
('tokyo', 'city'),
('150-2345', 'postcode')]
I suspected this is related to the parser model not being sufficiently trained on japanese addresses, but perhaps one to note for future reference.
Looking at some of the CI runs, I can see that getting postal
to build and install properly is going to cause further headaches. I'll look into replicating the build steps for Mac OS X and Windows. If anyone knows of a convenient way to infer the OS so that the corresponding build script for postal could be run, that would be super helpful
I think it's fine if we only test postal on one OS, whichever is easiest. Then pytest.importorskip postal: https://docs.pytest.org/en/8.1.x/how-to/skipping.html#skipping-on-a-missing-import-dependency
@jstammers @lmores what do you think about renaming hash_address() to fingerprint_address()? Hash has a very specific connotation in CS (one way function that results in an int), and this isn't quite it. The downside is then this doesn't match the name of the postal function.
Also, what about prefixing/postfixinh these function names with postal, so it leaves the door open for us to add other implementations that do the same thing? Ie with the usaddress pypi package? I'm 50/50 on this idea
@NickCrews Here are my two cents:
a) If we do not go for the prefix/postfix, then renaming hash_address() to fingerprint_address() makes sense b) If we introduce "postal" as prefix/postfix I would suggest to use it as a prefix and stick to postal_hash_address() without renaming (though we could rename it in the future if we introduce another "address backend" to have uniform names after the prefix).
I am slightly in favor of using "postal" as prefix and follow (b), but it is not actually a strong preference.
I'm leaning towards using a postal prefix/postfix to allow for future functions that implement address parsing/fingerprinting
Sounds good, let's go with the postal prefix! I actually lean decently towards postal_fingerprint_address instead of postal_hash_address, unless either of you have strong feelings against that, let's do that.
@NickCrews it seems that including postal
within the dev dependencies is causing the CI step to fail to install the dev dependencies unless postal
can be built. Would this be better off as an optional dependency?
This PR implements a method parse an address string using postal.
As this is a fairly heavy dependency particularly given the model data required by
libpostal
, it is included as an optional package.A comparison function is also implemented that classifies the level of comparison between each field in the address. Currently, this returns a
StructColumn
, but it may be more convenient to return multiple columns usingt.lift()
, or define aComparer
to classify the total level of agreement between the two addresses.Since
postal
only accepts scalar values for parsing and classifying dupes, it's likely that these functions might scale poorly when dealing with millions of records.Finally, I have included a function to calculate the Levenshtein ratio between two strings that replicates the implementation here