spacy-address
Use spaCy's NER pipeline to parse one-line US addresses.
Uses the labeled data from usaddress
with spaCy's very easy training infrastructure.
Inspired by the code and blog from https://github.com/swapnil-saxena/address-parser.
Usage
There are currently two models, en-us-address-ner-sm and en-us-address-ner-lg,
following spaCy's naming convention for small and large models.
en-us-address-ner-sm
You probably want this one: much better efficiency for only slightly worse accuracy.
As of 2024-10-06:
- F1 score for NER of 0.978
- model size of ~5MB
- on my 2021 M1 MacBook Pro, tags 1000 addresses in ~0.2 sec
en-us-address-ner-lg
Much larger and slower, and only a little more accurate.
As of 2024-10-06:
- F1 score for NER of 0.982
- model size of ~420MB
- on my 2021 M1 MacBook Pro, tags 1000 addresses in ~2 sec
You can find the released models in the GitHub releases,
where you can see the most up-to-date model size and F1 score.
The speed isn't reported anywhere easily, unfortunately.
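If you want numbers for your own hardware, here is a minimal sketch for timing the small model yourself (it assumes the model is installed per the instructions below, and the sample address is made up):

import time

import spacy

nlp = spacy.load("en-us-address-ner-sm")
# A made-up workload: one address repeated 1000 times.
addresses = ["123 E St elias stree S, Oklahoma City, OK 99507-1234"] * 1000

start = time.perf_counter()
docs = list(nlp.pipe(addresses))  # nlp.pipe() batches documents for throughput
print(f"tagged {len(docs)} addresses in {time.perf_counter() - start:.2f} sec")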
You can install from a release directly with pip:
python -m pip install "en-us-address-ner-sm @ https://github.com/NickCrews/spacy-address/releases/download/20241029-205717-sm/en_us_address_ner_sm-0.0.0-py3-none-any.whl"
Once installed, the model is loadable from Python:
import spacy
nlp = spacy.load("en-us-address-ner-sm")
doc = nlp("CO John SMITH, 123 E St elias stree S, Oklahoma City, OK 99507-1234")
for ent in doc.ents:
print(f"{ent.text} ({ent.label_})")
# CO John SMITH (Recipient)
# 123 (AddressNumber)
# E (StreetNamePreDirectional)
# St elias (StreetName) # St isn't confused as an abbreviation for street!
# stree (StreetNamePostType) # Typos are tagged correctly!
# S (StreetNamePostDirectional)
# Oklahoma City (PlaceName) # Oklahoma isn't confused as a state!
# OK (StateName)
# 99507-1234 (ZipCode)
# For convenience I include the labels for autocomplete, IDE support, etc.
from en_us_address_ner_sm import labels
[ent.text for ent in doc.ents if ent.label_ == labels.StateName]
# ['OK']
This uses the tags from the
"United States Thoroughfare, Landmark, and Postal Address Data Standard (Publication 28)".
See labels.py for details.
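If you want the parsed components as plain data rather than spaCy entities, a helper along these lines works; components() is a hypothetical sketch, not part of the package:

from collections import defaultdict

import spacy

nlp = spacy.load("en-us-address-ner-sm")

def components(address: str) -> dict[str, list[str]]:
    # Group the tagged spans by their Publication 28 label.
    grouped = defaultdict(list)
    for ent in nlp(address).ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

components("123 E St elias stree S, Oklahoma City, OK 99507-1234")
# e.g. {'AddressNumber': ['123'], 'StreetNamePreDirectional': ['E'], 'StreetName': ['St elias'], ...}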
Goals
I have tried various probabilistic address parsers/taggers, and none of them
quite suited my needs. Here is what I was aiming for:
- Speed: I am doing bulk processing, and I want to parse tens of millions of addresses in <10 minutes (we're not there yet; see the timings above).
- Support for PO Boxes.
- Support for fine-grained tagging, e.g. splitting "Aspen Avenue" into ("Aspen", StreetName), ("Avenue", StreetNamePostType).
I need this for entity resolution, where I only really care about the street name (see the sketch after this list).
- Python API
- An installation story that isn't a total pain.
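For instance, here is a sketch of that entity-resolution use case: extracting just the street name from many addresses in bulk. street_names() is hypothetical, and it assumes labels.StreetName exists alongside the labels.StateName shown above:

import spacy
from en_us_address_ner_sm import labels

nlp = spacy.load("en-us-address-ner-sm")

def street_names(addresses: list[str]) -> list[str | None]:
    # Return each address's lowercased street name, or None if none was tagged.
    out = []
    for doc in nlp.pipe(addresses):  # batch inference for speed
        parts = [ent.text for ent in doc.ents if ent.label_ == labels.StreetName]
        out.append(" ".join(parts).lower() or None)
    return out

street_names(["321 Aspen Avenue, Anchorage, AK 99501"])
# e.g. ['aspen']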
Comparison vs Peers
Here is an incomplete list of how this project compares with some other projects
I've tried:
libpostal
- Appears fairly unmaintained. It still works, but development seems limited to keeping the lights on.
- A pain to deploy: you have to build from source with git and gcc; there is no pip package.
- Requires a few GB of data (that figure might not be exact, but it's a lot).
- Not thread-safe, which makes it hard to integrate with other projects.
See https://github.com/Maxxen/duckdb-postal/issues/1
- Uses OpenStreetMap data, with addresses from around the world.
- Uses a slightly different set of tags from this project. They are less USPS-specific
and more applicable to worldwide addresses, but they aren't as granular:
e.g. we tag "NW" as a StreetNamePreDirectional, while they don't go to that level of detail.
We tag post office boxes and rural routes; they don't.
There may be more differences than I list here. I'm open to adopting a different
tagging scheme, but I NEED support for USPS PO boxes, and for separating
the street name ("Aspen") from the street type ("Avenue").
- According to their very in-depth blog post, it is much more specialized
than the other solutions here. It does a lot of hand-written, address-domain-specific
work, like normalizing "st" -> "street" and parsing numbers with custom grammar rules.
- Uses something called an "averaged perceptron", which they claim is much faster than
conditional random fields (what usaddress uses). I don't know what spaCy uses internally...
- I would be very interested in a speed comparison vs us, e.g. building a C app
that does bulk processing.
- They claim ~99.5% accuracy, which is similar to what we achieve.
I would love to test that, though.
pypostal
The Python bindings to libpostal.
- Has the same installation problems as libpostal.
- Only provides a one-by-one inference API; there is no batch API.
- I would be very interested in a speed comparison vs us.
usaddress
- Pure Python, and installable with pip!
- Not under active development, but basic maintenance happens.
- Older architecture (conditional random fields, though that doesn't mean much to me...).
- Uses the same set of tags as us.
- Also uses OpenStreetMap data.
- Tokenizes in Python using hardcoded rules, whereas spaCy ships its own tokenizer as part of the pipeline.
I don't really know the consequences of this.
- I would be very interested in a speed comparison vs us.
- I don't know what their accuracy numbers are.
License
Released under the MIT license.