NickCrews / spacy-address

Parse oneline US addresses using a spaCy NER model trained on OSM data
MIT License

spacy-address

Use spaCy's NER pipeline to parse oneline US addresses

Uses the labeled data from usaddress with spaCy's very easy training infrastructure.

Inspired by the code and blog from https://github.com/swapnil-saxena/address-parser.

Usage

There are currently two models, en-us-address-ner-sm and en-us-address-ner-lg, following the naming conventions for small and large that spaCy uses.

en-us-address-ner-sm

You probably want this one. It is much more efficient and only slightly less accurate.

As of 2024-10-06:

en-us-address-ner-lg

Much larger and slower, a little more accurate.

As of 2024-10-06:

You can find the released models in the project's GitHub releases. There, you can see the most up-to-date model size and F1 score. Unfortunately, the speed isn't reported anywhere convenient.

You can install from a release directly with pip:

python -m pip install "en-us-address-ner-sm @ https://github.com/NickCrews/spacy-address/releases/download/20241029-205717-sm/en_us_address_ner_sm-0.0.0-py3-none-any.whl"
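To confirm that the install worked before loading the model, a quick check like the following may help. This is a minimal sketch: `is_model_installed` is a hypothetical helper, not part of this package, and it only assumes the importable module name `en_us_address_ner_sm` used in the example further down.

```python
import importlib.util


def is_model_installed(package_name: str = "en_us_address_ner_sm") -> bool:
    """Return True if a top-level package with this name can be found.

    Hypothetical helper for illustration; it checks importability only
    and does not load the spaCy pipeline.
    """
    return importlib.util.find_spec(package_name) is not None


# Prints True once the wheel is installed in the active environment.
print(is_model_installed())
```

If this prints False, the wheel was probably installed into a different environment than the one running Python.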

Now the model is accessible from Python:

import spacy

nlp = spacy.load("en-us-address-ner-sm")
doc = nlp("CO John SMITH, 123 E St elias stree S,   Oklahoma City, OK 99507-1234")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
# CO John SMITH (Recipient)
# 123 (AddressNumber)
# E (StreetNamePreDirectional)
# St elias (StreetName)            # St isn't confused as an abbreviation for street!
# stree (StreetNamePostType)       # Typos are tagged correctly!
# S (StreetNamePostDirectional)
# Oklahoma City (PlaceName)        # Oklahoma isn't confused as a state!
# OK (StateName)
# 99507-1234 (ZipCode)

# For convenience I include the labels for autocomplete, IDE support, etc.
from en_us_address_ner_sm import labels
[ent.text for ent in doc.ents if ent.label_ == labels.StateName]
# ['OK']

This uses the tags from the "United States Thoroughfare, Landmark, and Postal Address Data Standard (Publication 28)". See labels.py for details.
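If you want the parsed components as a single record rather than a flat list of entities, one option is to group the entity texts by label. The sketch below is an illustration only: `group_by_label` is a hypothetical helper, not something this package provides, and the sample pairs are copied from the example output above so the snippet runs without the model installed.

```python
from collections import defaultdict
from typing import Iterable


def group_by_label(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (text, label) pairs into {label: [text, ...]}, keeping order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# With the model loaded, the pairs would come from the doc:
#   group_by_label((ent.text, ent.label_) for ent in doc.ents)
sample = [
    ("123", "AddressNumber"),
    ("E", "StreetNamePreDirectional"),
    ("St elias", "StreetName"),
    ("Oklahoma City", "PlaceName"),
    ("OK", "StateName"),
    ("99507-1234", "ZipCode"),
]
parsed = group_by_label(sample)
print(parsed["PlaceName"])  # ['Oklahoma City']
```

Values are lists because nothing guarantees a label appears only once in a given address.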

Goals

I have tried various probabilistic address parsers/taggers. None of them quite suited my needs. Here is what I was aiming for

Comparison vs Peers

Here is an incomplete list of how this project compares with some other projects I've tried:

Libpostal

PyPostal

The python bindings to libpostal.

USAddress

License

Released under the MIT license.