datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

False positives #307

Closed: ajabini closed this issue 3 years ago

ajabini commented 3 years ago

I was wondering why the model produces so many false positives. Examples: "No that is it!", "sure, the main one is jacramer12@gmail.com", "yes please". I tried training it on the examples I found in my dataset, but I'm afraid this imbalance is more severe. Any suggestions?

ajabini commented 3 years ago

I ended up training a BERT-based classifier to pick out the address strings first, and I use this library as the parser for those.
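For anyone landing here with the same problem, the two-stage idea (filter first, parse second) can be sketched roughly as below. This is only an illustration, not the classifier described above: `looks_like_address` is a hypothetical, crude regex heuristic standing in for a trained model, and `parse_if_address` is a made-up helper name. The parser is passed in as a callable so the gate can be swapped for a real classifier later.

```python
import re
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for a trained classifier (NOT BERT): a string is
# treated as a candidate address only if it contains a digit and a common
# street-type word, and does not contain something shaped like an email.
_STREET_WORD = re.compile(
    r"\b(st|street|ave|avenue|rd|road|blvd|dr|drive|ln|lane|way|ct|court)\b",
    re.IGNORECASE,
)
_DIGIT = re.compile(r"\d")
_EMAIL = re.compile(r"\S+@\S+")


def looks_like_address(text: str) -> bool:
    """Crude gate: return True only for strings that resemble a street address."""
    if _EMAIL.search(text):
        return False
    return bool(_DIGIT.search(text) and _STREET_WORD.search(text))


def parse_if_address(
    text: str,
    parser: Callable[[str], T],
    is_address: Callable[[str], bool] = looks_like_address,
) -> Optional[T]:
    """Run `parser` only on strings the gate accepts; return None otherwise."""
    if not is_address(text):
        return None
    return parser(text)
```

With the library installed, `usaddress.tag` could be passed as `parser`, e.g. `parse_if_address(message, usaddress.tag)`. The heuristic will miss addresses without a street-type word (PO boxes, for instance), which is one reason a trained classifier is the better gate when labelled data is available.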