GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning
https://deepparse.org/
GNU Lesser General Public License v3.0

PO Boxes #140

Closed: crtnx closed this issue 1 year ago

crtnx commented 2 years ago

Dear friends,

the parser works well for generic street addresses, but when I tried to parse a US PO Box address, it failed:

parsed_address = address_parser("PO Box 40070 Nashville TN 37204")

[('40070', 'StreetNumber'), ('po box', 'StreetName'), (None, 'Unit'), ('nashville', 'Municipality'), ('tn', 'Province'), ('37204', 'PostalCode'), (None, 'Orientation'), (None, 'GeneralDelivery'), (None, 'EOS')]

Any plans to improve the training dataset? As far as I remember libpostal works well with PO Boxes and could generate PO Box addresses...

github-actions[bot] commented 2 years ago

Thank you for your interest in improving Deepparse.

davebulaval commented 2 years ago

Hi @crtnx,

The problem is that we don't have enough examples in the dataset. Only two PO box examples exist in the training dataset and three in the test set (for US, Canada and UK addresses). Without more examples, it would be impossible to retrain our model to improve it.

Do you have some annotated examples?

Libpostal must use handwritten rules or something like it to solve these cases since we use a similar dataset.

crtnx commented 2 years ago

Hi @davebulaval,

thanks for the explanation. As OSM data has no PO Box addresses, libpostal has built-in code that generates fake PO Boxes and replaces portions of real street addresses with them. I could generate enough annotated PO Box examples for you using the libpostal geodata generator; however, please define the following:

davebulaval commented 2 years ago

We have the "GeneralDelivery" tag that could be used for this. Otherwise, we will need to introduce a new tag. What do you think (@MAYAS3 also)? I don't know if other countries use PO boxes. I know that Canada and the USA use them, and I'm pretty sure the UK does too, but other than that, more research needs to be done. Usually, a thousand or so examples per country is more than enough. If you provide the dataset, I will use our training loop to add this data with the rest of our training dataset. The best format is a pickle file similar to our public dataset: a list of tuples (address, [tags list]).
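
For illustration, here is a minimal sketch of what such a pickle file could contain. The two records and the helper names `check_examples` and `save_examples` are made up for this example, and tagging every token of the PO box as `GeneralDelivery` is only one of the options discussed in this thread, not a settled convention.

```python
import pickle

# Made-up records in the (address, [tags list]) shape described above.
# Assumption: one tag per whitespace-separated token, and PO box tokens
# tagged as GeneralDelivery (the option floated in this thread).
examples = [
    (
        "po box 40070 nashville tn 37204",
        ["GeneralDelivery", "GeneralDelivery", "GeneralDelivery",
         "Municipality", "Province", "PostalCode"],
    ),
    (
        "p.o. box 1311 toronto m6c 1c0",
        ["GeneralDelivery", "GeneralDelivery", "GeneralDelivery",
         "Municipality", "PostalCode", "PostalCode"],
    ),
]


def check_examples(records):
    """Raise ValueError if an address does not have exactly one tag per token."""
    for address, tags in records:
        if len(address.split()) != len(tags):
            raise ValueError(f"token/tag count mismatch for: {address!r}")


def save_examples(records, path):
    """Validate the records, then write them to a pickle file."""
    check_examples(records)
    with open(path, "wb") as handle:
        pickle.dump(records, handle)
```

Calling `save_examples(examples, "po_boxes.p")` would write the file; the file name is arbitrary.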

MAYAS3 commented 2 years ago

We could use GeneralDelivery indeed! Perhaps even rename it to something like PO Box since it doesn't really represent an essential tag for the original dataset. We could also go with more granularity and have an additional tag for the PO Box number.

crtnx commented 2 years ago

Hi there @davebulaval, @MAYAS3,

sorry for the delay, working with the libpostal generator is not an easy task... In short, I was able to generate 89550 PO Box addresses for the US, UK and CA. Unfortunately the data is in the native libpostal format, so some massaging is needed to convert it to whatever format you need. The data is contained in a .TSV file where each line looks like this:

en ca P.O./po_box Box/po_box #/po_box 1311/po_box |/FSEP M6C/postcode 1C0/postcode |/FSEP Toronto/city ,/SEP Canada/country

where the first field is a language, the second one is a 2-letter country code, and the third one is a labelled example. As you can see, libpostal has its own 'po_box' tag for labeling PO boxes. There are also some extra tags you might not need, like 'country', 'suburb', 'FSEP', 'SEP', etc.

It should be quite easy to convert libpostal examples to your format. It will take a while if you want me to do this task, as I am running out of time. Or, if you could do it on your side, let me know and I will upload the file for you.
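
As a rough sketch of that conversion, the function below turns one line in the format described above into an (address, [tags list]) tuple. The label mapping is an assumption: only `po_box` to `GeneralDelivery` was suggested in this thread, the other pairs are guesses, and the function name `convert_line` is made up. It also assumes the three fields are tab-separated and that each token is written as `token/label`.

```python
# Hypothetical mapping from libpostal labels to Deepparse tags.
# Only po_box -> GeneralDelivery comes from this thread; the rest is assumed.
LIBPOSTAL_TO_DEEPPARSE = {
    "po_box": "GeneralDelivery",
    "house_number": "StreetNumber",
    "road": "StreetName",
    "unit": "Unit",
    "city": "Municipality",
    "state": "Province",
    "postcode": "PostalCode",
}

# Separator labels carry no address content and are skipped.
SEPARATOR_LABELS = {"FSEP", "SEP"}


def convert_line(line):
    """Convert one libpostal TSV line into (language, country, (address, tags))."""
    language, country, labelled = line.rstrip("\n").split("\t", 2)
    tokens, tags = [], []
    for item in labelled.split():
        # Split on the last slash so tokens containing "/" stay intact.
        token, _, label = item.rpartition("/")
        if label in SEPARATOR_LABELS:
            continue
        tag = LIBPOSTAL_TO_DEEPPARSE.get(label)
        if tag is None:
            continue  # no Deepparse counterpart (e.g. country, suburb)
        tokens.append(token)
        tags.append(tag)
    return language, country, (" ".join(tokens), tags)


sample = (
    "en\tca\tP.O./po_box Box/po_box #/po_box 1311/po_box |/FSEP "
    "M6C/postcode 1C0/postcode |/FSEP Toronto/city ,/SEP Canada/country"
)
print(convert_line(sample))
```

Dropping unmapped labels keeps the token and tag lists the same length, which is what the (address, [tags list]) format needs; whether tokens such as `#` should be kept or lowercased is a decision left open here.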

davebulaval commented 2 years ago

I've just spoken with @MAYAS3, and he will handle the conversion. After that, I will handle the training of the new models.

It will take some time, since I will first use our API to retrain the current model to see if it helps with PO boxes as a quick fix. Then, I will use our research code to train new models from scratch to maximize performance. Training all models from scratch could take a month or two.

crtnx commented 2 years ago

Guys, I've just uploaded the file to this location: https://drive.google.com/file/d/18F9PbU6KPHKAiOevj5x8tWsUErYb6AbB/view?usp=sharing. The link will be accessible for a week, so please download the file as soon as possible. And good luck with retraining the model.

davebulaval commented 2 years ago

Perfect! We will keep you updated.

MAYAS3 commented 2 years ago

Thanks @crtnx !

I've been able to download the file successfully. We'll keep you posted!

davebulaval commented 2 years ago

Hi @crtnx,

I will start, this week or so, a fine-tuning process for the three countries provided and will release a new model and a new table with performance.

Also, for a larger performance increase, I have manually created a list of countries that use PO boxes and the typical way each one writes them in an address. Is it possible for you to also add these countries? For each country, here is how the PO box is written, along with an example.

Training

Zero-shot testing

crtnx commented 2 years ago

Hi @davebulaval,

thanks for the heads up, I am glad to be helpful. Unfortunately I am pretty busy these days and not sure when I'll have time to work on the other countries. Hopefully sooner rather than later ;)

I would like to comment on the per-country ways of writing PO Boxes. Thanks for providing this info; however, libpostal has its own way of formatting PO Boxes that I cannot deviate from, and I think it is good enough.

Here is the configuration snippet for Czech:

po_boxes:
    postovni_prihradka: &postovni_prihradka
        canonical: poštovní přihrádka
        sample: true
        canonical_probability: 0.8
        sample_probability: 0.2
        numeric:
            direction: left
            add_number_phrase: true
            add_number_phrase_probability: 0.2 # poštovní přihrádka 1234
    alphanumeric:
        default: *postovni_prihradka
        numeric_probability: 0.9 # poštovní přihrádka 123
        alpha_probability: 0.05 # poštovní přihrádka A
        numeric_plus_alpha_probability: 0.04 # poštovní přihrádka 123G
        alpha_plus_numeric_probability: 0.01 # poštovní přihrádka A123
        alpha_plus_numeric:
            whitespace_probability: 0.1
        numeric_plus_alpha:
            whitespace_probability: 0.1

and for France:

po_boxes:
    boite_postal: &boite_postal
        canonical: boîte postale
        abbreviated: bp
        sample: true
        canonical_probability: 0.3
        abbreviated_probability: 0.5
        sample_probability: 0.2
        numeric:
            direction: left
            add_number_phrase: true
            add_number_phrase_probability: 0.2 # BP No 1234
        numeric_probability: 1.0
    course_speciale: &course_speciale
        canonical: course spéciale
        abbreviated: cs
        sample: true
        canonical_probability: 0.3
        abbreviated_probability: 0.5
        sample_probability: 0.2
        numeric:
            direction: left
            add_number_phrase: true
            add_number_phrase_probability: 0.2 # BP No 1234
        numeric_probability: 1.0
    tri_service_arivee: &tri_service_arivee
        canonical: tri service arrivée
        abbreviated: tsa
        sample: true
        canonical_probability: 0.3
        abbreviated_probability: 0.5
        sample_probability: 0.2
        numeric:
            direction: left
            add_number_phrase: true
            add_number_phrase_probability: 0.2 # BP No 1234
        numeric_probability: 1.0
    case_postal: &case_postal
        canonical: case postale
        abbreviated: cp
        sample: true
        canonical_probability: 0.3
        abbreviated_probability: 0.5
        sample_probability: 0.2
        numeric:
            direction: left
            add_number_phrase: true
            add_number_phrase_probability: 0.2 # CP No 1234
        numeric_probability: 1.0
    alphanumeric:
        sample: false
        default: *boite_postal
        numeric_probability: 0.9 # BP 123
        alpha_probability: 0.05 # BP A
        numeric_plus_alpha_probability: 0.04 # BP 123G
        alpha_plus_numeric_probability: 0.01 # BP A123
        alpha_plus_numeric:
            whitespace_probability: 0.1
        numeric_plus_alpha:
            whitespace_probability: 0.1
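
To make the numbers above concrete, here is a simplified sketch of how such probabilities could drive a generator. It is not libpostal's actual code: the interpretation of each key is assumed from its name and the inline comments, the 0.2 `sample` branch and the optional number phrase ("No") are ignored, and the function names are made up.

```python
import random
import string

# Values copied from the French snippet above (boite_postal and alphanumeric).
PHRASES = ["boîte postale", "bp"]   # canonical, abbreviated
PHRASE_WEIGHTS = [0.3, 0.5]         # the 0.2 "sample" branch is ignored here

ID_KINDS = ["numeric", "alpha", "numeric_plus_alpha", "alpha_plus_numeric"]
ID_WEIGHTS = [0.9, 0.05, 0.04, 0.01]
WHITESPACE_PROBABILITY = 0.1


def sample_box_id(rng):
    """Draw a box identifier such as 123, A, 123G or A123."""
    kind = rng.choices(ID_KINDS, weights=ID_WEIGHTS)[0]
    number = str(rng.randint(1, 9999))
    letter = rng.choice(string.ascii_uppercase)
    if kind == "numeric":
        return number
    if kind == "alpha":
        return letter
    # Mixed identifiers occasionally get a space between the two parts.
    separator = " " if rng.random() < WHITESPACE_PROBABILITY else ""
    if kind == "numeric_plus_alpha":
        return number + separator + letter
    return letter + separator + number


def sample_po_box(rng):
    """Build a PO box string; "direction: left" is read as phrase before id."""
    phrase = rng.choices(PHRASES, weights=PHRASE_WEIGHTS)[0]
    return f"{phrase} {sample_box_id(rng)}"


rng = random.Random(0)  # seeded so repeated runs give the same samples
for _ in range(5):
    print(sample_po_box(rng))
```

`random.choices` renormalizes the weights, so the two phrase weights do not need to sum to one.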

Again, I don't know if all the countries on your list are covered, but I am hoping for the best.

davebulaval commented 2 years ago

@crtnx Ohh I did not know that. Do your best to get most countries on the list! We will work with that. There is no rush.

davebulaval commented 2 years ago

Updates:
11/08: Yeah, performance is not as good as expected after a fine-tuning procedure. It barely reaches 80%. I think there is not enough data. I will try another training loop with a warm-up on the PO box data.
12/08: I've reached better performance with warm-up training, but it lowers performance on the rest of the dataset by 10%. I will try to mitigate the decrease in performance in the next few days.

davebulaval commented 2 years ago

@crtnx So far, there is a regression of about 10% in the parser's performance on non-PO-box addresses. Also, both models' performance on PO box addresses is not as good as expected. We will not release a new version of the model until we get more addresses.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity.
Stale issues will automatically be closed 30 days after being marked Stale.