CartoDB / data-services

CARTO internal geocoder PostgreSQL extension

Localization for geocoding #221

Open AbelVM opened 8 years ago

AbelVM commented 8 years ago

We may have different names for the same place, even in the official language(s) of the country. Requiring a strict match to geocode leads to many failures and "holes" in the results. E.g.:

As of today, loading a CSV with the provinces of Spain may produce several holes, always due to accents (accents on uppercase letters are often omitted, so JAEN <> Jaén), optional articles, and the different co-official languages in different regions.

Maybe we should use a fuzzy search, such as tsvector (full-text search) or trigrams.
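A minimal sketch of the trigram option, assuming the `pg_trgm` and `unaccent` extensions can be installed in the database. `geometries_table` and `storedname` are placeholder names, and the input value 'JAEN' is only an illustration:

```sql
-- Both extensions ship with PostgreSQL's contrib modules.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- similarity() returns a value between 0 and 1; higher means closer.
-- unaccent() strips diacritics, so 'JAEN' and 'Jaén' normalize to the same string.
SELECT
  the_geom,
  similarity(unaccent(lower(storedname)), unaccent(lower('JAEN'))) AS score
FROM
  geometries_table
ORDER BY
  score DESC
LIMIT 1;
```

As written this scans the whole table. Storing the unaccented name in its own column and indexing it with `gin_trgm_ops` would probably be needed for real volumes.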

Tsvector sample pseudocode:

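SELECT
  the_geom
FROM
  geometries_table
ORDER BY
  -- plainto_tsquery accepts free text with spaces; to_tsquery would raise a syntax error on it
  ts_rank(to_tsvector(storedname), plainto_tsquery(inputname))
  -- DESC, so that the best-ranked row comes first
  DESC
LIMIT 1;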

It would be much faster if we precomputed a tsvector column in the geometries table.
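A rough sketch of that precomputed column, reusing the placeholder names from the sample above. The column name `name_tsv`, the index name, and the choice of the `simple` text search configuration (no stemming, no stop words, which seems a better fit for place names) are assumptions, not something already in the extension:

```sql
-- Hypothetical column holding the precomputed tsvector.
ALTER TABLE geometries_table
  ADD COLUMN name_tsv tsvector;

-- Fill it once; new or updated rows would need a trigger or a batch refresh.
UPDATE geometries_table
  SET name_tsv = to_tsvector('simple', storedname);

-- A GIN index lets the @@ match operator avoid a sequential scan.
CREATE INDEX geometries_table_name_tsv_idx
  ON geometries_table USING gin (name_tsv);

-- Query against the stored column instead of calling to_tsvector() per row.
SELECT
  the_geom
FROM
  geometries_table
WHERE
  name_tsv @@ plainto_tsquery('simple', 'jaen')
ORDER BY
  ts_rank(name_tsv, plainto_tsquery('simple', 'jaen')) DESC
LIMIT 1;
```

Note that the `simple` configuration lowercases but does not remove accents, so this would still need to be combined with something like `unaccent` to match JAEN against Jaén.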

More comments about this at: https://github.com/CartoDB/dataservices-api/issues/251

cc @ethervoid

AbelVM commented 8 years ago

For testing purposes, this CSV file has a list of official names of municipalities in Alicante province in Spain:

ALC_muni.csv.txt

Most of them are not recognized by CARTO.

ethervoid commented 8 years ago

There is an enhancement proposed for this problem: https://github.com/CartoDB/cartodb/issues/9131

AbelVM commented 8 years ago

Oeeee! oeeee oeeee oeeee!

iriberri commented 8 years ago

Oh, I just saw this issue! @AbelVM, in the past we intended to make the namedplaces search fuzzy. The strings for most of the other geocoding processes are normalized, but not for this one, which makes it perform pretty badly with complex names (accents, hyphens, spaces...). For the other processes, what we do is store a normalized name in the DB and then run a regexp over the input, normalizing it in the same way. I think this could be a nice leapfrog ;-) From Geonames we have a ton of synonyms for each place, but if accents (or other characters) don't match, the lookup will just fail.
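To make the "store a normalized name, normalize the input the same way" idea concrete, here is a hedged sketch. The function `normalize_place_name`, the column `normalized_name`, and the table and index names are all hypothetical, and the exact rules used by the other geocoding processes may differ; it also assumes the `unaccent` extension is available:

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Hypothetical helper: strip accents, lowercase, and collapse every run of
-- characters other than a-z and 0-9 (hyphens, apostrophes, spaces...) into
-- a single space. Non-Latin scripts would need a different rule.
CREATE OR REPLACE FUNCTION normalize_place_name(name text)
RETURNS text AS $$
  SELECT trim(regexp_replace(lower(unaccent(name)), '[^a-z0-9]+', ' ', 'g'));
$$ LANGUAGE sql STABLE;

-- Store the normalized form next to the original name and index it.
ALTER TABLE geometries_table
  ADD COLUMN normalized_name text;

UPDATE geometries_table
  SET normalized_name = normalize_place_name(storedname);

CREATE INDEX geometries_table_normalized_name_idx
  ON geometries_table (normalized_name);

-- Normalize the input with the same function before the exact lookup.
SELECT
  the_geom
FROM
  geometries_table
WHERE
  normalized_name = normalize_place_name('JAEN');
```

The same function could be applied to each Geonames synonym, so that every alternative name is matched in its normalized form.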

AbelVM commented 8 years ago

:+1: for a leapfrog testing different approaches:

I'm game