etalab / geozones

Simple spatial/administrative referential

Wrong encoding for arrondissements #41

Open davidbgk opened 6 years ago

davidbgk commented 6 years ago

See http://www.data.gouv.fr/fr/datasets/geozones/#discussion-5a4e3ae6c751df376c39672a for details.

Probably a missing conversion in extract_french_district, where the property is currently set with 'wikipedia': props['wikipedia']. Proposal:

'wikipedia': props['wikipedia'] and unicodify(
             props['wikipedia'].encode('latin-1').decode('utf-8')) or '',
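To make the proposed conversion concrete, here is a minimal sketch of how this kind of garbling typically arises and how the encode/decode round trip reverses it. It assumes the data is UTF-8 that was wrongly read as Latin-1 somewhere upstream; the variable names are illustrative and not taken from geozones.

```python
# Illustrative only: reproduce the suspected double-encoding, then reverse it.
original = "Château-Thierry"

# UTF-8 bytes wrongly interpreted as Latin-1 produce the garbled form.
garbled = original.encode("utf-8").decode("latin-1")

# The proposed fix: re-encode as Latin-1, then decode the bytes as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repr(repaired))
```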

citizenu03bb commented 4 years ago

Hi, the same problem arises with the town names in geozones-france-2019-0-json.tar.xz. I corrected my imported data with:

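from typing import Optional


def utf8_recode(value: Optional[str], src: str = "latin-1") -> Optional[str]:
    """
    Repair UTF-8 text that was wrongly decoded as ``src``.

    >>> utf8_recode('ChÃ¢teau-Thierry')
    'Château-Thierry'
    >>> utf8_recode('Château-Thierry')
    'Château-Thierry'
    >>> utf8_recode(None)
    """
    # Treat missing and empty values alike.
    if value in (None, ""):
        return None
    # "\xc3" ("Ã") is the usual sign of UTF-8 bytes read as Latin-1.
    if "\xc3" in value:
        return value.encode(src).decode("utf-8")
    return value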
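
One caveat with the "\xc3" check: it only catches characters whose UTF-8 encoding starts with the byte 0xC3 (U+00C0 to U+00FF). A character such as "œ" (U+0153, UTF-8 bytes C5 93) would slip through. A more defensive variant, sketched below under the assumption that all the damage comes from a single UTF-8-read-as-Latin-1 step, attempts the round trip on every value and keeps the input when it fails. The name utf8_recode_safe is hypothetical, and a correctly encoded string that happens to survive the round trip would be altered, so treat it as a starting point only.

```python
from typing import Optional


def utf8_recode_safe(value: Optional[str], src: str = "latin-1") -> Optional[str]:
    """Hypothetical variant: try to undo the mis-decoding, keep the input on failure."""
    if not value:
        return None
    try:
        # Succeeds only if the text is valid UTF-8 once re-encoded as `src`.
        return value.encode(src).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in `src`, or not valid UTF-8 afterwards:
        # assume the value was already correct.
        return value
```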