Toblerity / Fiona

Fiona reads and writes geographic data files
https://fiona.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Field Name encode error when writing to Mapinfo File #1399

Open alexJhao opened 1 month ago

alexJhao commented 1 month ago

Expected behavior and actual behavior.

I read data from a GPKG file and then write it to a MapInfo TAB file. The field names end up incorrectly encoded.

>>> with fiona.open(gpkg_fp,'r') as src:
...     driver=src.driver
...     crs=src.crs
...     schema=src.schema
...     feat=src[1]
... 
>>> driver
'GPKG'
>>> crs
CRS.from_epsg(4326)
>>> schema
{'properties': {'地市': 'str:80', '区县': 'str:80', '商业街名称': 'str:80'}, 'geometry': 'Polygon'}

>>> with fiona.open(r'e:\temp\7\a.tab','w',driver='MapInfo File',crs=crs,schema=schema) as dst:
...     dst.write(feat)
... 
>>> with fiona.open(r'e:\temp\7\a.tab','r') as dst1:  
...     new_schema=dst1.schema
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "D:\miniconda3\envs\geo_py3b5\Lib\site-packages\fiona\collection.py", line 293, in schema
    self._schema = self.session.get_schema()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "fiona\\ogrext.pyx", line 761, in fiona.ogrext.Session.get_schema
UnicodeDecodeError: 'gbk' codec can't decode byte 0xd0 in position 3: incomplete multibyte sequence

I opened the a.tab file in binary mode and found that the field names had changed.

>>> with open(r'e:\temp\7\a.tab','rb') as f:
...     context=f.read()
... 
>>> context
b'!table\n!version 300\n!charset Neutral\n\nDefinition Table\n  Type NATIVE Charset "Neutral"\n  Fields 3\n    _\xd8\xca\xd0 Char (80) ;\n    \xc7\xf8\xcf\xd8 Char (80) ;\n    \xc9\xcc\xd2__\xd6\xc3\xfb_\xc6 Char (80) ;\n'
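The raw field names can be pulled out of a header like the one above without decoding it. The following is a minimal sketch, assuming the header layout seen in this dump (a `Fields N` line followed by N definition lines); `TAB_HEADER` and `raw_field_names` are hypothetical names introduced only for illustration.

```python
# Header bytes copied from the binary dump in this report.
TAB_HEADER = (
    b'!table\n!version 300\n!charset Neutral\n\nDefinition Table\n'
    b'  Type NATIVE Charset "Neutral"\n  Fields 3\n'
    b'    _\xd8\xca\xd0 Char (80) ;\n'
    b'    \xc7\xf8\xcf\xd8 Char (80) ;\n'
    b'    \xc9\xcc\xd2__\xd6\xc3\xfb_\xc6 Char (80) ;\n'
)


def raw_field_names(header: bytes) -> list[bytes]:
    """Return the undecoded field names listed after the 'Fields N' line."""
    lines = header.split(b"\n")
    for idx, line in enumerate(lines):
        tokens = line.split()
        if len(tokens) == 2 and tokens[0] == b"Fields":
            count = int(tokens[1])
            # The first whitespace-separated token of each definition is the name.
            return [lines[idx + 1 + i].split()[0] for i in range(count)]
    return []


for name in raw_field_names(TAB_HEADER):
    print(name)
```

Working on bytes rather than text sidesteps the decode error, since the damaged names are no longer valid in any multi-byte encoding.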

In the TAB file, the three field names are:

_\xd8\xca\xd0
\xc7\xf8\xcf\xd8
\xc9\xcc\xd2__\xd6\xc3\xfb_\xc6

but the original field names are '地市', '区县', '商业街名称', and their 'ansi' encodings are:

>>> [x.encode('ansi') for x in ['地市','区县','商业街名称']]
[b'\xb5\xd8\xca\xd0', b'\xc7\xf8\xcf\xd8', b'\xc9\xcc\xd2\xb5\xbd\xd6\xc3\xfb\xb3\xc6']

b'\xb5\xd8\xca\xd0' becomes b'_\xd8\xca\xd0'
b'\xc9\xcc\xd2\xb5\xbd\xd6\xc3\xfb\xb3\xc6' becomes b'\xc9\xcc\xd2__\xd6\xc3\xfb_\xc6'
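The pattern in these bytes can be reproduced with a small sketch. It assumes a rule inferred only from this report: every byte in the range 0x80 to 0xBF is replaced by an underscore, while bytes of 0xC0 and above are kept. The `launder` helper is hypothetical and is not GDAL's actual implementation; it uses the `gbk` codec, which yields the same bytes as the 'ansi' output above for these names.

```python
def launder(raw: bytes) -> bytes:
    """Replace every byte in 0x80-0xBF with b'_' (rule inferred from this report)."""
    return bytes(0x5F if 0x80 <= b < 0xC0 else b for b in raw)


names = ["地市", "区县", "商业街名称"]
for name in names:
    encoded = name.encode("gbk")
    damaged = launder(encoded)
    try:
        status = repr(damaged.decode("gbk"))
    except UnicodeDecodeError as exc:
        # Half of a two-byte character was overwritten, so decoding fails.
        status = f"undecodable: {exc.reason}"
    print(encoded, "->", damaged, status)
```

Because GBK characters are two bytes wide and the second byte often falls below 0xC0, a byte-wise rule of this kind splits characters in half, which would explain the "incomplete multibyte sequence" error when the schema is read back.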

Operating system

Win10

Fiona and GDAL version and provenance

python 3.11.5 GDAL 3.6.2 fiona 1.9.6

sgillies commented 1 month ago

Hi @alexJhao. From the information you gave

python 3.11.5
GDAL 3.6.2
fiona 1.9.6

it looks like you have built Fiona from its source. Is that true? The Windows distributions on pypi.org have GDAL version 3.8.4.

Can you check to see that your GDAL library was built with support for the iconv library that provides internationalization support?

alexJhao commented 1 month ago

I am not sure whether Fiona was built from source. I installed it with "conda install geopandas". I also searched the GDAL documentation for iconv. It says the iconv work was completed in the 1.6.0 release: https://gdal.org/development/rfc/rfc23_ogr_unicode.html#encoding-names

sgillies commented 1 month ago

@alexJhao thank you. I don't use MapInfo and am not an expert on the format, so I hope I do not lead you off course. I wonder if you need to use the encoding option when creating the MapInfo dataset? See https://gdal.org/drivers/vector/mitab.html#layer-creation-options. For example, like

>>> with fiona.open(r'e:\temp\7\a.tab', 'w', driver='MapInfo File', crs=crs, schema=schema, encoding='GBK') as dst:
...     dst.write(feat)

Or maybe UTF-8 would be better. I'm not sure.

alexJhao commented 1 month ago

I had tried UTF-8 as well, with the same result 😓

I wrote a script to work around it temporarily.

from pathlib import Path

import geopandas as gpd


def gdf2Tab(data: gpd.GeoDataFrame, filename: str, encoding="cp936"):
    """Work around the field name encoding error in MapInfo TAB files.

    Args:
        data (gpd.GeoDataFrame): gdf data
        filename (str): output file name, must end in ".tab"
        encoding (str, optional): same as gpd.to_file. Defaults to "cp936".

    """
    assert isinstance(data, gpd.GeoDataFrame)
    tab_fp = Path(filename)
    assert tab_fp.suffix.lower() == ".tab"

    columns = data.columns.tolist()
    columns.remove("geometry")
    data.to_file(filename=filename, driver="MapInfo File", encoding=encoding)

    tmp_tab_fp = tab_fp.parent / ("tmp_" + tab_fp.name)

    # read all lines of the TAB file as bytes
    with open(filename, "rb") as source:
        new_lines = source.readlines()

    # locate the "Fields N" line; the field definitions start on the next line
    line_no_s = -1
    field_count = 0
    for idx, line in enumerate(new_lines):
        if line.find(b"Fields") > -1:
            line_no_s = idx + 1
            field_count = int(line.split()[1])
            break

    # overwrite each laundered field name with the correctly encoded name
    for field_idx in range(field_count):
        line_no = line_no_s + field_idx
        line_byte = new_lines[line_no].split(b" ")

        # skip the leading indentation: the first non-empty token is the name
        field_bytes_idx = 0
        for tmp_j in range(len(line_byte)):
            if line_byte[tmp_j] != b"":
                field_bytes_idx = tmp_j
                break
        if len(line_byte[field_bytes_idx]) <= 0:
            break
        # encode with `encoding` so the name bytes match the data written above
        line_byte[field_bytes_idx] = columns[field_idx].encode(encoding)

        new_lines[line_no] = b" ".join(line_byte)

    with open(tmp_tab_fp, "wb") as target:
        target.writelines(new_lines)

    tmp_tab_fp.replace(filename)

    return True

sgillies commented 3 weeks ago

On the Fiona main branch I see a KeyError when I try to reproduce with the following code:

import fiona
from fiona.crs import CRS


def test_issue1399(tmp_path):
    """Test schema encoding issue reported in #1399."""
    schema = {
        "properties": {"地市": "str:80", "区县": "str:80", "商业街名称": "str:80"},
        "geometry": "Polygon",
    }
    with fiona.open(
        tmp_path / "a.tab",
        "w",
        driver="MapInfo File",
        crs=CRS.from_epsg(4326),
        schema=schema,
    ) as colxn:
        pass

fiona/collection.py:682: in __exit__
    self.close()
fiona/collection.py:659: in close
    self.flush()
fiona/collection.py:649: in flush
    self.session.sync(self)
fiona/ogrext.pyx:1707: in fiona.ogrext.WritingSession.sync
    gdal_flush_cache(cogr_ds)
fiona/ogrext.pyx:86: in fiona.ogrext.gdal_flush_cache
    with cpl_errs:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise exception_map[err_no](err_type, err_no, msg)
E   KeyError: 502

fiona/_err.pyx:196: KeyError

Logs mention MapInfo charset. The 502 error code is specific to MapInfo and is not one of the usual GDAL error codes that Fiona expects.

------------------------------------------ Captured log call ------------------------------------------
DEBUG    fiona._env:env.py:315 GDAL data files are available at built-in paths.
DEBUG    fiona._env:env.py:315 PROJ data files are available at built-in paths.
DEBUG    fiona.ogrext:collection.py:229 File doesn't exist. Creating a new one...
WARNING  fiona._env:collection.py:229 Cannot find MapInfo charset corresponding to iconv GBK encoding
DEBUG    fiona._env:env.py:315 GDAL data files are available at built-in paths.
DEBUG    fiona._env:env.py:315 PROJ data files are available at built-in paths.
WARNING  fiona._env:collection.py:229 Cannot find MapInfo charset corresponding to iconv GBK encoding
DEBUG    fiona._env:env.py:315 GDAL data files are available at built-in paths.
DEBUG    fiona._env:env.py:315 PROJ data files are available at built-in paths.
DEBUG    fiona.ogrext:collection.py:229 Created layer a
DEBUG    fiona.ogrext:collection.py:229 Writing started
DEBUG    fiona._env:env.py:315 GDAL data files are available at built-in paths.
DEBUG    fiona._env:env.py:315 PROJ data files are available at built-in paths.
INFO     fiona._env:collection.py:649 Unknown error number 502.
INFO     fiona._env:collection.py:649 Unknown error number 502.
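The KeyError in the traceback above comes from indexing a dictionary with an error number it does not contain. One way to make such a lookup tolerant of driver-specific codes is sketched below. This is not Fiona's actual code: `UnknownGDALError`, `NotSupportedError`, `make_exception`, and the one-entry map are hypothetical stand-ins chosen for illustration.

```python
class UnknownGDALError(Exception):
    """Hypothetical fallback for error numbers missing from the map."""


class NotSupportedError(Exception):
    """Hypothetical stand-in for one of the mapped exception classes."""


# Hypothetical, heavily shortened map used only for this sketch.
exception_map = {6: NotSupportedError}


def make_exception(err_type, err_no, msg):
    # dict.get() with a default avoids the KeyError for codes such as 502.
    exc_class = exception_map.get(err_no, UnknownGDALError)
    return exc_class(err_type, err_no, msg)


err = make_exception(3, 502, "Unknown error number 502.")
print(type(err).__name__, err.args)
```

With a fallback class, the caller would see the driver's message instead of a bare `KeyError: 502`.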

From looking at https://github.com/rouault/gdal/blob/65e177b7e3277bc3f39d64ae44796a8c813f4129/ogr/ogrsf_frmts/mitab/mitab_utils.cpp#L485 and the code below it, I think it's possible that MapInfo doesn't support non-Latin characters for field names. Is that true @rouault ?

rouault commented 3 weeks ago

Is that true @rouault ?

Good question to which I don't know the answer. Maybe @drons who introduced support for encodings in the mapinfo driver knows. Perhaps the "laundering" of characters of code >= 192 done in TABCleanFieldName() in mitab_utils.cpp should be removed when using a charset other than the default neutral one?

drons commented 3 weeks ago

Good question...

TABCleanFieldName came to us to support older versions of MapInfo. I think at the moment we can refuse "laundering" of characters of code >= 192 for non-neutral charset files.

Moreover, modern MapInfo supports UTF-8 encoding, but GDAL doesn't (see the apszCharsets list in mitab_imapinfofile.cpp).
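The idea of skipping the laundering for non-neutral charsets can be sketched in Python as follows. This is only an illustration of the proposal, not GDAL code: `clean_field_name` is a hypothetical helper, the byte rule is the one inferred from the bytes reported in this issue, and the charset name "WindowsSimpChinese" is used as an assumed example value.

```python
def clean_field_name(raw: bytes, charset: str = "Neutral") -> bytes:
    """Launder high bytes only for the neutral charset (hypothetical sketch)."""
    if charset.lower() != "neutral":
        # Multi-byte encodings such as GBK need every byte left intact.
        return raw
    # Rule inferred from this report: bytes 0x80-0xBF become underscores.
    return bytes(0x5F if 0x80 <= b < 0xC0 else b for b in raw)


name = "地市".encode("gbk")
print(clean_field_name(name))                        # neutral: laundered
print(clean_field_name(name, "WindowsSimpChinese"))  # non-neutral: unchanged
```

Under this scheme a file written with an explicit charset would keep its field names byte for byte, while files using the default neutral charset would behave as before.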