Special characters in GEF-file raise UnicodeDecodeError

martijnkriebel commented 1 year ago

Dutch GEF-files may contain special characters, for example the umlaut in the word "coördinatensysteem". This raises the UnicodeDecodeError below when parsing the file, which traces back to codecs.py. Replacing the "ö" with a regular "o" solves the issue.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Input In [19], in <cell line: 5>()

      4 cpt_gef = GefCpt()
----> 5 cpt_gef.read(path)
      6 cpt_gef.coordinates

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\cpt_base_model.py:220, in AbstractCPT.read(self, filepath)
    217     raise FileNotFoundError(filepath)
    219 cpt_reader = self.get_cpt_reader()
--> 220 cpt_data = cpt_reader.read_file(filepath)
    221 for cpt_key, cpt_value in cpt_data.items():
    222     setattr(self, cpt_key, cpt_value)

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\gef_cpt\gef_file_reader.py:165, in GefFileReader.read_file(self, filepath)
    164 def read_file(self, filepath: Path) -> dict:
--> 165     return self.read_gef(gef_file=filepath)

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\gef_cpt\gef_file_reader.py:174, in GefFileReader.read_gef(self, gef_file, fct_a)
    172 # read gef file
    173 with open(gef_file, "r") as f:
--> 174     data = f.readlines()
    176 # search NAP
    177 idx_nap = GefFileReader.get_line_index_from_data_starts_with(
    178     code_string=r"#ZID=", data=data
    179 )

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 626: invalid start byte

EleniSmyrniou commented 1 year ago

I am not sure whether a GEF file with Dutch characters would work. For reading GEF files we use the fields described in "Geotechnical exchange format for CPT-data" (GEF-CPT.pdf). If you attach the original GEF file, I can take a closer look.

ghost commented 1 year ago

Hi Martijn,

The CUR standard clearly states that a GEF file should consist only of characters in the ASCII character set (128 characters).
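To check whether a file actually stays within ASCII, one option is to scan the raw bytes for values above 0x7F. This is only a minimal sketch: `find_non_ascii` is a hypothetical helper, and the sample header line is made up for illustration rather than taken from the file in this issue.

```python
def find_non_ascii(raw: bytes) -> list:
    """Return (offset, byte value) for every byte outside the 7-bit ASCII range."""
    return [(offset, value) for offset, value in enumerate(raw) if value > 0x7F]


# Hypothetical header line, encoded the way a cp1252 ("ANSI") editor would save it.
sample = "#COMMENT= coördinatensysteem\n".encode("cp1252")

for offset, value in find_non_ascii(sample):
    print(f"offset {offset}: byte 0x{value:02x}")
```

For a real file, the bytes could be read with `open(path, 'rb')`, and the reported offsets compared with the position mentioned in the UnicodeDecodeError.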

The GEF file is parsed as UTF-8, the most widely used encoding on the web, which can represent characters from all languages. For compatibility reasons, the original 128 ASCII characters map to the same bytes in UTF-8.

Your GEF file is probably encoded in cp1252 (often called ANSI on Windows), an extension of ASCII that adds extra characters used in Western European languages. Unfortunately, these special characters map to different bytes in UTF-8 and cp1252, because cp1252 is a single-byte encoding and UTF-8 is a multi-byte encoding. In fact, the byte for ö in windows-1252 (0xf6) is never a valid byte in UTF-8. That is what causes the problem; otherwise you would simply get the wrong character instead of an error.
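The byte difference is easy to see from a Python prompt. A small sketch using only the standard codecs:

```python
# The same character has different byte representations in the two encodings.
char = "ö"
cp1252_bytes = char.encode("cp1252")  # a single byte
utf8_bytes = char.encode("utf-8")     # two bytes

print(cp1252_bytes)  # b'\xf6'
print(utf8_bytes)    # b'\xc3\xb6'

# Decoding the cp1252 byte as UTF-8 fails, because 0xf6 can never
# start a UTF-8 sequence.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(err.reason)  # invalid start byte
```

This matches the message in the traceback: byte 0xf6 is rejected as an invalid start byte.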

An easy fix is to open the GEF file in Notepad (kladblok) and save it as 'UTF-8'. The GEF file will then probably parse correctly, including the ö.

Another fix to try (in Python) is to decode the file using 'utf-8'; if this fails, catch the error, decode the file using cp1252 instead, and then re-encode it as utf-8.

with open('file.gef', 'rb') as fp:
    # read the bytes once; a second fp.read() would return b''
    raw_bytes = fp.read()

try:
    file_as_string = raw_bytes.decode('utf-8')
    # everything is fine, send the file to GEOLIB+
except UnicodeDecodeError:
    # File is probably cp1252 with special characters, convert to utf-8
    file_as_string = raw_bytes.decode('cp1252')
    file_as_bytes_utf_8 = file_as_string.encode('utf-8')
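The same fallback idea could be wrapped in a helper that writes a UTF-8 copy of the file to disk. This is a sketch under assumptions, not part of GEOLIB+: `read_gef_text` and `convert_gef_to_utf8` are hypothetical names, and cp1252 as the fallback encoding is a guess about how the file was saved.

```python
from pathlib import Path


def read_gef_text(path, fallback="cp1252"):
    """Decode a GEF file as UTF-8, falling back to `fallback` if that fails."""
    raw = Path(path).read_bytes()  # read once, reuse for both attempts
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the file was saved in a Windows "ANSI" code page.
        return raw.decode(fallback)


def convert_gef_to_utf8(src, dst):
    """Write a UTF-8 copy of `src` to `dst` and return `dst` as a Path."""
    text = read_gef_text(src)
    # newline="" writes the text as-is, so the original line endings survive
    with open(dst, "w", encoding="utf-8", newline="") as fp:
        fp.write(text)
    return Path(dst)
```

The converted copy could then be passed to `cpt_gef.read(path)` as in the traceback at the top of this issue. Note that Python's cp1252 codec leaves a few byte values undefined, so the fallback decode can itself raise UnicodeDecodeError for files in yet another encoding.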

martijnkriebel commented 1 year ago

Hi Maarten,

Thanks for the detailed explanation! The funny part is that the #DATAFORMAT header of the GEF file says it's ASCII-encoded, as specified in the standard, even though it clearly is not 😄

I remember trying to change the file encoding, but failed back then and switched to a different approach for the project that didn't involve this code. Somehow I currently cannot reproduce the error I initially got, even though I'm parsing the same GEF file which is ANSI-encoded and contains the ö-character. If I encounter the same problem another time I'll try your solutions!

Deltares / GEOLib-Plus

Special characters in GEF-file raise UnicodeDecodeError #6