Open martijnkriebel opened 1 year ago
I am not sure if a GEF file with Dutch characters would work. For reading GEF files we use the fields described in "Geotechnical exchange format for cpt-data" (GEF-CPT.pdf). If you attach the original GEF file, I can take a closer look.
Hi Martijn,
The CUR standard clearly states that a GEF file should consist only of characters in the ASCII character set (only 128 characters).
The GEF file is parsed as UTF-8, the most widely used encoding on the web, which can represent characters from all languages. For compatibility reasons, the original 128 ASCII characters map to the same bytes in UTF-8.
Your GEF file is probably encoded in cp1252 (ANSI), an extension of ASCII that adds extra characters used in Western European languages. Unfortunately, these special characters map to different byte(s) in UTF-8 and cp1252, because cp1252 is a single-byte encoding and UTF-8 is a multi-byte encoding. In fact, the byte for ö in 'windows-1252' (0xf6) is never a valid byte in UTF-8. That is what causes the problem; otherwise you would just get the wrong character out instead of an error.
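To make the byte-level difference concrete, here is a small sketch using only Python's built-in codecs. The sample word is just an illustration, not taken from your file:

```python
word = "coördinatensysteem"

cp1252_bytes = word.encode("cp1252")  # ö becomes the single byte 0xf6
utf8_bytes = word.encode("utf-8")     # ö becomes the two bytes 0xc3 0xb6

o_cp1252 = "ö".encode("cp1252")
o_utf8 = "ö".encode("utf-8")

# Plain ASCII text produces identical bytes in both encodings.
ascii_same = "coordinatensysteem".encode("cp1252") == "coordinatensysteem".encode("utf-8")

# Decoding the cp1252 bytes as UTF-8 fails, because 0xf6 is not valid in UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

After running this, `decode_failed` is `True` and `ascii_same` is `True`, which matches the behaviour described above.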
An easy fix for you is to open the GEF file in Notepad (Kladblok) and save it as 'UTF-8'. The GEF file will then probably parse correctly, including the ö.
Another fix to try (in Python) is to decode the file using 'utf-8'; if this fails, catch the error, decode the file using cp1252 instead, and then re-encode it using utf-8.
with open('file.gef', 'rb') as fp:
    # Read the bytes once; a second fp.read() would return b''
    raw_bytes = fp.read()
try:
    file_as_string = raw_bytes.decode('utf-8')
    # everything alright, send file to GEOLIB+
except UnicodeDecodeError:
    # File is probably cp1252 with a special character, convert to utf-8
    file_as_string = raw_bytes.decode('cp1252')
file_as_bytes_utf_8 = file_as_string.encode('utf-8')
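If you need this for more than one file, the same idea could be wrapped in a pair of helpers. This is only a sketch: `read_gef_text` and `convert_to_utf8` are hypothetical names, not part of GEOLIB+, and cp1252 as the fallback is an assumption about where the file came from.

```python
from pathlib import Path


def read_gef_text(path, fallback_encoding="cp1252"):
    """Return the file contents as str, trying UTF-8 first (hypothetical helper)."""
    raw_bytes = Path(path).read_bytes()
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the file was saved in a Windows "ANSI" code page.
        # A few byte values are undefined in cp1252 and would still raise here.
        return raw_bytes.decode(fallback_encoding)


def convert_to_utf8(src_path, dst_path):
    """Write a UTF-8 copy of src_path to dst_path (hypothetical helper)."""
    text = read_gef_text(src_path)
    # Write bytes rather than text so the line endings are left untouched.
    Path(dst_path).write_bytes(text.encode("utf-8"))
```

The converted copy can then be passed to the parser instead of the original file.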
Hi Maarten,
Thanks for the detailed explanation! The funny part is that the #DATAFORMAT header of the GEF file says it's ASCII-encoded, as specified in the standard, even though it clearly is not 😄
I remember trying to change the file encoding, but failed back then and switched to a different approach for the project that didn't involve this code. Somehow I currently cannot reproduce the error I initially got, even though I'm parsing the same GEF file which is ANSI-encoded and contains the ö-character. If I encounter the same problem another time I'll try your solutions!
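For what it's worth, a quick way to check whether a file that claims to be ASCII really is ASCII is to look for bytes of 0x80 and above. A rough sketch; `find_non_ascii` is a hypothetical helper and the sample bytes are made up:

```python
def find_non_ascii(raw_bytes):
    """Return (line_number, column, byte_value) for every byte >= 0x80."""
    hits = []
    for line_number, line in enumerate(raw_bytes.split(b"\n"), start=1):
        for column, byte_value in enumerate(line, start=1):
            if byte_value >= 0x80:
                hits.append((line_number, column, byte_value))
    return hits


# Made-up sample: the second line contains a cp1252-encoded ö (0xf6).
sample = b"line one\r\nco\xf6rdinatensysteem\r\n"
hits = find_non_ascii(sample)
```

An empty result means the file is pure ASCII and will decode the same way under both UTF-8 and cp1252.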
Dutch GEF-files may contain special characters, for example the umlaut in the word "coördinatensysteem". This raises the UnicodeDecodeError below when parsing the file, which traces back to codecs.py. Replacing the "ö" with a regular "o" solves the issue.