Unicode Characters in Strings get parsed to strange characters

msimmoteit-neozo commented 5 months ago

Describe the bug

When a string field inside a QVD file contains non-ASCII characters, the parsed value contains characters that are not in the original string.

To Reproduce

Steps to reproduce the behavior:

1. Create a QVD DataFrame that contains non-ASCII characters
2. Load the QVD from the file
3. View the data
4. See the garbled string

>>> df = QvdDataFrame.from_dict({"columns": ["DATA"], "data": [["São Tomé and Príncipe"]]})
>>> df.at(0,"DATA")
'São Tomé and Príncipe'
>>> df.to_qvd("sample.qvd")
>>> df2 = QvdDataFrame.from_qvd("sample.qvd")
>>> df2.at(0,"DATA")
'SÃ£o TomÃ© and PrÃ\xadncipe'

Expected behavior

>>> df2.at(0, "DATA")
'São Tomé and Príncipe'

Additional context

I believe this stems from lines 434 and following in qvd.py. I think one way to fix it would be to collect the raw bytes of the string and convert them with Python's bytes.decode() method.
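
The garbled output above is consistent with each UTF-8 byte being turned into its own character. A minimal sketch that reproduces the same string without PyQvd (this only illustrates the suspected mechanism, it is not the library's actual code):

```python
original = "São Tomé and Príncipe"
raw = original.encode("utf-8")  # 'ã', 'é' and 'í' each take two bytes in UTF-8

# Suspected faulty path: build one character per byte.
per_byte = "".join(chr(b) for b in raw)

# Decoding the byte sequence as a whole restores the text.
whole = raw.decode("utf-8")

print(repr(per_byte))
print(repr(whole))
```

The per-byte result has 24 characters instead of 21, because each of the three accented letters is split into two separate characters.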

MuellerConstantin commented 5 months ago

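Thanks for pointing out this bug. You're right: a byte-encoded string should be decoded as a whole rather than byte by byte. Otherwise the bytes of a multi-byte UTF-8 sequence, which belong together, are each interpreted as a separate character, and that leads to this error. Collecting the bytes up to the null terminator and decoding them in one step avoids it: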

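byte_data = bytearray()

# Collect the raw bytes up to the null terminator
while symbol_buffer[pointer] != 0:
    byte_data.append(symbol_buffer[pointer])
    pointer += 1

# Decode the whole sequence at once so multi-byte characters stay intact
value = byte_data.decode(encoding="utf-8")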
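
For anyone who wants to try the idea in isolation, here is a self-contained sketch of the same approach. The function name `read_cstring` and the sample buffer are made up for illustration and are not part of PyQvd. It assumes the strings in the buffer are null-terminated UTF-8, as the loop above implies.

```python
def read_cstring(symbol_buffer: bytes, pointer: int) -> tuple:
    """Read a null-terminated UTF-8 string starting at `pointer`.

    Returns the decoded string and the index just past the terminator.
    Hypothetical helper for illustration, not PyQvd's API.
    """
    # Locate the terminator; raises ValueError if there is none.
    end = symbol_buffer.index(0, pointer)
    # Decode the full byte run in one step so multi-byte characters survive.
    value = symbol_buffer[pointer:end].decode("utf-8")
    return value, end + 1


# Two null-terminated strings packed back to back (made-up sample data).
buffer = "São Tomé".encode("utf-8") + b"\x00" + "Príncipe".encode("utf-8") + b"\x00"

first, pos = read_cstring(buffer, 0)
second, pos = read_cstring(buffer, pos)
print(first, second)
```

Slicing up to the terminator does the same work as the byte-by-byte loop, and returning the next position lets the caller keep reading consecutive strings.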

I'm working on it and will release a corresponding patch release.

MuellerConstantin commented 5 months ago

The bug has been fixed in the latest patch release, v1.1.1. Reading and writing Unicode strings in QVD tables should work now.

MuellerConstantin / PyQvd

Unicode Characters in Strings get parsed to strange characters #3