faust-streaming / cChardet

universal character encoding detector

BUG: Incorrect Encoding Detection as Big5 #33

Open dxdc opened 2 months ago

dxdc commented 2 months ago

OS/Arch

macOS

Python version

python 3.11

cChardet version

2.1.18

What is the problem?

cChardet is incorrectly detecting the encoding of this file as Big5. I'm not sure if the issue should be posted here or elsewhere.

import cchardet as chardet

with open('abc_1.csv', 'rb') as f:
    result = chardet.detect(f.read())
print(result)
# result:
# {'encoding': 'BIG5', 'confidence': 0.9900000095367432}
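
Until the detection itself is fixed, one possible workaround is to cross-check the detector's guess with a strict decode before trusting it. The sketch below is only an illustration, not part of cChardet: `pick_encoding` and its `preferred` parameter are hypothetical names, and it assumes the file is really UTF-8 or another encoding you can list up front, which is not confirmed for `abc_1.csv`. It takes the raw bytes and the value of `result['encoding']`, tries the preferred encodings first, and falls back to the detected one only if it also decodes cleanly.

```python
from typing import Optional, Sequence


def pick_encoding(
    data: bytes,
    detected: Optional[str],
    preferred: Sequence[str] = ("utf-8",),
) -> Optional[str]:
    """Return the first candidate encoding that decodes `data` without errors.

    `preferred` encodings are tried before the detector's guess, so a
    high-confidence but wrong guess (e.g. BIG5) does not override valid UTF-8.
    Hypothetical helper for illustration; not a cChardet API.
    """
    candidates = list(preferred)
    if detected:
        candidates.append(detected)
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: raises on invalid bytes
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown codec name.
            continue
        return name
    return None


# Example: valid UTF-8 bytes are reported as UTF-8 even if the detector said BIG5.
print(pick_encoding("café".encode("utf-8"), "BIG5"))  # utf-8
```

A successful strict decode does not prove the encoding is correct (many single-byte encodings accept any input), so this only helps when the preferred list is short and strict, as UTF-8 is.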

Expected behavior

Actual behavior

Steps to reproduce the behavior

** Attached file

wbarnha commented 2 months ago

Thanks for providing a reference file! I see that the main project has been updated recently, so I'm going to pull in changes and see if there's any luck with the original maintainer's fixes.

dxdc commented 2 months ago

Thanks for checking this out so quickly! It was difficult to create that reference file, but hopefully it gives some insight into where the problem comes from.

dxdc commented 1 month ago

hi @wbarnha ... did you have any luck with this one?