PyYoshi / cChardet

universal character encoding detector

xed's detection is a bit better than cchardet's #103

Open JCCyC opened 2 months ago

JCCyC commented 2 months ago

OS/Arch

system='Linux', node='jclvdell', release='6.8.0-40-generic', version='#40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2', machine='x86_64'

Python version

3.10.12

cChardet version

2.1.7

What is the problem?

A file (attached) containing the Euro sign is correctly detected as ISO-8859-15 by the xed editor, but cchardet reports it as ISO-8859-1.

Expected behavior

Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou €313,84)

Actual behavior

Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou ¤313,84)

(Euro symbol appears as "¤")
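
For context, the two outputs differ because byte 0xA4 is "€" in ISO-8859-15 but "¤" in ISO-8859-1. A minimal sketch that needs only the standard library (the sample string is taken from the expected output above, not from the attached file):

```python
# Encode a fragment of the expected text as ISO-8859-15, then decode it both ways.
text = "Custo: 50000¥ (ou €313,84)"
raw = text.encode("iso-8859-15")

as_latin9 = raw.decode("iso-8859-15")  # what xed shows
as_latin1 = raw.decode("iso-8859-1")   # what you get when trusting cchardet's guess

print(hex(raw[text.index("€")]))  # the single byte that carries the Euro sign
print(as_latin9)
print(as_latin1)
```

Every other byte in the sample decodes identically under both encodings, so the mismatch only shows up at the Euro sign.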

Steps to reproduce the behavior

1) Get this file: pagininha2.html.gz

2) Do this:

$ gunzip pagininha2.html.gz
$ python
>>> import cchardet as chardet
>>> with open("pagininha2.html", "rb") as f:
...   msg = f.read()
...   result = chardet.detect(msg)
...   print(result)
... 
{'encoding': 'ISO-8859-1', 'confidence': 0.7640712261199951}
>>> 
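
A possible workaround on the caller's side, sketched under assumptions: `prefer_latin9` is a hypothetical helper, not part of cchardet, and it assumes that a 0xA4 byte in text reported as ISO-8859-1 is far more likely to be a Euro sign than the generic currency sign. It takes the detector's reported encoding as a plain string, so the example below runs without cchardet installed.

```python
from typing import Optional

EURO_BYTE = 0xA4  # '€' in ISO-8859-15, '¤' in ISO-8859-1


def prefer_latin9(data: bytes, detected: Optional[str]) -> Optional[str]:
    """Hypothetical helper: upgrade an ISO-8859-1 guess to ISO-8859-15
    when the data contains byte 0xA4. This is a heuristic, not a detector."""
    if detected and detected.upper() == "ISO-8859-1" and EURO_BYTE in data:
        return "ISO-8859-15"
    return detected


# "ISO-8859-1" stands in for the 'encoding' value returned by chardet.detect(msg).
raw = "Custo: 50000¥ (ou €313,84)".encode("iso-8859-15")
encoding = prefer_latin9(raw, "ISO-8859-1")
print(encoding)
print(raw.decode(encoding))
```

The check is limited to 0xA4 on purpose: the other positions where the two encodings differ (for example 0xBD, "½" versus "œ") are plausible in genuine ISO-8859-1 text, so treating them as evidence would cause false positives.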

milahu commented 3 days ago

For long inputs, I prefer charset_normalizer, but it's slower than cchardet.