JCCyC opened this issue 2 months ago (status: Open)
OS/Arch
system='Linux', node='jclvdell', release='6.8.0-40-generic', version='#40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2', machine='x86_64'
Python version
3.10.12
cChardet version
2.1.7
What is the problem?
A file (attached) with the Euro sign is correctly understood as ISO-8859-15 by the xed editor, but cchardet sees it as ISO-8859-1.
Expected behavior
Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou €313,84)
Actual behavior
Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou ¤313,84)
(Euro symbol appears as "¤")
Steps to reproduce the behavior
1) Get this file: pagininha2.html.gz
2) Do this:
    $ gunzip pagininha2.html.gz
    $ python
    >>> import cchardet as chardet
    >>> with open("pagininha2.html", "rb") as f:
    ...     msg = f.read()
    ...     result = chardet.detect(msg)
    ...     print(result)
    ...
    {'encoding': 'ISO-8859-1', 'confidence': 0.7640712261199951}
    >>>
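The "¤" in the actual output comes down to a single byte: 0xA4 is "€" in ISO-8859-15 and "¤" in ISO-8859-1, while the other characters in the sample (ç, õ, ñ, §, º, ª, °, ¥) sit at the same positions in both encodings. A minimal sketch of that difference, using only the standard library; the bytes here are produced by encoding a fragment of the sample text as ISO-8859-15, which is an assumption about how the attached file is stored:

```python
# Fragment of the sample text from the report. Encoding it as ISO-8859-15
# stands in for reading the attached file (an assumption, not the real bytes).
text = "Custo: 50000¥ (ou €313,84)"
raw = text.encode("iso-8859-15")

as_latin9 = raw.decode("iso-8859-15")  # the expected behavior
as_latin1 = raw.decode("iso-8859-1")   # what decoding with the detected encoding gives

# Collect every position where the two decodings disagree.
diff = [
    (hex(raw[i]), a, b)
    for i, (a, b) in enumerate(zip(as_latin9, as_latin1))
    if a != b
]

print(as_latin9)
print(as_latin1)
print(diff)
```

Only one entry ends up in `diff`, which suggests why a statistical detector has so little to go on: apart from byte 0xA4, this text is identical under both encodings.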
For long inputs, I prefer charset_normalizer, but it's slower than cchardet.
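Whichever detector is used, one possible workaround is to post-process its answer. The sketch below is a hypothetical helper (`prefer_latin9` is not part of cchardet or charset_normalizer) that upgrades an ISO-8859-1 result to ISO-8859-15 when the data contains byte 0xA4. It rests on the assumption that the generic currency sign "¤" is rare in real text, so it will mislabel a genuine ISO-8859-1 file that does use it.

```python
from typing import Optional

EURO_BYTE = 0xA4  # '€' in ISO-8859-15, '¤' in ISO-8859-1


def prefer_latin9(raw: bytes, detected: Optional[str]) -> Optional[str]:
    """Hypothetical post-processing step for a detector's result.

    If the detector reported ISO-8859-1 and the data contains byte 0xA4,
    assume the author meant the Euro sign and report ISO-8859-15 instead.
    Any other result is passed through unchanged.
    """
    if detected and detected.upper() == "ISO-8859-1" and EURO_BYTE in raw:
        return "ISO-8859-15"
    return detected


# Usage: the detected name is hard-coded here; in practice it would be the
# 'encoding' value returned by the detector.
raw = "ou €313,84".encode("iso-8859-15")
encoding = prefer_latin9(raw, "ISO-8859-1")
print(encoding, raw.decode(encoding))
```

The check is deliberately limited to 0xA4. The other bytes that differ between the two encodings map to characters such as "½" in ISO-8859-1, which are common enough that treating them as evidence for ISO-8859-15 would likely cause more errors than it fixes.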