frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Improve clarity of the encoding error #1376

Open aborruso opened 1 year ago

aborruso commented 1 year ago

Hi, I have this CSV file https://gist.github.com/aborruso/ec970c3a56596f9c014794466ce2f1d8

If I validate it via the CLI I get

'charmap' codec can't decode byte 0x9d in position 3116: character maps to <undefined>

If I try to inspect it

head -c 3116 input.csv | tail -c -1

I get nothing special, I don't see a strange character.

How can I use this validation error message to clean this CSV file?
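For reference, a quick way to look at the byte the error points to (assuming the reported position is a 0-based offset from the start of the file, in which case the shell command above stops one byte short) might be:

# hedged sketch: print the byte at the reported position plus some context
with open("input.csv", "rb") as f:
    data = f.read()

print(hex(data[3116]))   # expected to show 0x9d if the assumption holds
print(data[3100:3130])   # surrounding bytes for context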

Thank you

shashigharti commented 1 year ago

Hi @aborruso, thank you for reporting!

If I understand your question correctly, this should solve the issue:

frictionless validate finanziamenti.csv --encoding utf-8
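Assuming the Python API exposes the same option as the CLI flag, a rough equivalent via the Resource class might be:

from frictionless import Resource

# force the encoding instead of letting the framework detect it
resource = Resource("finanziamenti.csv", encoding="utf-8")
report = resource.validate()
print(report.valid)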

aborruso commented 1 year ago

If I understand your question correctly, it should solve the issue:

OK, thank you, I know that, but why doesn't it detect it as utf-8 automatically? Shouldn't that happen automatically?

Thank you again

shashigharti commented 1 year ago

Thanks! We will check.

aborruso commented 1 year ago

OK, thank you, I know that, but why doesn't it detect it as utf-8 automatically? Shouldn't that happen automatically?

Moreover, chardetect gives utf-8 with confidence 0.99.

shashigharti commented 1 year ago

Thank you!

It seems to be a bug. I also checked, and it is inferred as 'utf-8' with 99% confidence on the command line, but the same library in frictionless gives a different result: Windows-1252 (0.73).

roll commented 1 year ago

Hi, there are two aspects:

  • the underlying detection library (Python version of chardet) detects it as cp1252

aborruso commented 1 year ago

We should understand how chardetect works via the CLI, because that is the right result: via the CLI I get utf-8 with confidence 0.99.

Thank you @roll

shashigharti commented 1 year ago

@aborruso I was wrong because I did not consider the number of rows that the framework uses to predict the encoding.

Just to add to what @roll has said, if I use only the first 500 rows on the command line: head -500 finanziamenti.csv | chardetect

it predicts (same as the framework does): <stdin>: Windows-1252 with confidence 0.73

So if you run validation increasing the buffer size (5000/1000000), it infers the correct encoding and validation passes (as said above by evgeny): frictionless validate finanziamenti.csv --buffer-size 1000000
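The effect of the sample size can also be reproduced with chardet directly by feeding prefixes of different lengths (a sketch; the sizes below are arbitrary examples):

import chardet

with open("finanziamenti.csv", "rb") as f:
    data = f.read()

# the detected encoding typically changes with how much of the file chardet sees
for size in (10000, 100000, len(data)):
    print(size, chardet.detect(data[:size]))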

roll commented 1 year ago

Which chardet do you use in the CLI? (Note that different implementations might exist under the same name.)

aborruso commented 1 year ago

The standard https://github.com/chardet/chardet

And I simply run "chardetect input.csv"

aborruso commented 1 year ago

@roll also via code, not the CLI, I get utf-8.

The sample code:

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('finanziamenti.csv'):
    print(filename.ljust(60), end='')
    detector.reset()
    # feed the file line by line until the detector is confident
    for line in open(filename, 'rb'):
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    print(detector.result)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

roll commented 1 year ago

I think the difference is that Frictionless feeds the whole buffer (10000 bytes by default) to the chardet detector, and at some point in this file there is a weird character that confuses chardet. If we reduce the buffer size, it also detects utf-8:
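For instance, assuming the --buffer-size CLI option corresponds to the Detector's buffer_size parameter, a sketch with an arbitrarily chosen smaller value might be:

from frictionless import Resource, Detector

# buffer_size chosen arbitrarily for illustration; the default is 10000 bytes
detector = Detector(buffer_size=1000)
report = Resource("finanziamenti.csv", detector=detector).validate()
print(report.valid)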