aborruso opened this issue 1 year ago
Hi @aborruso, thank you for reporting!
If I understand your question correctly, it should solve the issue:
frictionless validate finanziamenti.csv --encoding utf-8
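For completeness, the same fix from Python; a minimal sketch, assuming frictionless-py's `Resource` accepts an `encoding` option mirroring the CLI flag:

```python
from frictionless import Resource

# Force utf-8 instead of relying on encoding detection.
resource = Resource('finanziamenti.csv', encoding='utf-8')
report = resource.validate()
print(report.valid)
```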
Ok, thank you, I know, but why doesn't it detect it as utf-8 automatically? Shouldn't that happen by default?
Thank you again
Thanks! we will check.
Moreover, chardetect gives utf-8 with confidence 0.99.
Thank you!
It seems to be a bug. I also checked, and it is inferred as 'utf-8' with 99% confidence on the command line, but the same library inside Frictionless gives a different result: Windows-1252 (0.73).
Hi, there are two aspects:
- the underlying detection library (the Python version of chardet) detects it as cp1252 (unfortunately, we can't fix the root cause on the Frictionless level)
- it is detected correctly if we use a bigger buffer size (see the sketch below): frictionless describe tmp/finanziamenti.csv --buffer-size 1000000
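The same workaround should be reachable from Python; a sketch, assuming frictionless-py's `Detector` exposes `buffer_size` the way the `--buffer-size` flag suggests:

```python
from frictionless import Detector, describe

# Let the encoding detector read far more than the 10000-byte default.
detector = Detector(buffer_size=1000000)
resource = describe('tmp/finanziamenti.csv', detector=detector)
print(resource.encoding)  # should now come out as utf-8
```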
We should understand what the mechanism of chardetect through the CLI is, because that's the right one: via the CLI I get utf-8 with confidence 0.99.
Thank you @roll
@aborruso I was wrong, because I did not consider the number of rows the framework uses to predict the encoding.
Just to add to what @roll has said, if I use only 500 rows on the command line:
head -500 finanziamenti.csv | chardetect
it predicts (the same as the framework does):
<stdin>: Windows-1252 with confidence 0.73
So if you run validation with an increased buffer size (5000/1000000), it infers the correct encoding and validation passes (as said above by evgeny):
frictionless validate finanziamenti.csv --buffer-size 1000000
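The 500-row experiment can also be reproduced directly against the chardet library, with no CLI and no Frictionless involved; a minimal sketch using `chardet.detect`:

```python
import itertools

import chardet

# Detect on the first 500 lines only, as in `head -500 ... | chardetect`.
with open('finanziamenti.csv', 'rb') as f:
    head = b''.join(itertools.islice(f, 500))

print(chardet.detect(head))
# Matches the CLI result above: Windows-1252 with confidence ~0.73.
```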
Which chardet do you use in the CLI? (Note that different implementations might exist under the same name.)
The standard one: https://github.com/chardet/chardet
And I simply run "chardetect input.csv".
@roll also via code, not the CLI, I get utf-8.
The sample code:

```python
import glob

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('finanziamenti.csv'):
    print(filename.ljust(60), end='')
    detector.reset()
    # Feed the file line by line and stop as soon as the
    # detector is confident enough.
    for line in open(filename, 'rb'):
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    print(detector.result)
```
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
I think the difference is that Frictionless feeds the whole buffer (10000 bytes by default) to the chardet detector, and at some point in this file there is some weird char that confuses chardet. If we reduce the buffer size, it also detects utf-8:
frictionless describe tmp/finanziamenti.csv --buffer-size 100
frictionless describe tmp/finanziamenti.csv --buffer-size 1000
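A quick sketch to see how the guess flips with the number of bytes chardet is given (10000 bytes being the Frictionless default mentioned above):

```python
import chardet

data = open('finanziamenti.csv', 'rb').read()
# Sweep prefix sizes around the 10000-byte default buffer.
for size in (100, 1000, 10000, 1000000):
    print(size, chardet.detect(data[:size]))
```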
Hi, I have this CSV file https://gist.github.com/aborruso/ec970c3a56596f9c014794466ce2f1d8
If I validate it via the CLI I get:
If I try to inspect it, I don't see anything special; I can't spot a strange character.
How can I use this validation error message to clean this CSV file?
Thank you