mcarans closed this issue 4 years ago
Thanks I'll investigate
@mcarans I've fixed the size of the sample for detection of remote sources and this now works fine:
$ tabulator 'https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112'
@roll Thanks for fixing. I just wanted to ask about the change "Limit sample size for detection if remote" - if the character that caused the issue with chardet is at the beginning of the file, will there still be a difference of behaviour between chardet and cchardet?
@mcarans
TBH it's a very confusing issue, so I'm not sure. It would be great if we could understand what went wrong and report this to chardet. Could it be a problem with the server (e.g. some weird ending byte)?
Yes, it is indeed confusing that it works as a local file but not as a remote URL. I can only presume that the sample sent to chardet somehow differs between the local file and the remote URL.
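One way a size-limited sample could differ from the full file, offered only as a hypothesis and not as a confirmed cause here: cutting a UTF-8 byte stream at a fixed size can split a multi-byte character, leaving a sample that is no longer valid UTF-8. The sketch below uses made-up data and hypothetical helper names (`is_valid_utf8`, `truncation_report`); it does not reproduce Tabulator's sampling logic.

```python
# Hypothetical data: "é" and "ï" each occupy two bytes in UTF-8.
data = "café,naïve\n".encode("utf-8")


def is_valid_utf8(sample: bytes) -> bool:
    """Return True if the byte sample decodes cleanly as UTF-8."""
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def truncation_report(raw: bytes, sizes) -> dict:
    """Map each sample size to whether raw[:size] is still valid UTF-8."""
    return {size: is_valid_utf8(raw[:size]) for size in sizes}


# A cut after 4 bytes lands in the middle of "é", so that sample is invalid.
report = truncation_report(data, [3, 4, 5])
print(report)
```

If the remote and local code paths read samples of different lengths, a detector could plausibly see a broken trailing byte in one case and not in the other.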
@roll, it is odd that chardet and cchardet give the same results when tested on the URL outside of Tabulator:
from urllib.request import urlopen
import chardet
import cchardet
rawdata = urlopen('https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112').read()
print(chardet.detect(rawdata))
print(cchardet.detect(rawdata))
gives:
{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
{'encoding': 'UTF-8', 'confidence': 0.7524999976158142}
I'm not sure how Tabulator, prior to your fix, was using chardet in a way that made it behave differently from cchardet on the URL, so I cannot produce a cut-down example to report against chardet.
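A possible way to hunt for a cut-down example, sketched under the assumption that the difference depends on how many bytes are sampled: run both detectors on growing prefixes of the data and report the first prefix length where they disagree. The helper names (`normalized_encoding`, `first_disagreement`) are hypothetical; the detectors are passed in as callables that return a dict with an `encoding` key, which is the shape `chardet.detect` and `cchardet.detect` return in the output above.

```python
from typing import Callable, Iterable, Optional

Detector = Callable[[bytes], dict]


def normalized_encoding(result: dict) -> str:
    """Lower-case the detected encoding; treat a missing or None value as ''."""
    return (result.get("encoding") or "").lower()


def first_disagreement(
    data: bytes,
    detect_a: Detector,
    detect_b: Detector,
    sizes: Iterable[int],
) -> Optional[int]:
    """Return the first prefix size where the two detectors disagree, else None."""
    for size in sizes:
        sample = data[:size]
        if normalized_encoding(detect_a(sample)) != normalized_encoding(detect_b(sample)):
            return size
    return None


# Possible usage with the rawdata fetched above (requires both packages):
#   import chardet, cchardet
#   size = first_disagreement(rawdata, chardet.detect, cchardet.detect,
#                             range(100, len(rawdata) + 1, 100))
```

If a disagreeing size turns up, `rawdata[:size]` would be a candidate for a minimal report against chardet; if none does, the cause is more likely in how the sample was read than in the detectors themselves.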
Overview
A script failed with the new Tabulator 1.38.1 and I wondered why. I narrowed it down to the change from cchardet to chardet. For this file: https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112 cchardet has no issues but chardet gives:
I saw an issue https://github.com/frictionlessdata/tabulator-py/issues/265 where someone experienced the opposite: chardet works but not cchardet. Obviously I can set things up to use cchardet, but I'd like to understand a bit better the discrepancies you've found between chardet and cchardet.
Please preserve this line to notify @roll (lead of this repository)