frictionlessdata / tabulator-py

Python library for reading and writing tabular data via streams.
https://frictionlessdata.io
MIT License

Make encoding detection customizable (chardet/cchardet/filemagic/custom) #308

Closed mcarans closed 4 years ago

mcarans commented 4 years ago

Overview

Further to a previously fixed bug, I have tried charset_normalizer and filemagic (which uses the library that drives file -i) on my link and on the URL from another closed Tabulator issue. On this very limited set, filemagic seems to work best, e.g.

from urllib.request import urlopen
import chardet
import cchardet
import charset_normalizer
import magic

with magic.Magic() as m:
    rawdata = urlopen('https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112').read()
    print(chardet.detect(rawdata))
    print(cchardet.detect(rawdata))
    print(charset_normalizer.detect(rawdata))
    print(m.id_buffer(rawdata))
    rawdata = urlopen('https://github.com/etalab/schema-irve/raw/v1.0.2/exemple-valide.csv').read()
    print(chardet.detect(rawdata))
    print(cchardet.detect(rawdata))
    print(charset_normalizer.detect(rawdata))
    print(m.id_buffer(rawdata))

gives:

{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
{'encoding': 'UTF-8', 'confidence': 0.7524999976158142}
{'encoding': 'cp932', 'language': 'English', 'confidence': 0.9533799533799534}
UTF-8 Unicode text, with very long lines

{'encoding': 'ISO-8859-1', 'confidence': 0.6878846153846153, 'language': ''}
{'encoding': 'ISO-8859-1', 'confidence': 0.9274604320526123}
{'encoding': 'utf_8', 'language': 'Simple English', 'confidence': 1.0}
UTF-8 Unicode text

Please preserve this line to notify @roll (lead of this repository)

roll commented 4 years ago

I think we need to expose this choice to the user.

filemagic was proposed already, but it can't be the default because we could run into the same installation problems we had with cchardet (macOS, Anaconda, etc.).

@mcarans WDYT about the interface? Adding the ability to customize detect_encoding, or just having pip install tabulator[filemagic] with the preference list filemagic -> cchardet -> chardet at runtime?

mcarans commented 4 years ago

@roll It would be good to be able to choose what gets installed via pip, and to be able to change the preference list, with a sensible default like the one you have suggested based on whichever libraries have been installed (filemagic -> cchardet -> chardet).
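A minimal sketch of how such a default could be derived from whatever is installed. This is an illustration of the idea discussed here, not tabulator's code: `PREFERENCE`, `available_detectors` and `pick_detector` are hypothetical names, and the availability check only looks at whether a module can be found.

```python
import importlib.util

# Preference order suggested in this thread, as (detector name, import name).
# Assumption: filemagic is imported as `magic`; python-magic uses the same
# import name, so this check alone cannot tell the two packages apart.
PREFERENCE = [
    ("filemagic", "magic"),
    ("cchardet", "cchardet"),
    ("chardet", "chardet"),
]


def available_detectors(preference=PREFERENCE, is_installed=None):
    """Return the detector names whose module can be found, in preference order."""
    if is_installed is None:
        # Default check: ask the import system without importing the module.
        def is_installed(module):
            return importlib.util.find_spec(module) is not None
    return [name for name, module in preference if is_installed(module)]


def pick_detector(preference=PREFERENCE, is_installed=None):
    """Return the most preferred installed detector, or raise if none is found."""
    names = available_detectors(preference, is_installed)
    if not names:
        raise RuntimeError("no encoding detection library is installed")
    return names[0]
```

Passing a different `preference` list changes the order, which covers the "change the preference list" part of the request; `is_installed` is only injectable so the logic can be exercised without installing anything.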

roll commented 4 years ago

@mcarans As I see it we would have:

pip install tabulator # chardet is the default as it's the only real cross-platform option
pip install tabulator[cchardet] 
pip install tabulator[filemagic]

And I would add a detect_encoding option to the Stream (making the choice explicit; raise if the chosen option is not installed):

stream = Stream(path) # chardet by default
stream = Stream(path, detect_encoding='cchardet')
stream = Stream(path, detect_encoding='filemagic')
stream = Stream(path, detect_encoding=detect) # custom function supplied by the user: detect(sample: bytes, encoding: str | None) -> str

mcarans commented 4 years ago

@roll that sounds great to me.
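For illustration, the `detect_encoding` option proposed above could be resolved roughly as follows. This is a sketch under stated assumptions, not the implementation that was eventually shipped: `resolve_detector` and `utf8_or_latin1` are hypothetical names, and the filemagic branch assumes its documented `Magic(flags=magic.MAGIC_MIME_ENCODING)` usage.

```python
import importlib


def resolve_detector(detect_encoding="chardet"):
    """Turn the option into a callable (sample, encoding=None) -> encoding name."""
    # A custom function is used as-is.
    if callable(detect_encoding):
        return detect_encoding

    if detect_encoding not in ("chardet", "cchardet", "filemagic"):
        raise ValueError(f"unknown detect_encoding option: {detect_encoding!r}")

    # filemagic is imported as `magic`; the other two match their package name.
    module_name = "magic" if detect_encoding == "filemagic" else detect_encoding
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Explicit choice: fail loudly instead of silently falling back.
        raise RuntimeError(f"{detect_encoding} is not installed") from exc

    if detect_encoding == "filemagic":
        def detect(sample, encoding=None):
            if encoding:
                return encoding
            # Assumption: filemagic API with the MIME-encoding flag.
            with module.Magic(flags=module.MAGIC_MIME_ENCODING) as m:
                return m.id_buffer(sample)
        return detect

    # chardet and cchardet both expose detect(bytes) -> dict with an "encoding" key.
    return lambda sample, encoding=None: encoding or module.detect(sample)["encoding"]


def utf8_or_latin1(sample, encoding=None):
    """Example custom detector: keep a given encoding, else try UTF-8, else Latin-1."""
    if encoding:
        return encoding
    try:
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"
```

A custom function such as `utf8_or_latin1` would then be passed as `detect_encoding=utf8_or_latin1`, while a string selects one of the optional libraries and raises if it is missing.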

roll commented 4 years ago

It's implemented in our new Frictionless Framework.

For more information, please read https://github.com/frictionlessdata/tabulator-py#tabulator-py