Closed mcarans closed 4 years ago
I think we need to expose this choice to the user. filemagic was already proposed, but it can't be the default because we could run into the same problems we had with cchardet (macOS/Anaconda/etc. installations).
@mcarans WDYT about the interface? Adding the ability to customize detect_encoding, or just having pip install tabulator[filemagic] with the preference list filemagic -> cchardet -> chardet applied at runtime?
@roll It would be good to be able to choose what gets installed via pip, and to be able to change the preference list, with a sensible default like the one you have suggested based on whichever libraries are installed (filemagic -> cchardet -> chardet).
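To make the "preference list based on whatever is installed" idea concrete, here is a minimal sketch of how the runtime choice could be resolved. It is not tabulator's code: `DETECTOR_MODULES`, `module_available` and `pick_detector` are hypothetical names, and the assumption that filemagic is importable as `magic` should be checked against the installed package.

```python
from importlib.util import find_spec

# Option name -> importable module name.
# Assumption of this sketch: filemagic is importable as "magic".
DETECTOR_MODULES = {
    "filemagic": "magic",
    "cchardet": "cchardet",
    "chardet": "chardet",
}

# Default order taken from the thread: filemagic -> cchardet -> chardet.
DEFAULT_PREFERENCE = ("filemagic", "cchardet", "chardet")


def module_available(module_name):
    """Return True if the module can be found in this environment."""
    return find_spec(module_name) is not None


def pick_detector(preference=DEFAULT_PREFERENCE, available=module_available):
    """Return the first option in `preference` whose library is installed."""
    for option in preference:
        if available(DETECTOR_MODULES[option]):
            return option
    raise RuntimeError("no encoding detection library is installed")
```

Passing a different `preference` tuple would let a user reorder or restrict the candidates, while the default still adapts to whichever extras were installed via pip.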
@mcarans As I see it we would have:
pip install tabulator # chardet is the default, as it's the only truly cross-platform option
pip install tabulator[cchardet]
pip install tabulator[filemagic]
And I would add a detect_encoding option to the Stream (making the choice explicit; raise if the option is not installed):
stream = Stream(path) # chardet by default
stream = Stream(path, detect_encoding='cchardet')
stream = Stream(path, detect_encoding='filemagic')
stream = Stream(path, detect_encoding=(sample: bytes, encoding?: string) => encoding: string) # custom encoding detection function
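For the custom-function case, the callable would take a byte sample plus an optional declared encoding and return an encoding name. The following is only an illustrative sketch of such a function using the standard library (a BOM check, then a UTF-8 validity check, then a Latin-1 fallback); the heuristic is an assumption made for the example, not what chardet, cchardet or filemagic do.

```python
import codecs

# BOM prefixes, UTF-32 first so its LE mark is not mistaken for UTF-16 LE.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(sample, encoding=None):
    """Guess the encoding of `sample` (bytes); honor an explicit `encoding`."""
    if encoding is not None:
        return encoding
    for bom, name in BOMS:
        if sample.startswith(bom):
            return name
    # Incremental decoding with final=False tolerates a sample that was
    # cut off in the middle of a multi-byte character.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
        return "utf-8"
    except UnicodeDecodeError:
        # Fallback chosen for this sketch: Latin-1 can decode any byte string.
        return "latin-1"
```

Under the interface proposed above, such a function would be passed as the detect_encoding argument in place of a library name.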
@roll that sounds great to me.
It's implemented in our new Frictionless Framework; please read for more information: https://github.com/frictionlessdata/tabulator-py#tabulator-py
Overview
Further to the previously fixed bug, I have tried charset_normalizer and filemagic (which uses the library that drives file -i) on my link and on the URL in another closed Tabulator issue. On this very limited set, filemagic seems to work best, e.g.
gives:
Please preserve this line to notify @roll (lead of this repository)