Strip BOM from text detected as UTF-16

frictionlessdata / tabulator-py

Python library for reading and writing tabular data via streams.

https://frictionlessdata.io

MIT License


Strip BOM from text detected as UTF-16 #194

Closed bz2 closed 7 years ago

bz2 commented 7 years ago

Depends on #193.

The encoding name returned by cchardet includes the byte-order suffix, but for Python codecs that suffix means the BOM is not handled: it is left in the decoded text as a leading U+FEFF character. Switching to plain 'UTF-16' ensures the BOM is stripped.
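For illustration, the difference between the two codec names can be reproduced with the standard library alone. This is a minimal sketch, not code from this PR; the sample string and variable names are made up for the example.

```python
import codecs

# "héllo" encoded as little-endian UTF-16 with a leading BOM (illustrative sample).
data = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")

# The endian-specific codec decodes the BOM as an ordinary character (U+FEFF).
with_suffix = data.decode("utf-16-le")

# The plain codec reads the BOM to pick the byte order, then drops it.
plain = data.decode("utf-16")

print(repr(with_suffix))
print(repr(plain))
```

With the suffixed codec the stray U+FEFF would typically end up glued to the first header cell of a table, which is why the plain name is preferable when a BOM is present.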

Note that cchardet does not detect an encoding for UTF-16 files without a BOM, but the code ensures the mapping only happens when the expected BOM is actually present.
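The guarded mapping described above might look roughly like the following. This is a sketch under assumptions rather than the code in this PR: `normalize_utf16_name` and `_BOMS` are hypothetical names, and the detected name is assumed to arrive as a string (or None when detection fails).

```python
import codecs

# BOM expected for each endian-specific codec, keyed by Python's canonical codec name.
_BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def normalize_utf16_name(sample, detected):
    """Hypothetical helper: map an endian-specific UTF-16 name to 'utf-16'.

    The mapping only happens when `sample` (bytes) starts with the BOM that
    matches the detected byte order; otherwise `detected` is returned as-is.
    """
    if not detected:
        return detected
    try:
        # Resolve aliases such as 'UTF-16LE' to the canonical 'utf-16-le'.
        canonical = codecs.lookup(detected).name
    except LookupError:
        return detected
    bom = _BOMS.get(canonical)
    if bom is not None and sample.startswith(bom):
        return "utf-16"
    return detected
```

Checking for the BOM first matters because plain 'utf-16' without a BOM falls back to the platform's native byte order, which could silently misread data of the other endianness.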

roll commented 7 years ago

@bz2 Thanks. It's very helpful! Released as tabulator>=1.4