dbcli / litecli

CLI for SQLite Databases with auto-completion and syntax highlighting
https://litecli.com
BSD 3-Clause "New" or "Revised" License

UTF-8 Decoding Error #113

Closed amjith closed 3 years ago

amjith commented 3 years ago

Description

Fixes #89

It turns out that Python's sqlite3 library decodes text as UTF-8 by default, which works fine since SQLite stores text as UTF-8 by default. But as you pointed out, invalid Unicode values can still sneak in. Thankfully, the Python library allows overriding the decoder it uses. So I've caught the exception and applied latin-1 decoding as a fallback.
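
A minimal sketch of the approach described above, not the actual code in this PR: Python's `sqlite3.Connection.text_factory` accepts any callable that receives the raw bytes of a TEXT value. The name `latin1_fallback_decoder` and the in-memory database are assumptions made for illustration.

```python
import sqlite3


def latin1_fallback_decoder(raw):
    """Decode as UTF-8; fall back to latin-1 if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this branch cannot fail.
        return raw.decode("latin-1")


conn = sqlite3.connect(":memory:")
conn.text_factory = latin1_fallback_decoder

# CAST(X'...' AS TEXT) yields a TEXT value holding the byte 0x80 followed by
# "abc", which is not valid UTF-8.
row = conn.execute("SELECT CAST(X'80616263' AS TEXT)").fetchone()
print(repr(row[0]))
```

With the default `text_factory` (`str`), the same query would raise an error instead of returning a row, which is the failure this change is meant to avoid.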

codecov-io commented 3 years ago

Codecov Report

Merging #113 (b71fb3d) into master (0baeadf) will increase coverage by 0.69%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #113      +/-   ##
==========================================
+ Coverage   62.34%   63.04%   +0.69%     
==========================================
  Files          23       23              
  Lines        1936     1986      +50     
==========================================
+ Hits         1207     1252      +45     
- Misses        729      734       +5     
Impacted Files                          Coverage Δ
litecli/main.py                         48.60% <ø> (+0.46%) :arrow_up:
litecli/packages/parseutils.py          96.87% <ø> (ø)
litecli/packages/special/iocommands.py  54.51% <ø> (+3.73%) :arrow_up:
litecli/sqlexecute.py                   70.40% <100.00%> (+1.49%) :arrow_up:
litecli/key_bindings.py                 14.58% <0.00%> (-2.09%) :arrow_down:
litecli/clibuffer.py                    33.33% <0.00%> (-1.97%) :arrow_down:
litecli/packages/special/main.py        88.23% <0.00%> (+0.93%) :arrow_up:
litecli/completion_refresher.py         76.56% <0.00%> (+1.98%) :arrow_up:
litecli/packages/special/dbcommands.py  38.62% <0.00%> (+2.19%) :arrow_up:
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a1a01c1...b71fb3d.

zzl0 commented 3 years ago

Update: this example shows that the simplified version is better (in my opinion):

>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('utf-8', 'backslashreplace')
'😊\\x80abc'
>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('latin-1')
'ð\x9f\x98\x8a\x80abc'

Just realized the utf8_resilient_decoder function could be simplified, e.g.:

def utf8_resilient_decoder(b):
    return b.decode("utf-8", "backslashreplace")

The examples below are from https://docs.python.org/3/howto/unicode.html#the-string-type

>>> b'\x80abc'.decode("utf-8", "strict")  
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
  invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

amjith commented 3 years ago

Yup. I like your solution better. I'll change the decoder to use backslashreplace.

Thanks for the feedback.
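
For reference, a rough sketch of how the simplified decoder from this thread might be attached to a connection. It assumes the decoder is installed through `sqlite3.Connection.text_factory`; the in-memory database and the test value are illustrative and not taken from the litecli code.

```python
import sqlite3


def utf8_resilient_decoder(b):
    # Invalid bytes become visible escapes such as \x80 instead of raising.
    return b.decode("utf-8", "backslashreplace")


conn = sqlite3.connect(":memory:")
conn.text_factory = utf8_resilient_decoder

# A valid UTF-8 emoji followed by a stray 0x80 byte and "abc", the same bytes
# used in the comparison earlier in the thread.
value = conn.execute("SELECT CAST(X'F09F988A80616263' AS TEXT)").fetchone()[0]
print(value)
```

Unlike the latin-1 fallback, valid multi-byte sequences in the same value are still decoded correctly, and only the offending byte is escaped.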