TabularOutputFormatter chokes on values that can't be converted to UTF-8 if it's a text column

dbcli / litecli

TabularOutputFormatter chokes on values that can't be converted to UTF-8 if it's a text column #89

Closed ibsusu closed 3 years ago

ibsusu commented 4 years ago

I don't know how to solve this. If there is a single record with a non-convertible column it won't print anything.

Could not decode to UTF-8 column 'verifier' with text ''��U'`

sqlite3 prints:

2bho15FMrSQhKAYnjBqRQ1x4LS3zcHANsRjKMJyiSwT9|GnyZktv2SaCehfNCGjoYdAgAirxpCjvBCUXH6MiEHEH7
`'ŜU|`'ŜU

Seeing this was actually helpful because it notified me that I had garbage data, but I still would've thought the other rows would print.

WesleyAC commented 3 years ago

Just ran into this myself, it would be great if there was some option to fix this! Perhaps an option to print the error message in the cell (in red, so it's clear it's not that literal text?)

I'm working with the Firefox history database, so sadly removing the malformed data is not an option :(

amjith commented 3 years ago

Do you have an example value that I could use to reproduce this?

WesleyAC commented 3 years ago

I uploaded an example database file here: https://hack.wesleyac.com/test.sqlite

It contains the invalid UTF-8 byte sequence \xc3\x28. Let me know if that's sufficient for you :)

amjith commented 3 years ago

Thank you @WesleyAC. I was able to reproduce the issue. The fix is now in a PR (pending review from other core devs).

Long form description of what is going on:

Turns out the sqlite3 library for Python decodes text as UTF-8 by default, which normally works fine since SQLite stores text as UTF-8 by default. But as you pointed out, invalid byte sequences can sneak in. Thankfully the Python library allows overriding the decoder it uses. So I've caught the exception and applied latin-1 decoding. Unfortunately this is a batch process, which means that if a single value contains an invalid byte, the whole result set has to use the fallback encoding of latin-1.
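
For readers who want to see the shape of such a fallback, here is a minimal sketch. It is not the code from the PR: `fetch_all_with_fallback` is a hypothetical helper, and the in-memory table is made up for illustration. It relies on the `text_factory` attribute of Python's `sqlite3` connections, and it catches both `sqlite3.OperationalError` and `UnicodeDecodeError` because which one surfaces may depend on the Python version and the decoder in use.

```python
import sqlite3


def fetch_all_with_fallback(conn, sql):
    """Hypothetical helper: fetch as UTF-8, redo the whole query as latin-1 on failure."""
    try:
        return conn.execute(sql).fetchall()
    except (sqlite3.OperationalError, UnicodeDecodeError):
        # Assumption: any failure here is a decode failure. A real tool would
        # want to inspect the error before retrying.
        original = conn.text_factory
        # latin-1 maps every byte to a character, so this decode cannot fail.
        conn.text_factory = lambda raw: raw.decode("latin-1")
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.text_factory = original


# Illustrative data: one valid UTF-8 value and one invalid byte sequence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('é')")
conn.execute("INSERT INTO t VALUES (CAST(X'C328' AS TEXT))")

rows = fetch_all_with_fallback(conn, "SELECT v FROM t ORDER BY rowid")
print(rows)
```

Note the cost of the batch behaviour: because the retry decodes every value as latin-1, the valid 'é' in the first row comes back as 'Ã©' as well.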

It seems to work well for now, but I can't use it to highlight the invalid value in red.

zzl0 commented 3 years ago

> Unfortunately this is a batch process which means, if a single value has an invalid byte value, the whole set has to use the fallback encoding of latin-1.

Seems we can use decode('utf-8', 'backslashreplace') to avoid this issue:

>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('utf-8', 'backslashreplace')
'😊\\x80abc'
>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('latin-1')
'ð\x9f\x98\x8a\x80abc'
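
One way this suggestion might be wired into a connection is sketched below. This is an assumption about how it could be applied, not what litecli does: `lenient_text` is a hypothetical name, and the table is made up for illustration. Since `text_factory` is called once per TEXT value, only the broken value is affected.

```python
import sqlite3


def lenient_text(raw: bytes) -> str:
    """Hypothetical decoder: keep valid UTF-8, escape only the invalid bytes."""
    return raw.decode("utf-8", "backslashreplace")


conn = sqlite3.connect(":memory:")
conn.text_factory = lenient_text  # called with the raw bytes of each TEXT value

conn.execute("CREATE TABLE t (v TEXT)")
conn.execute("INSERT INTO t VALUES ('é')")
conn.execute("INSERT INTO t VALUES (CAST(X'C328' AS TEXT))")

rows = conn.execute("SELECT v FROM t ORDER BY rowid").fetchall()
print(rows)
```

Here the valid row stays 'é', and the invalid \xc3 byte shows up as a literal backslash escape followed by '('.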

zzl0 commented 3 years ago

I dug into this issue a little. The root cause is:

SQLite uses a dynamic type system (a column's declared type is a recommendation, not a requirement). UTF-8 is the default encoding for the TEXT type, but SQLite does not check that a value is valid UTF-8 when it is inserted.
Python's sqlite3 library decodes TEXT columns as UTF-8 by default. When it encounters an invalid UTF-8 byte sequence, it fails with a UnicodeDecodeError: 'utf-8' codec can't decode byte ... error.
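
The two points above can be reproduced without a database file. The sketch below uses a made-up in-memory table and the \xc3\x28 bytes mentioned earlier in the thread. It catches both `sqlite3.OperationalError` and `UnicodeDecodeError`, on the assumption that the exception type seen by the caller can vary with the Python version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")

# CAST reinterprets the blob's bytes as text; SQLite does not validate them.
conn.execute("INSERT INTO t VALUES (CAST(X'C328' AS TEXT))")

# SQLite itself is happy: the value is stored as text with the raw bytes intact.
stored_type, stored_hex = conn.execute("SELECT typeof(v), hex(v) FROM t").fetchone()

# Reading the value through Python's default UTF-8 decoding is what fails.
error = None
try:
    conn.execute("SELECT v FROM t").fetchall()
except (sqlite3.OperationalError, UnicodeDecodeError) as exc:
    error = exc

print(stored_type, stored_hex, type(error).__name__)
```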

@amjith's CR fixed this issue by catching the UnicodeDecodeError and then trying to decode the value as latin-1.