Pascal-Krenckel / NppGZipFileViewer

A Notepad++ plugin to open and save files in the gzip format.
Apache License 2.0

Incorrect Text Encoding #1

Closed: rzcat closed this issue 2 years ago

rzcat commented 2 years ago

Hi, thanks for making this plugin. I was looking for a plugin that can deal with xz-compressed text, something like xzcat, but ended up finding this one. It's been very useful.

Problem: I have a txt file containing Asian characters, encoded in UTF-8. Opened with Npp, it displays correctly, with UTF-8 automatically selected as the encoding. When I gzip it with 7-Zip and open the resulting *.gz file with Npp, everything is garbled, and ANSI is automatically selected as the encoding. If I manually tell Npp that the file's encoding is UTF-8, all characters display correctly.

It would be great to have the decompressed text displayed in the correct encoding :)
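The symptom described above can be reproduced outside Notepad++. The following is only an illustrative sketch, not the plugin's code: it uses Python's standard `gzip` module in place of 7-Zip, and it assumes the "ANSI" code page is Windows-1252 (the real code page depends on the system locale).

```python
import gzip

text = "コンフィデンスマンJP 英雄編"

# UTF-8 without BOM, compressed and decompressed again.
raw = gzip.decompress(gzip.compress(text.encode("utf-8")))

# Interpreting the bytes as ANSI (assumed here to be Windows-1252) garbles them.
# errors="replace" is needed because a few byte values are undefined in cp1252.
as_ansi = raw.decode("cp1252", errors="replace")

# Interpreting the very same bytes as UTF-8 recovers the text.
as_utf8 = raw.decode("utf-8")

print("bytes unchanged by gzip:", raw == text.encode("utf-8"))
print("readable as ANSI:", as_ansi == text)
print("readable as UTF-8:", as_utf8 == text)
```

The bytes are identical in both cases; only the interpretation differs, which is consistent with the display being fixed by selecting UTF-8 manually.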

Pascal-Krenckel commented 2 years ago

It seems that npp has trouble automatically detecting the encoding if the file is UTF-8 encoded with Asian characters. I added an option to open all ANSI files (zipped) as UTF-8. I haven't tested it much yet.

rzcat commented 2 years ago

Thanks for the quick response. On my side, npp (v7.9.4 x64) has no problem automatically detecting the encoding of the original plain text; everything displays fine. The text file mentioned above is in UTF-8, not UTF-8-BOM. With your newer release (v1.2.0) and the "Open ANSI as UTF-8" checkbox CHECKED, the content of the gzipped file displays correctly. Somehow it works.

However, if I convert this txt file from "UTF-8" to "UTF-8-BOM" using npp (the size grows slightly) and then gzip it, the gzipped file is still displayed as a mess with your newer release and the "Open ANSI as UTF-8" checkbox UNchecked. Manually clicking "UTF-8-BOM" in npp's "Encoding" menu corrects the display.

I tested more. I converted this txt file from "UTF-8" to "UCS-2 LE BOM" using npp, then gzipped it using 7-Zip. With your newer release and the "Open ANSI as UTF-8" checkbox UNchecked, the gzipped file is displayed as a mess. But this time, manually clicking any item in npp's "Encoding" menu WON'T correct the display. If I unzip the gz file back to txt using 7-Zip, everything displays fine, so the compression is lossless. Windows Notepad says the txt is in "UTF-16 LE", while npp says it's in "UCS-2 LE BOM".

It seems like the decompressed content is corrupted? In the final test case above, I feel the decompressed content is missing too much, since selecting the encoding manually cannot make it display correctly.

The following line, which I use for testing, is a movie title containing Japanese, English and Chinese characters. Save it using Windows Notepad and see if the problem exists, if you have free time :)

コンフィデンスマンJP 英雄編
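The UTF-16 test case can be checked independently of Notepad++. This sketch assumes the file is UTF-16 LE with a byte order mark, as reported above, and again uses Python's `gzip` module rather than 7-Zip. It only confirms that the gzip round trip itself is lossless and that the resulting bytes are not valid UTF-8.

```python
import codecs
import gzip

title = "コンフィデンスマンJP 英雄編"

# UTF-16 LE with BOM (what npp labels "UCS-2 LE BOM").
original = codecs.BOM_UTF16_LE + title.encode("utf-16-le")

# Compress and decompress; gzip works on raw bytes and knows nothing about text.
restored = gzip.decompress(gzip.compress(original))
lossless = restored == original

# The restored bytes cannot be read as UTF-8: the BOM byte 0xFF is never valid there.
try:
    restored.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The "utf-16" codec consumes the BOM and picks the byte order from it.
decoded = restored.decode("utf-16")

print("lossless:", lossless)
print("valid UTF-8:", valid_utf8)
print("decodes as UTF-16:", decoded == title)
```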

Pascal-Krenckel commented 2 years ago

Thx, the decompression is not corrupted. The internal text editor of npp, Scintilla, only supports UTF-8 and not UTF-16. I just copied the data, so the encoding didn't match. Basically, npp reads the file, detects the encoding and (if needed) converts it to UTF-8. When the file is saved, npp converts it back. If only UTF-8 is used (both Scintilla's buffer encoding and the file's encoding), npp just copies the data. So it seems I cannot support UTF-16 automatically, since npp will try to convert the compressed file (just a byte array) as UTF-8 to UTF-16. I keep track of the file encoding myself, convert the data to UTF-8 for Scintilla, and just tell npp that it should use the UTF-8 view.
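The approach described in this comment (remember the file's encoding, hand UTF-8 to the editor, convert back on save) might look roughly like the sketch below. This is not the plugin's actual code: the function names `load_gz_as_utf8` and `save_utf8_as_gz` and the `BOMS` table are hypothetical, encoding detection is assumed to rely on the byte order mark only, and UTF-32 is not handled.

```python
import codecs
import gzip

# Hypothetical lookup table: byte order mark -> codec for the bytes after it.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def load_gz_as_utf8(data: bytes):
    """Decompress and return (utf8_buffer, bom, codec) for the editor."""
    raw = gzip.decompress(data)
    for bom, codec in BOMS:
        if raw.startswith(bom):
            text = raw[len(bom):].decode(codec)
            return text.encode("utf-8"), bom, codec
    # Assumption: without a BOM, pass the bytes through unchanged as UTF-8.
    return raw, b"", "utf-8"


def save_utf8_as_gz(buffer: bytes, bom: bytes, codec: str) -> bytes:
    """Convert the editor's UTF-8 buffer back to the remembered encoding."""
    text = buffer.decode("utf-8")
    return gzip.compress(bom + text.encode(codec))
```

A possible round trip under these assumptions: pass a gzipped UTF-16 LE file to `load_gz_as_utf8`, append a few characters to the returned UTF-8 buffer, and pass it to `save_utf8_as_gz` with the remembered `bom` and `codec`. The decompressed result is then UTF-16 LE with a BOM again.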

rzcat commented 2 years ago

Thanks, v1.2.1 works. Save a txt in UTF-16, gzip it, and open the .txt.gz with Npp: Npp says it's in UTF-8 now. Edit it directly, adding some more characters, click save in Npp, and close Npp. Ungzip the modified .txt.gz and open the extracted txt with Npp: it's in UTF-16.