celogeek / go-comic-converter

Convert CBZ/CBR/Dir into Epub for e-reader devices (Kindle Devices, ...)
MIT License

compressed package encoding bug #35

Closed ssbroad closed 4 months ago

ssbroad commented 4 months ago

When the program processes a compressed archive that contains folders with non-English names, the progress information shown at the bottom after opening the comic is garbled.

No folder: https://pixeldrain.com/u/4xUqyhxR
Non-English folder: https://pixeldrain.com/u/zNucPhri


Uncompressed directories and archives that do not contain folders do not have this problem.

celogeek commented 4 months ago

Can you try unzipping the file, keeping the subdirectory, and then building the ePub from the directory? The characters should of course appear correctly once you unzip.

I can see that it works in Apple Books when I do that, and fails when I use the zip directly. So it is a matter of reading the names properly.

I will have to check CBR as well.

ssbroad commented 4 months ago

Yes, errors occur if you directly use a zip/rar containing non-English folders. I think the program may not be handling Unicode encoding correctly when decompressing.

celogeek commented 4 months ago

I zipped it again after unzipping, using the zip tools from macOS, and the encoding inside seems to work. The names in the zip need to be in UTF-8 or CP437 (an ASCII superset). If you use another, custom encoding, it works, but only on the same machine or with the same language settings.

Can you try this zip? recompress.zip

celogeek commented 4 months ago

Here is what I get when I use your file:

2024/05/09 19:03:00 INFO img name="[\xb6\xfe\xebA\xcc\xc3\xd0\xd2][\xba\xcd\xd3\xea\xba\xcd\xc4\xe3]/double_p1_portrait-only.png" nonutf8=true
2024/05/09 19:03:00 INFO img name="[\xb6\xfe\xebA\xcc\xc3\xd0\xd2][\xba\xcd\xd3\xea\xba\xcd\xc4\xe3]/Dr.png" nonutf8=true
2024/05/09 19:03:00 INFO img name="[\xb6\xfe\xebA\xcc\xc3\xd0\xd2][\xba\xcd\xd3\xea\xba\xcd\xc4\xe3]/Dr2.png" nonutf8=true

And when I use the zip I recompressed:

2024/05/09 19:05:01 INFO img name=x/[二階堂幸][和雨和你]/Dr2.png nonutf8=true
2024/05/09 19:05:01 INFO img name=x/[二階堂幸][和雨和你]/Dr.png nonutf8=true
2024/05/09 19:05:01 INFO img name=x/[二階堂幸][和雨和你]/double_p1_portrait-only.png nonutf8=true

celogeek commented 4 months ago

For some reason both have the nonutf8=true flag, so I'm a bit confused.
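
A possible explanation, assuming the nonutf8 value in these logs mirrors the NonUTF8 field of Go's archive/zip FileHeader: when a name could be valid UTF-8, the reader appears to trust the ZIP UTF-8 flag (general-purpose bit 11), so a UTF-8 name written without that flag is still reported as non-UTF-8. The sketch below builds a small archive in memory with both kinds of names and compares the reported flag with a byte-level validity check. buildZip, inspect, and entryInfo are hypothetical names for illustration, not part of go-comic-converter.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"unicode/utf8"
)

// entryInfo pairs what archive/zip reports with a byte-level check.
type entryInfo struct {
	Name      string
	NonUTF8   bool // as reported by archive/zip
	ValidUTF8 bool // result of utf8.ValidString on the raw name
}

// buildZip writes empty entries with the given raw names and deliberately
// leaves the ZIP UTF-8 flag unset (NonUTF8: true), as some archivers do.
func buildZip(names ...string) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for _, name := range names {
		h := &zip.FileHeader{Name: name, Method: zip.Store, NonUTF8: true}
		if _, err := w.CreateHeader(h); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// inspect reads the archive back and records both signals for every entry.
func inspect(data []byte) ([]entryInfo, error) {
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var out []entryInfo
	for _, f := range r.File {
		out = append(out, entryInfo{f.Name, f.NonUTF8, utf8.ValidString(f.Name)})
	}
	return out, nil
}

func main() {
	gbkName := "[\xb6\xfe\xebA\xcc\xc3\xd0\xd2]/Dr.png" // GBK bytes taken from the log
	utfName := "[二階堂幸]/Dr.png"                          // the same folder name in UTF-8
	data, err := buildZip(gbkName, utfName)
	if err != nil {
		panic(err)
	}
	infos, err := inspect(data)
	if err != nil {
		panic(err)
	}
	for _, e := range infos {
		fmt.Printf("name=%q nonutf8=%v validUTF8=%v\n", e.Name, e.NonUTF8, e.ValidUTF8)
	}
}
```

If this assumption holds, checking the name bytes with utf8.ValidString would be a more reliable signal than the flag alone for deciding whether a name needs attention.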

celogeek commented 4 months ago

I confirm, the encoding in the zip is GBK:

$ echo -e '\xb6\xfe\xebA\xcc\xc3\xd0\xd2' | iconv -f gbk
二階堂幸

celogeek commented 4 months ago

Also, just in case: the "-strip" option removes the first level in the zip. But if you have a subfolder encoded in something other than UTF-8, you will have the same issue.
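
To illustrate the effect described here, a minimal sketch follows. stripFirstLevel is a hypothetical helper written for this example and is not the tool's actual "-strip" implementation. It drops the first path level and shows that a GBK-encoded folder disappears only when it is that first level.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stripFirstLevel drops the leading directory of a slash-separated archive
// path. Paths without a directory part are returned unchanged.
func stripFirstLevel(name string) string {
	if _, rest, found := strings.Cut(name, "/"); found && rest != "" {
		return rest
	}
	return name
}

func main() {
	names := []string{
		// GBK folder at the first level: stripping removes the bad bytes.
		"[\xb6\xfe\xebA\xcc\xc3\xd0\xd2]/Dr.png",
		// GBK folder one level deeper: the bad bytes survive stripping.
		"x/[\xb6\xfe\xebA\xcc\xc3\xd0\xd2]/Dr.png",
	}
	for _, n := range names {
		s := stripFirstLevel(n)
		fmt.Printf("%q -> %q validUTF8=%v\n", n, s, utf8.ValidString(s))
	}
}
```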

celogeek commented 4 months ago

By the way, you can see the preview of the TOC with:

-dry -dry-verbose

ssbroad commented 4 months ago

This file displays normally: https://github.com/celogeek/go-comic-converter/files/15265259/recompress.zip

The problem is indeed with zip/rar files that contain a GBK-encoded folder.

But this problem doesn't actually have any impact on reading.

Thanks for your solution, adding the '-strip' option really works! The TOC no longer contains garbled characters!

celogeek commented 4 months ago

Maybe I should add an option to disable the TOC, or block conversion if the directory name is not UTF-8. That way, the user has to do something about it.

I may also add an encoding option, for example to declare that my filenames are in GBK, so the zip file can be read properly.
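
The idea of blocking conversion on non-UTF-8 names could look roughly like the sketch below. checkNames is a hypothetical function name and the error wording is invented for illustration; nothing here reflects what the project actually implemented. An encoding option would additionally need a decoder for the declared charset (for GBK, the golang.org/x/text/encoding/simplifiedchinese package is one candidate), which is not shown.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkNames returns an error naming the first entry whose path is not valid
// UTF-8, so a caller can stop before a garbled TOC is generated.
func checkNames(names []string) error {
	for _, n := range names {
		if !utf8.ValidString(n) {
			return fmt.Errorf("entry %q is not valid UTF-8: re-create the archive with UTF-8 names or convert from an extracted directory", n)
		}
	}
	return nil
}

func main() {
	good := []string{"x/[二階堂幸][和雨和你]/Dr.png"}
	bad := []string{"[\xb6\xfe\xebA\xcc\xc3\xd0\xd2]/Dr.png"} // GBK bytes from the log

	fmt.Println("good:", checkNames(good))
	fmt.Println("bad: ", checkNames(bad))
}
```

Formatting the name with %q keeps the raw bytes visible as escapes in the message, which makes it easier to identify the original encoding, as was done with iconv earlier in the thread.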