Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0

[Bug]: Sigil does not support saving Unicode file paths in the manifest - Manifest error message - ePub 3.0 from InDesign #754

Closed Rajce007 closed 6 months ago

Rajce007 commented 6 months ago

Bug Description

Sigil does not support saving Unicode file paths in the manifest.

Some files always give me a "Manifest error" message after the first save of the edited EPUB in Sigil.

The symptoms are the same as in https://github.com/Sigil-Ebook/Sigil/issues/448

I found that Sigil gives me these errors if the original Adobe InDesign file was named with non-Latin characters.

Workaround that worked for me:

Rename the original InDesign file so that its name contains no non-Latin characters. After that, the file reopens correctly in Sigil.

Platform (OS)

macOS

OS Version / Specifics

Sonoma 14.4.1 arm64 version

What version of Sigil are you using?

Sigil.app-2.1.0-Mac-arm64

Any backtraces or crash reports

No response

kevinhendricks commented 6 months ago

Sorry, I have used non-latin characters many times for paths in the manifest with no issues at all. So something else is going on that is leading to these manifest items being determined to be missing.

Did InDesign properly URL-encode the paths as dictated by the EPUB specification?

Did InDesign properly normalize the Unicode and encode the file paths as UTF-8 strings before URL-encoding them?
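The order of operations asked about above (normalize, encode as UTF-8, then percent-encode) can be sketched in Python. This is only an illustration, not code from InDesign or Sigil; `manifest_href` and the sample file names are hypothetical.

```python
import unicodedata
from urllib.parse import quote, unquote


def manifest_href(path: str) -> str:
    """Build a percent-encoded href from a file path (hypothetical helper)."""
    # Step 1: normalize to NFC so composed and decomposed spellings agree.
    nfc = unicodedata.normalize("NFC", path)
    # Steps 2 and 3: quote() encodes the string as UTF-8 by default and then
    # percent-encodes each byte that is not unreserved; "/" stays a separator.
    return quote(nfc, safe="/")


decomposed = "Text/kapitola_c\u030c.xhtml"  # "c" followed by a combining caron
composed = "Text/kapitola_\u010d.xhtml"     # single precomposed character

# Both spellings produce the same href once NFC is applied first.
print(manifest_href(decomposed))
print(manifest_href(composed))

# Decoding the href gives back the NFC spelling of the path.
print(unquote(manifest_href(decomposed)) == composed)
```

If the normalization step is skipped, the two spellings percent-encode to different byte sequences even though they look identical on screen.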

So please attach an exact screen capture of the message about files missing from the manifest.

Next, unzip your Indesign epub and provide a detailed listing of the actual paths to those files so a byte by byte comparison can be made.

Finally copy and paste the Indesign generated OPF (before loading it into Sigil) so that it can be checked to be properly generated.

kevinhendricks commented 6 months ago

And please copy and paste here the original Indesign generated OPF file, so I can try see exactly which unicode codepoints are involved.

kevinhendricks commented 6 months ago

One other potential cause is that the zip library that creates the .epub does not properly set the flag bit that indicates that the zip file name has been utf-8 encoded.

That is a common problem on Windows based systems unlike Linux and MacOS which tend to include/use the official zlib.
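One way to check that flag without special tools is Python's standard `zipfile` module, which exposes each entry's general purpose flags as `ZipInfo.flag_bits`; bit 11 (0x800) is the one that marks a UTF-8 encoded name. The sketch below builds a small archive in memory purely for demonstration; `utf8_flags` and the entry names are made up for the example.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: entry name is UTF-8


def utf8_flags(archive_bytes: bytes) -> dict:
    """Map each entry name to True if its UTF-8 name flag is set."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in zf.infolist()
        }


# Demonstration archive: one ASCII name and one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("OEBPS/Text/plain.xhtml", "<html/>")
    zf.writestr("OEBPS/Text/kapitola_\u010d.xhtml", "<html/>")

flags = utf8_flags(buf.getvalue())
for name, is_utf8 in flags.items():
    print(is_utf8, name)
```

Python's own writer sets the bit only for names that cannot be encoded as ASCII. To inspect a real file, pass its bytes (for example `open(path, "rb").read()`) to `utf8_flags`.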

So please take your InDesign epub and strip it down to a single chapter with one of the problem file names, then replace that chapter's contents with nonsense, save the epub, and attach it here. That way nothing copyrighted is revealed and I have a test case to use on my own macOS dev box to track down what is happening.

Rajce007 commented 6 months ago

Hello Kevin,

Here are the files.

Error screenshot:

Snímek obrazovky 2024-05-05 v 10 14 52

EPUB 3.0 directly from InDesign: indd_original_name_characters_ebook_before_Sigil.epub.zip

The same EPUB after the first save in Sigil: indd_original_name_characters_ebook_SigilSAVE.epub.zip

Screenshot with the list of files inside:

Snímek obrazovky 2024-05-05 v 10 15 37

(The files are inside, but not accessible to the manifest.)

I don't know how to unzip an EPUB file on my Mac. I tried changing the extension from .epub to .zip, but it does not work...

Tom

kevinhendricks commented 6 months ago

Thank you for the test cases. Yes, to open an epub manually (since an epub is just a specially constructed zip file) you just rename a copy of it from .epub to .zip.

Then to force unzipping it I use the command line unzip tool in Terminal.app. Assuming your renamed epub is test.zip, here are the steps.

  1. create a folder on your Desktop to unpack the zip inside (to make it easy to delete afterwards); call it "mytest"

  2. copy the test.zip and put it inside your newly created "mytest" folder

  3. Open Terminal.app and use the following commands entered one per line followed by a return

    cd
    cd ~/Desktop/mytest
    unzip test.zip
    exit

Inside the mytest folder on your Desktop, you will find the unpacked zip
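For anyone who prefers to avoid the rename and the Terminal steps, the same unpacking can be sketched with Python's standard `zipfile` module, which reads the archive regardless of its extension. The function name `unpack_epub` and the paths in the usage comment are only examples.

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path, dest_dir):
    """Unpack an .epub (a zip archive) into dest_dir and return its entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Example usage (paths are placeholders):
#   names = unpack_epub(Path.home() / "Desktop" / "test.epub",
#                       Path.home() / "Desktop" / "mytest")
#   print("\n".join(names))
```

The returned list contains the entry names exactly as they are stored in the archive, which is useful for the byte by byte comparison requested earlier.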

But I can work with what you sent.

One question:

Did you create this InDesign epub on your mac or on some other Windows platform?

kevinhendricks commented 6 months ago

Okay, I took your indd_original_name_characters_ebook_before_Sigil.epub.zip and manually unzipped it to get the .epub back.

Then I opened Sigil with it. That epub is missing the xml document headers which Sigil automatically fixed and then it opened with no missing manifest files found at all. I was able to access every single file. I am on a macOS system with the older HFS+ case sensitive file system.

Is your mac by chance using the newer mac APFS file system?

The older HFS+ file system automatically created file names with Unicode Normalization NFD (with some minor variations based on the older Unicode standard). The newer APFS file system no longer does Unicode NFD normalization.

The epub's OPF should be using NFC normalized utf-8 strings.

Rajce007 commented 6 months ago

Hello Kevin,

Regarding your question ("Did you create this InDesign epub on your mac or on some other Windows platform?"): the InDesign EPUB file was created directly on the same Mac where I use Sigil. InDesign is the current Adobe CC 2024 version.

Tom

Rajce007 commented 6 months ago

About the macOS file system: yes, I bought a new MacBook Pro with an M3 Pro CPU, which probably can't use any other file system when working with Adobe CC. Tom

kevinhendricks commented 6 months ago

Interestingly, I did receive the missing manifest error message from the SigilSAVE version of that epub.

Am I confused by the naming? I thought the direct from InDesign one was the one with the issues, but instead it appears to be the SigilSAVE one that shows the errors.

Are the names messed up or did the errors only happen after saving the file in Sigil?

Rajce007 commented 6 months ago

The errors only happen after saving the file in Sigil. The first time this InDesign EPUB was opened in Sigil, there was no manifest error.

kevinhendricks commented 6 months ago

Okay, that was not something I understood earlier.

The bottom line is that something on the macOS side is encoding the manifest path hrefs as Unicode normalized to NFC form while the zip's (the .epub's) file entries are being Unicode normalized to NFD form (or vice versa). So although the strings appear to be exact matches, the actual byte sequences differ: one decomposes some of the characters into base character and accent, while the other uses the composed single-character form.

That causes the mismatch leading to the Missing Manifest errors.
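A small Python illustration of that mismatch, using the standard `unicodedata` module and a made-up file name: the two forms render identically but differ in length and in their UTF-8 bytes.

```python
import unicodedata

name = "kapitola_\u010d.xhtml"  # sample name with one accented letter

nfc = unicodedata.normalize("NFC", name)  # composed: one code point for the letter
nfd = unicodedata.normalize("NFD", name)  # decomposed: base letter + combining mark

print(nfc == nfd)           # a plain string comparison treats them as different
print(len(nfc), len(nfd))   # the decomposed form is one code point longer
print(nfc.encode("utf-8"))  # bytes of the composed form
print(nfd.encode("utf-8"))  # bytes of the decomposed form

# Normalizing both sides to the same form makes the comparison succeed.
print(unicodedata.normalize("NFC", nfd) == nfc)
```

Any lookup that compares a manifest href with an archive entry name byte for byte fails in exactly this way unless both are brought to the same normalization form first.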

In normalizing Unicode strings, the mac in this case is ass backwards. It used to force everything to be decomposed (but with special older Unicode rules). But the rest of the entire world, including the web, assumes that real Unicode strings are in NFC (composed) form. Linux just assumes it is a byte sequence and does not really care, which is bad too.

I have been reading the specs on the new APFS file system and it appears to no longer force things to the mac's preferred modified NFD normalization form. So now you can have mixes where some strings are in NFC form and other strings are in NFD form (when it comes to urls and paths).

That was probably a bad idea by Apple to make a change like that.

So I will need to fight with this a bit to see how best to force the file paths to use the same way of normalizing Unicode.

kevinhendricks commented 6 months ago

Upon further testing, the manifest entries produced when Sigil wrote the epub used the mac NFD variant while the zip container used the NFC variant. That caused the missing manifest error. The exact reverse could also possibly happen but I am unsure as I do not have a test for that case.

I will modify the macOS Export Sigil code to make sure the manifest entries are all normalized to NFC form. Hopefully that will prevent errors of this sort. This problem only exists on macOS platforms.
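To tell a normalization mismatch apart from a file that is genuinely absent, a check along the following lines could be run against the manifest hrefs and the archive entry names. This is a diagnostic sketch in Python, not Sigil's actual code; `classify_manifest` is a hypothetical helper, and the hrefs are assumed to be already percent-decoded and resolved relative to the archive root.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def classify_manifest(hrefs, zip_names):
    """Sort hrefs into exact matches, normalization-only mismatches, and missing."""
    exact = set(zip_names)
    normalized = {nfc(name) for name in zip_names}
    report = {"ok": [], "normalization_mismatch": [], "missing": []}
    for href in hrefs:
        if href in exact:
            report["ok"].append(href)
        elif nfc(href) in normalized:
            # Same file name, different Unicode normalization form.
            report["normalization_mismatch"].append(href)
        else:
            report["missing"].append(href)
    return report


# The case described above: manifest in decomposed form, archive in NFC.
zip_names = ["OEBPS/Text/kapitola_\u010d.xhtml", "OEBPS/Text/intro.xhtml"]
hrefs = ["OEBPS/Text/kapitola_c\u030c.xhtml", "OEBPS/Text/intro.xhtml"]
report = classify_manifest(hrefs, zip_names)
print(report)
```

Entries that land in `normalization_mismatch` are the ones a byte for byte lookup would report as missing even though the file is present in the archive.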

I guess that is why the epub people recommend sticking to ascii for file names as the number of different file systems used by all the e-readers plus the 3 major platforms is so huge and not all normalize the unicode strings in the same way.

kevinhendricks commented 6 months ago

It seems that Zip archive internal file names have no standard unicode normalization specification which is sheer madness.

So it seems we must force everything inside Sigil to Unicode Normalization Form C, given that the zip container (.epub) could have been created on any type of platform and could use any form of Unicode normalization it wants.
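The general idea of working in NFC internally while tolerating whatever form the archive uses can be sketched as an NFC-keyed index over the archive's entry names. This is a Python illustration under that assumption, not the fix that was pushed to Sigil; `nfc_index` and `read_member` are hypothetical names.

```python
import io
import unicodedata
import zipfile


def nfc_index(zf: zipfile.ZipFile) -> dict:
    """Map the NFC form of every entry name to the name actually stored."""
    index = {}
    for name in zf.namelist():
        key = unicodedata.normalize("NFC", name)
        if key in index and index[key] != name:
            # Two stored names that differ only by normalization form.
            raise ValueError(f"names collide after NFC: {index[key]!r}, {name!r}")
        index[key] = name
    return index


def read_member(zf: zipfile.ZipFile, index: dict, path: str) -> bytes:
    """Read an entry by path, whatever normalization form the caller used."""
    return zf.read(index[unicodedata.normalize("NFC", path)])


# Demonstration: the archive stores a decomposed name, the caller asks in NFC.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("OEBPS/Text/kapitola_c\u030c.xhtml", "<html/>")

with zipfile.ZipFile(buf) as zf:
    index = nfc_index(zf)
    data = read_member(zf, index, "OEBPS/Text/kapitola_\u010d.xhtml")

print(data)
```

The collision check matters because a file system or archive that does not normalize can hold two entries whose names differ only in normalization form, and silently merging them would lose a file.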

I have pushed a tentative fix for this to master. But it really needs to be tested heavily before the next release.

kevinhendricks commented 6 months ago

Thank you for your bug report and test cases. I will leave this issue open until a fully tested fix is in place.

kevinhendricks commented 6 months ago

I have been testing this on both my arm64 Mac Studio and my i7 MacBookPro and these changes seem to work and not cause any unpleasant side effects that I can detect.

So closing this issue as fixed.