Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0

[Bug]: 2.2.0, 2.2.1: Arabic and Hebrew letters are modified for no reason after saving #758

Closed johe123qwe closed 2 months ago

johe123qwe commented 2 months ago

Bug Description

These are my test results for 2.2.0 and 2.2.1, comparing the original text with what Sigil produces after saving. After saving, some characters are modified, and the modified content differs from the original text.

(screenshot: Sigil 2.2.0 comparison)

After testing, 2.1.0 does not have this problem. (screenshot: Sigil 2.1.0 comparison)

This is the original content. إن تقييم الله لأيوب مسجّل في العهد القديم: "لِأَنَّهُ لَيْسَ مِثْلُهُ فِي ٱلْأَرْضِ. رَجُلٌ كَامِلٌ وَمُسْتَقِيمٌ يَتَّقِي ٱللهَ وَيَحِيدُ عَنِ ٱلشَّرِّ" (أيُّوب 1: 8)

Platform (OS)

macOS

OS Version / Specifics

12.7.5

What version of Sigil are you using?

2.2.0 and 2.2.1

Any backtraces or crash reports

No response

kevinhendricks commented 2 months ago

Since Qt is identical across those two versions (2.1.0 and 2.2.0), the only change in Sigil 2.2.0 that could possibly be related is that all text internal to Sigil is now required to use Unicode Normalization Form C (NFC, composed), which is the common form on most Linux and Windows platforms. macOS uses NFD (decomposed, based on an older Unicode version) as its internal normalization form.

This change was made to prevent seemingly identical file names from the OPF and the zip container from comparing as different when in fact they were the same (one used form NFD while the other used form NFC).
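For illustration, a minimal Python sketch (the accented file name is just a hypothetical example) of how the same-looking name can compare as different until both sides are normalized:

```python
import unicodedata

# Hypothetical file name containing an accented character
name_nfc = unicodedata.normalize("NFC", "Ségolène.xhtml")  # composed form (e.g. from the OPF)
name_nfd = unicodedata.normalize("NFD", "Ségolène.xhtml")  # decomposed form (e.g. a macOS zip entry)

print(name_nfc == name_nfd)                                 # False: codepoint sequences differ
print(unicodedata.normalize("NFC", name_nfd) == name_nfc)   # True: identical once both are NFC
```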

Please attach a simple example epub generated by typing this single line directly (no cut and paste) in Sigil 2.1.0 and saving. Attach a second example epub which is that first one read in and saved by Sigil 2.2.0 (or 2.2.1).

Then reverse the procedure: type that same line fresh in Sigil 2.2.0 and save it, then load that new epub directly into Sigil 2.1.0 and save it.

From those test cases I should be able to see where in the process this normalization change is being introduced, i.e. whether it is the zipping or unzipping of the epub container, or whether something else inside Sigil/Qt is responsible.

johe123qwe commented 2 months ago

Thank you for your prompt reply. I appreciate your help in solving this problem.

I have attached the Sigil 2.1.0 epub and the Sigil 2.2.1 epub. This is a simple example.

epub.zip

johe123qwe commented 2 months ago

I edited my previous reply. I have compressed both epub files.

kevinhendricks commented 2 months ago

Thanks.

Unfortunately, I only have access to my MacBook Pro Intel laptop until Tuesday.

But if you are running macOS on Intel x86_64, I would be happy to create a test version of Sigil with just the new call that Unicode-normalizes each file on initial load removed, and upload it to my repo for you to grab. If that single change "fixes" the issue for you, then we know for sure we have the culprit. I will not understand why "yet", but it will give us a starting point.

If instead the problem persists then we will have to dig deeper.

Until then, I will start by trying to recreate what you are seeing on my laptop. Unfortunately, what I know about RTL languages like Hebrew and Arabic would not fill a thimble.

Hopefully visually I will be able to see the differences you point out in your bug screenshots.

Either way we will track this one down.

kevinhendricks commented 2 months ago

Okay, I grabbed Sigil 2.1.0 from releases and loaded your correct Sigil-2.1.0.epub. Then I created a Checkpoint of the epub.

Next, I grabbed Sigil 2.2.1 from releases and loaded your correct Sigil-2.1.0.epub. Then I ran a diff which compared them.

The diff said that the utf-16 char sequences in the line changed. That is okay and expected, as one utf-8 byte sequence was stored macOS NFD-normalized (decomposed order) and the other was stored as the utf-8 byte sequence that is NFC-normalized (composed). This is fine because there are multiple sequences of Unicode codepoints that can be combined (i.e. diacritics added) to create the same final Unicode character.
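For anyone following along, a small Python illustration (my own example, not taken from the epub) of how different codepoint sequences produce the same final character:

```python
import unicodedata

# Latin example: precomposed vs decomposed 'é'
print([hex(ord(c)) for c in "\u00e9"])                       # ['0xe9']
print([hex(ord(c)) for c in "e\u0301"])                      # ['0x65', '0x301']
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")   # True

# Arabic example: NFC reorders combining marks by canonical combining class
lam_shadda_kasra = "\u0644\u0651\u0650"    # lam + shadda (ccc 33) + kasra (ccc 32)
print([hex(ord(c)) for c in unicodedata.normalize("NFC", lam_shadda_kasra)])
# ['0x644', '0x650', '0x651']  -- kasra now precedes shadda; the rendered glyph is unchanged
```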

The real question is if there are differences in appearance on the screen.

I took two screenshots (one of the original in Sigil 2.1.0 and one from Sigil 2.2.1).

I have attached them.

(screenshots: nfc, original)

I just cannot see any on-screen character differences. Please point out to me where exactly these two screenshots differ in a way that is significant to the language, and not just in the order in which multiple diacritics are combined into the final character.

kevinhendricks commented 2 months ago

I really need to see the changes visually to track down this bug. Neither of your test epubs has the dir="rtl" attribute for the Hebrew language on the html tag. Does that impact the problem? Should I add them?

kevinhendricks commented 2 months ago

I re-ran this test just in case I messed up the old test somehow. Here is the procedure; please verify the results with your exact posted test epubs:

  1. Use Sigil 2.1.0 and open your testcase sigil2.1.0.epub. Once open, immediately make a Checkpoint of the epub (use Sigil's Checkpoint menu), then close out of Sigil and DO NOT save any changes.

  2. Make sure you have completely exited out of Sigil (not running in the background)

  3. Using Sigil (2.2.0 or 2.2.1, your choice) open up your sigil2.1.0.epub testcase once again. Now immediately go to Sigil's Checkpoint menu and compare differences. When it prompts you for a checkpoint, choose V001 (there should only be one anyhow). Sigil will then show a window of all files that changed. Select the xhtml file from that window and examine the comparison.

Do you see any characters in darker green (meaning a specific change) in that line on the right panel?

I have repeated this now 3 times, getting different results twice. Once it showed only light green, meaning the line bytes were different but ... two other times it showed a missing dot on what looked like a "j" immediately to the right of the " character.

The more times I try this the more seemingly random changes I am seeing.

I think the Qt string normalization routine is broken in some random way on macOS.

I will try a third case with a build of Sigil 2.2.1 with the "normalize to NFC" step taken away.

Please see if you can confirm any of what I am seeing.
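If it helps, an independent way to check whether two lines are canonically equivalent (i.e. differ only in normalization form), without relying on any GUI diff, is a couple of lines of Python (my own illustration):

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """True when two strings differ only in Unicode normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# e.g. compare the line from the 2.1.0 epub with the line from the 2.2.1 epub
print(canonically_equivalent("\u0644\u0651\u0650", "\u0644\u0650\u0651"))  # True: same character, different mark order
```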

kevinhendricks commented 2 months ago

Okay, the issue seems to be that the utf-8 bytes written to disk are sometimes interpreted as Unicode form NFC and sometimes as NFD by third-party code (Python and Perl), which is what Sigil's Checkpoint functions use (Checkpoint and Compare use Python).

That is a different problem than actually changing the content inside of the epub (other than its normalization form).

That said, if I copy and paste your original content line (written above) into Sigil 2.1.0 and save it; then open the newly saved file in Sigil 2.2.1, paste in a new copy of your original content line, and save it; then repeat that one more time; and then load it into Sigil 2.2.1 yet again and paste in a fresh copy of your original content line; and finally look at all 4 lines in Sigil's Preview, here is what I see:

(screenshot: multiple creates and saves)
kevinhendricks commented 2 months ago

I simply can not see any difference in those 4 lines.

The thing is, using Unicode Normalization Form C for all text in an epub is actually part of the official epub spec. So converting a read-in text file to use Form C must be part of the input process.

I am really stumped here. My compare tools are saying one thing (and utf-8 byte order for combining diacritics is allowed to vary) but my eyes are saying something else completely.

kevinhendricks commented 2 months ago

To make things even more interesting, I started with your sigil2.1.0.epub and extracted Section0001.xhtml into a file I called good.xhtml.

Then I loaded sigil2.1.0.epub into Sigil 2.2.1, then saved it to junk.epub

I then extracted Section0001.xhtml from junk.epub and called it bad.xhtml.

When I compare them character by character inside any version of Sigil, they appear exactly the same. When I diff them I can see that the NFC-normalized one uses a different combining order, but one that results in exactly the same output.

Then I ran a python program that would take a file and use unicodedata.normalize('NFC', data) to convert good.xhtml to good_nfc.xhtml.

Then I diffed good_nfc.xhtml and bad.xhtml and they were now byte by byte identical.
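The conversion boils down to something like this (a minimal sketch, not the exact script I ran):

```python
import unicodedata

with open("good.xhtml", encoding="utf-8") as f:
    data = f.read()

with open("good_nfc.xhtml", "w", encoding="utf-8") as f:
    f.write(unicodedata.normalize("NFC", data))

# Afterwards:  diff good_nfc.xhtml bad.xhtml   -> byte-for-byte identical
```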

So the Qt QString::normalized function works exactly as expected (no randomness at all). This is 100% repeatable. The epub spec says all text should use Unicode Normalization Form C (which Sigil 2.2.0 and later now do), and that conversion is done properly.

Now, command-line compare tools that try to show the differences in a GUI seem to freak out over a Unicode normalization change (as does Sigil's own Checkpoint compare). But the results inside Sigil, in both the CodeView window and the Preview window, show the proper results (as far as I can tell).

kevinhendricks commented 2 months ago

Okay, I have spent more time testing things and I do not think there is any bug here at all (except in Sigil's Checkpoint Compare display window). The epub spec calls for all Unicode content files to use Unicode Normalization Form C (NFC). A change in byte order does happen when you go from Unicode Normalization Form D (NFD) to NFC, but only for combining Unicode codepoints (accents and other marks that combine with base characters to form new ones). That difference in order does not impact the final character in any way. It just standardizes the sequences of combining characters so normal string searches and comparisons actually work in a cross-platform manner.

Wikipedia has a nice article on Unicode normalization forms and how they work, and of course so does the official Unicode standard.

As for why it impacted files when you did not "touch" them: Sigil is an epub editor, not a simple file editor. When you open any epub it probes each page for its character encoding and recodes it to Unicode (internally utf-16), then it makes sure each file uses Unicode Normalization Form C, removes Windows line endings, and loads each page into a QPlainTextEdit widget ready for editing. This allows search and replace across all css, xhtml, and other files. All files are saved when the epub is saved.
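A rough Python sketch of that clean-up sequence (only an illustration of the steps described above, not Sigil's actual C++/Qt code):

```python
import unicodedata

def load_text_resource(raw: bytes, encoding: str = "utf-8") -> str:
    text = raw.decode(encoding)                              # recode to Unicode
    text = unicodedata.normalize("NFC", text)                # enforce Normalization Form C (2.2.0+)
    text = text.replace("\r\n", "\n").replace("\r", "\n")    # drop Windows line endings
    return text
```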

Normally, if nothing is edited in the file, and it was already properly Unicode encoded, the file written out is identical to the original. In this case a bug fix was added in Sigil 2.2.0 to make sure the files are NFC encoded. Most Western languages that are ASCII-based are always in NFC form anyway, and Windows and Linux primarily use NFC form for input, so it is primarily only macOS that plays with NFD string encoding these days. But without that bug fix, a user on Windows who tries to read your epub and search for any word that uses combining accents would come up empty, since the byte order used under NFC form would differ from the byte order under your NFD (or mixed) form.

Thus the need for the bug fix to convert mixed form to NFC form as per the spec in Sigil.

So I am closing this as not a bug, since once you move to Sigil 2.2.0 or later, your xhtml and css code is properly normalized and will not change again.

Feel free to post additional questions here. Or provide evidence that what you see in the CodeView window or the Preview window of Sigil 2.2.0 or later is actually incorrect in some way (a real bug) due to the normalization conversion. It is a different byte order, but the exact same Unicode characters result, and that allows cross-platform search and replace to work.

johe123qwe commented 2 months ago

@kevinhendricks Thank you for your patient response. Due to work commitments, I only saw your reply today. It might be a bit late, and I hope you understand.

I carefully read through each of your responses. I have a question to consult with you.

I tested Sigil 2.2.1. This operation was carried out entirely within my Windows 11 virtual machine. When I copied the content from Google Docs into Sigil 2.2.1 and saved it, I found different content when I used difff.jp to compare. I copied the content from Google Docs again, and when I searched in Sigil, I did not get any results. I have been using Sigil for many years and never encountered this situation up through version 2.1.0. I look forward to your response.

I have attached the Google Docs, the comparison results, and the epub.

https://difff.jp/en/s2tjs.html

https://docs.google.com/document/d/1wr5N2m6DiYeFr8rBKz0IZLZ3oL4YDgmc/edit?usp=sharing

testsigil.zip

(screenshot: Sigil 2.2.1)

kevinhendricks commented 2 months ago

Did you type the text into the Find field? From what you wrote, you copied it from a Google Doc, which could itself have been copied from someplace else.

If you hand-typed it, that means your keyboard input is generating text in NFD Unicode form (which is something I have never seen, but is technically possible).

If you copied and pasted it from someplace else, the place you cut and pasted from is using NFD normalization for the text (possibly a document from the Mac side, or from before Sigil 2.2.0? I can't tell).

As for using a Windows VM on a Mac, I have no idea whether it defaults to NFD or NFC form, since Macs use NFD (at least a variant of it).

If you create an epub in Sigil 2.2.1 and your keyboard input device generates NFD form, then some of the text will be NFC form and some will be NFD form. That is always going to cause problems for search. That was the whole reason the original epub standard required NFC form.

Sigil 2.1.0 and earlier never enforced NFC form. So to test things properly, use Sigil 2.1.0 on a real Windows box, hand-type in that line, and save it. Do not cut and paste from some external source, and do not open the epub again. Post it here, and I will unzip it down to the files without involving Sigil at all, to see what Unicode normalization form was used (something like the snippet below will do). If it turns out to be NFD, then that must be a feature of the keyboard input for Hebrew and Arabic and not just on a Mac.

If that is the case, then forcing NFC normalization to meet the spec is not a good idea, as your document would end up mixing NFD and NFC forms. So we would need to change Sigil in some way to only run the NFC normalization once, after the epub is complete and just before it is published.

If so, I will add a Sigil Preference setting to enable and disable it, but then the onus is on the epub dev to meet the spec for publishing.
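If you want to check the result yourself without involving Sigil, something like this Python snippet will report the normalization form of the xhtml inside the epub (the file names are placeholders, and is_normalized needs Python 3.8+):

```python
import unicodedata
import zipfile

with zipfile.ZipFile("test.epub") as z:                          # placeholder epub name
    text = z.read("OEBPS/Text/Section0001.xhtml").decode("utf-8")  # placeholder internal path

print("NFC:", unicodedata.is_normalized("NFC", text))
print("NFD:", unicodedata.is_normalized("NFD", text))
```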

This whole Unicode normalization business is a real mess. Why they designed things with multiple sequences that generate identical characters in the end makes no sense at all to me. But that is what you get when you design things by international committee.

So no rush; when you get a chance, please help determine what Unicode normalization form is generated by typing on a Windows box in Hebrew or Arabic. If it is NFD, then I must change things. We already know that on a Mac it is NFD. In other words, if Hebrew and Arabic require NFD on all platforms, then my last fix needs to be altered in some way.

From the sounds of things most software does not seem to care and will happily mix NFC form text snippets with NFD form text snippets and hope that no one tries searching it.

kevinhendricks commented 2 months ago

Hmm, just found this at unicode.org. It seems Biblical Hebrew and Arabic are different:

[QUOTE] Q: But isn't there still a problem with Biblical Hebrew? A: There was a problem, but it has been addressed. Because the Hebrew points are defined to have distinct combining classes, their character semantics is such that their ordering is immaterial in the standard. To handle those cases where visual ordering is material, see the discussion of the Combining Grapheme Joiner (CGJ) in Section 23.2, Layout Controls, in the Unicode Standard. [/QUOTE]

And there is a similar passage for Arabic, so it seems some language-specific changes are needed.
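A quick Python illustration of the reordering the FAQ is talking about, using two Hebrew points with different combining classes (my own example, for illustration only):

```python
import unicodedata

alef, patah, hiriq, cgj = "\u05d0", "\u05b7", "\u05b4", "\u034d"

# patah (ccc 17) typed before hiriq (ccc 14): canonical ordering swaps the two marks
print([hex(ord(c)) for c in unicodedata.normalize("NFC", alef + patah + hiriq)])
# ['0x5d0', '0x5b4', '0x5b7']

# Inserting the Combining Grapheme Joiner (ccc 0) between them blocks the reordering
print([hex(ord(c)) for c in unicodedata.normalize("NFC", alef + patah + cgj + hiriq)])
# ['0x5d0', '0x5b7', '0x34d', '0x5b4']
```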

johe123qwe commented 2 months ago

This is the result of my test on a real Windows computer.

I can't type the content manually because I don't know Arabic. I'm sorry.

I used version 2.2.1 and copied and pasted the content from Google Docs into Sigil. After saving, I still can't search for the content.

test2.1.zip

https://difff.jp/azk24.html

(screenshots attached)

kevinhendricks commented 2 months ago

Unfortunately, I was looking to determine what the keyboard input produces (is it NFD or NFC form?) for those languages on a Windows or Linux machine. When you copy and paste from another document whose origin and Unicode normalization are unknown, I have no way of telling what normalization form actual keyboard input produces.

Please simply stick to using Sigil 2.1.0 until I can collect enough information to understand what native keyboard input produces for Hebrew and Arabic (is it Unicode Normalization Form C or D?) on Windows and Linux boxes. Until I know that, I cannot determine the right way to change things and still produce epubs that meet the epub standard.

kevinhendricks commented 2 months ago

I posted information on our User Forum to ask for help. People with more knowledge of Arabic than I have confirmed that NFC normalization does not hurt the text, nothing is lost, etc. Unfortunately, you are cutting and pasting from a document that uses decomposed (NFD) form, which is what is causing your later search issues.

The only sane way to handle text where multiple sequences can create the same Unicode character, as far as Searching/Find and Replace goes, is to convert it all to the same Unicode normalization form (use the exact same sequence for all occurrences). In this case that is NFC. This includes normalizing text pasted into CodeView and into the Find and Replace field as well (in fact converting on every access).
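Conceptually it is just this (a tiny Python sketch of the idea, not Sigil's actual search code):

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

doc_nfd   = "ra\u0308son"      # document text pasted from an NFD source ('a' + combining diaeresis)
query_nfc = "r\u00e4"          # Find text in precomposed form

print(query_nfc in doc_nfd)              # False: codepoint sequences differ
print(nfc(query_nfc) in nfc(doc_nfd))    # True: both sides normalized to NFC first
```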

This will prevent the issue you are seeing and reported when we re-opened this issue.

I have now got a working version of Sigil that more thoroughly handles converting to NFC form.

In your particular case I recommend staying with Sigil 2.1.0 until the next version of Sigil is released since you are constantly copying and pasting from NFD form data into Sigil's CodeView and then into Find and Replace. It is better for the document to use all NFD form than to mix NFC and NFD form until the new version of Sigil comes out that will enforce NFC more completely.

Hope this helps.

Thank you for your bug report.

johe123qwe commented 2 months ago

Thanks for your reply. Looking forward to your next release. Wish you all the best.

kevinhendricks commented 2 months ago

I would appreciate it if you could test your Arabic and Hebrew epubs with this beta test build of Sigil, which includes changes to sync Find & Replace and CodeView under Unicode Normalization Form C, and report back your results.

https://www.mobileread.com/forums/showpost.php?p=4436549&postcount=1

Please let me know if you run into any anomalous behaviour. It should address the issues you brought up.

johe123qwe commented 2 months ago

Sorry, I did not see your message right away. I only saw it when I came back to read this thread.

I tested 2.3.0 and there was no problem. I copied the content and I can search for it in Sigil. Thank you for your excellent work. Thank you! 👍

(screenshots: ar-diff, ar-sigil)

dougmassay commented 2 months ago

Messages are typically forwarded to the email address associated with your GitHub account. Perhaps you turned off notifications? Anyway .... all's well that ends well. Thanks for reporting back.


johe123qwe commented 2 months ago

Hi, I tested and found that version 2.3.0 will modify some punctuation marks. I found this when testing Greek. Here are the comparison results: the left is the original text, and the right is the result after Sigil saves it. I tested 2.1.0 and there is no such problem.

https://difff.jp/xxzmx.html

kevinhendricks commented 2 months ago

Yes, just as before, NFC normalization will pre-compose all characters. The final rendered version will be identical.

In your Greek example I could not see any resulting difference.

If there is one please blow up the image and circle the differences.

Your diff is comparing NFC form to an NFD form example which will always compare as different.

kevinhendricks commented 2 months ago

Oh, it is the height of the center circle (dot) in the very last character. This is expected behaviour given the latest Unicode standard.

See: https://www.unicode.org/review/pri491/
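Assuming the punctuation mark in question is GREEK ANO TELEIA, the substitution can be reproduced in one line of Python: U+0387 has a canonical (singleton) decomposition to U+00B7 MIDDLE DOT, so NFC replaces it.

```python
import unicodedata

print(hex(ord(unicodedata.normalize("NFC", "\u0387"))))   # '0xb7': GREEK ANO TELEIA -> MIDDLE DOT
```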

johe123qwe commented 2 months ago

Thank you for your excellent work. I asked ChatGPT about the difference between NFC and NFD.

NFC: Typically used for storing and transmitting text because it normalizes characters into a single precomposed form, reducing the number of characters.

NFD:

In my work, I often search and compare texts. We need the original text to be consistent with the content output by Sigil. However, NFC will cause some characters of the original text to be modified, and I can't search the content using the original text. Can NFD be retained in future versions of Sigil? Or do you have any suggestions? Thank you. 🤝

kevinhendricks commented 1 month ago

If you do search, NFC is the recommended form, not NFD. The only thing that is easier in decomposed (NFD) form is upper- and lower-casing letters, since all the accents are separate from the base character.

So that ChatGPT response was very poor.

Please note that all URLs from links and file paths must be in NFC form according to the epub3 spec. Sigil will not violate that. For epub2, the required form is NFC for all content documents.

I will work on a way to disable non-essential conversion to NFC via an environment variable, but using it would mean that many e-reader devices would fail when searching for any accented text in your epubs. That is the whole reason for standardizing on NFC in the first place: so that everyone normalizes Unicode the same way and searching is possible.
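A sketch of what such an opt-out might look like (the environment variable name here is purely hypothetical and the final behaviour is still to be decided):

```python
import os
import unicodedata

def maybe_normalize(text: str) -> str:
    # Hypothetical opt-out switch; the real variable name, if any, is not decided yet.
    if os.environ.get("SIGIL_DISABLE_NFC") == "1":
        return text                               # keep the author's normalization form
    return unicodedata.normalize("NFC", text)     # default: enforce NFC per the epub spec
```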