aerkalov / ebooklib

Python E-book library for handling books in EPUB2/EPUB3 format -
https://ebooklib.readthedocs.io/
GNU Affero General Public License v3.0
1.49k stars 234 forks

add normalization to the files/chapters name #288

Open BassantAbdelaziz opened 1 year ago

BassantAbdelaziz commented 1 year ago

solve normalization issue

aerkalov commented 1 year ago

Thanks for this. I was just checking the docs, and they say "All file names within the same directory MUST be unique following Unicode canonical normalization and then full case folding". I am not that good with Unicode and will have to read a bit more about it, but do you know what this "full case folding" they are talking about is?

BassantAbdelaziz commented 1 year ago

@aerkalov Thank you for your interest and reply. Allow me to explain the reasons behind the changes I made to the code.

I used the ebooklib library to process Arabic EPUBs, extracting essential information from the OPF file (the spine, the manifest, and the publisher name) and reading the content of each chapter. However, I ran into a problem with a file/chapter name, which was نهائي_الخبر_الرشيد. The library requires that the file name used to access an item in the EPUB archive match the actual file name stored in the archive.

The error was caused by certain Arabic characters that require normalization, such as 'ئ' and 'ئ', which look identical but can be encoded differently. I therefore added normalization for Arabic letters so that these characters are handled consistently in file names.

In Arabic, characters with diacritics such as Hamza and Madda can be represented in more than one way, which can lead to inconsistencies in file names. Normalization converts these characters to their base forms with specific diacritics, so that the file names are standardized.
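
To make the two representations concrete, here is a minimal sketch using Python's standard `unicodedata` module. It is only an illustration, not the code from this patch, and the choice of NFC as the normalization form is an assumption made for the example.

```python
import unicodedata

# Precomposed form: a single code point, U+0626 (YEH WITH HAMZA ABOVE).
precomposed = "\u0626"
# Decomposed form: U+064A (YEH) followed by U+0654 (combining HAMZA ABOVE).
decomposed = "\u064A\u0654"

# The two strings render the same but differ when compared as raw strings.
raw_equal = precomposed == decomposed

# After canonical normalization (NFC in this sketch) they are the same string.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal, nfc_equal)
```

A file name stored in one form and requested in the other fails a plain string comparison, which matches the kind of lookup error described above.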

By normalizing the file names, I was able to resolve the error encountered while accessing items in the EPUB archive. This solution ensures that the specified file name in the code matches the actual file name in the archive, thus enabling smooth processing of Arabic EPUBs with accurate and consistent file names.
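
The lookup idea can be sketched as follows, using only the standard `zipfile` and `unicodedata` modules. The helper names `normalize_name` and `read_member` are hypothetical and are not part of ebooklib's API; NFC is again assumed as the normalization form.

```python
import unicodedata
import zipfile


def normalize_name(name: str) -> str:
    """Hypothetical helper: apply canonical normalization (NFC) to a path."""
    return unicodedata.normalize("NFC", name)


def read_member(archive: zipfile.ZipFile, requested: str) -> bytes:
    """Read the member whose stored name equals `requested` after normalization."""
    wanted = normalize_name(requested)
    for stored in archive.namelist():
        # Compare normalized forms, but read using the name as stored.
        if normalize_name(stored) == wanted:
            return archive.read(stored)
    raise KeyError(requested)
```

On the "full case folding" question: as far as I can tell, Python's `str.casefold()` performs full case folding, so comparing `normalize_name(a).casefold()` with `normalize_name(b).casefold()` would approximate the uniqueness rule quoted from the docs.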
