Retrieve epub documents in spine order

prydom commented 4 months ago

The epub to text conversion was not extracting text in anything resembling chapter order on a few epubs that I tried it with.

The root cause is that the items in the manifest are not listed in the same order as the book's spine.

Below is an excerpt of a content.opf file from one of the books. Notice that all the content inserts are listed before any chapters, although they should be interleaved between chapters. The inserts are not even in order: in the manifest below the sequence is 6, 1, 2, 3, 4, 5, 7, 8.

To resolve this issue, we build a dictionary of all manifest items of type ebooklib.ITEM_DOCUMENT keyed by ID, then iterate through the content IDs in the spine and look each one up in that dictionary.

  <manifest>
    <item id="cover" href="Text/cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="style" href="Styles/stylesheet.css" media-type="text/css"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="toc" href="Text/toc.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="tocimg.xhtml" href="Text/tocimg.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert6.xhtml" href="Text/insert6.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert1.xhtml" href="Text/insert1.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert2.xhtml" href="Text/insert2.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert3.xhtml" href="Text/insert3.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert4.xhtml" href="Text/insert4.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert5.xhtml" href="Text/insert5.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert7.xhtml" href="Text/insert7.xhtml" media-type="application/xhtml+xml"/>
    <item id="insert8.xhtml" href="Text/insert8.xhtml" media-type="application/xhtml+xml"/>
    <item id="TOC.jpg" href="Images/TOC.jpg" media-type="image/jpeg"/>
    <item id="chapter1_2.xhtml" href="Text/chapter1_2.xhtml" media-type="application/xhtml+xml"/>
    <item id="chapter2_1.xhtml" href="Text/chapter2_1.xhtml" media-type="application/xhtml+xml"/>
    [...]
  </manifest>
  <spine page-progression-direction="ltr" toc="ncx">
    <itemref idref="cover"/>
    <itemref idref="tocimg.xhtml"/>
    <itemref idref="characters1.xhtml"/>
    <itemref idref="characters2.xhtml"/>
    <itemref idref="chapter1.xhtml"/>
    <itemref idref="insert1.xhtml"/>
    <itemref idref="chapter1_1.xhtml"/>
    <itemref idref="insert2.xhtml"/>
    <itemref idref="chapter1_2.xhtml"/>
    <itemref idref="chapter2.xhtml"/>
    <itemref idref="insert3.xhtml"/>
    <itemref idref="chapter2_1.xhtml"/>
    [...]
  </spine>
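
For readers who want to see the idea in isolation, here is a minimal sketch of the lookup described above. It is not the code from this PR: the helper name `documents_in_spine_order` is hypothetical, the hrefs stand in for real document objects, and skipping spine entries that have no matching document is an assumption made for the sketch. The IDs are a subset of those in the excerpt.

```python
# Hypothetical helper: the name and signature are illustrative, not from the PR.
def documents_in_spine_order(manifest_docs, spine_idrefs):
    """Return manifest documents reordered to follow the spine.

    manifest_docs: iterable of (item_id, document) pairs in manifest order.
    spine_idrefs: iterable of idref strings in reading order.
    """
    # Index every document by its manifest id.
    by_id = dict(manifest_docs)
    # Assumption: spine entries with no matching document are skipped.
    return [by_id[idref] for idref in spine_idrefs if idref in by_id]


# A subset of the ids from the excerpt; hrefs stand in for document objects.
manifest = [
    ("cover", "Text/cover.xhtml"),
    ("tocimg.xhtml", "Text/tocimg.xhtml"),
    ("insert6.xhtml", "Text/insert6.xhtml"),
    ("insert1.xhtml", "Text/insert1.xhtml"),
    ("insert2.xhtml", "Text/insert2.xhtml"),
    ("chapter1_2.xhtml", "Text/chapter1_2.xhtml"),
]
spine = ["cover", "tocimg.xhtml", "insert1.xhtml", "insert2.xhtml", "chapter1_2.xhtml"]

ordered = documents_in_spine_order(manifest, spine)
for href in ordered:
    print(href)
```

With ebooklib, the pairs would presumably come from `book.get_items_of_type(ebooklib.ITEM_DOCUMENT)` with `item.get_id()` as the key, and the idrefs from the first element of each `book.spine` entry. Treat that mapping as an assumption and check it against the merged diff.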

prydom commented 4 months ago

Note that my editor also added a missing import and cleaned up some trailing whitespace. If this is not desired, I can rebase the diff. Please let me know.

aedocw commented 4 months ago

I'll try to take a look at this later today, thanks for the submission!

aedocw commented 4 months ago

This is really nice! I have not run into the issue you mentioned, but this approach is much cleaner and is a great improvement. Thanks for the PR, really appreciate it!

aedocw commented 4 months ago

One thing - can you bump the version in setup.py to 1.2.2? That's all this needs before I'm ready to merge :)

prydom commented 4 months ago

Done and rebased onto main.

aedocw / epub2tts-edge

Retrieve epub documents in spine order #30