atomic14 / diy-esp32-epub-reader

An ESP32 e-reader
MIT License

Parse the `application/x-dtbncx+xml` file and generate a table of contents #46

Closed martinberlin closed 2 years ago

martinberlin commented 2 years ago

This would be a very interesting option if it can be done. It is what other ePub readers show as "Contents". My best guess is that the chapters are listed in the manifest:

<manifest>
    <item href="Text/Introduccion.xhtml" id="Introduccion.xhtml" media-type="application/xhtml+xml"/>
    <item href="Text/Capitulo1.xhtml" id="Capitulo1.xhtml" media-type="application/xhtml+xml"/>
    <item href="Text/Capitulo2.xhtml" id="Capitulo2.xhtml" media-type="application/xhtml+xml"/>
    <item href="Text/Capitulo3.xhtml" id="Capitulo3.xhtml" media-type="application/xhtml+xml"/>
</manifest>

The tricky part is that the title is not there, so I guess each XHTML file would have to be read in order to extract it. As a result it should render:

Introduction, Title of chapter 1, etc.

As in the book index, going up and down would draw a selection rectangle, and SELECT would open that XHTML file directly so you can jump straight to a section. This would improve usability a lot, since you would no longer always have to start from the beginning. But I know it is a pretty tricky one to get working.

In addition, I'm happy about #28 but not entirely satisfied with it. The page state should be saved in non-volatile storage. Even a key-value store in the NVS would do a much better job than saving it like this, since it's lost on every reset.

martinberlin commented 2 years ago

@cgreening in my opinion this is the highest-priority issue to resolve if we want an ePub reader that is taken seriously. Equally important are saving state in the file system and #26, so that we can reset the device and still keep reading where we left off.

Even if we make it super nice, not being able to do these simple tasks makes it an experimental reader rather than one you can actually use to read a book.

cgreening commented 2 years ago

We're already parsing the manifest and the spine sections of the XML - these bits are what tell us the order to play the book in. The spine lists the items in the order they should be shown and the manifest maps from item id to actual content.

The file that we could use is the toc item specified by the spine. This points to an item that has the table of contents - this looks pretty straightforward to parse and would let you generate a contents page - assuming all the books follow this standard.

manifest.xml

<?xml version="1.0"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">

  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Pride and Prejudice</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="BookId" opf:scheme="ISBN">123456789X</dc:identifier>
    <dc:creator opf:file-as="Austen, Jane" opf:role="aut">Jane Austen</dc:creator>
  </metadata>

  <manifest>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="appendix" href="appendix.xhtml" media-type="application/xhtml+xml"/>
    <item id="stylesheet" href="style.css" media-type="text/css"/>
    <item id="ch1-pic" href="ch1-pic.png" media-type="image/png"/>
    <item id="myfont" href="css/myfont.otf" media-type="application/x-font-opentype"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>

  <spine toc="ncx">
    <itemref idref="chapter1" />
    <itemref idref="appendix" />
  </spine>

  <guide>
    <reference type="loi" title="List Of Illustrations" href="appendix.xhtml#figures" />
  </guide>

</package>
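
The lookup described above (spine `toc` attribute, then the manifest item with that id, then its `href`) can be sketched as follows. This is a minimal illustration in Python, not the project's code: the function name `find_toc_path` and the `opf_path` parameter are hypothetical, and it assumes the `href` is relative to the directory that holds the package file.

```python
import posixpath
import xml.etree.ElementTree as ET

# Namespace of the package document, as declared in the sample above.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_toc_path(opf_xml, opf_path):
    """Return the archive path of the NCX named by <spine toc="...">, or None.

    opf_xml  -- contents of the package file (str or bytes)
    opf_path -- path of the package file inside the archive (assumed input)
    """
    root = ET.fromstring(opf_xml)
    spine = root.find("opf:spine", OPF_NS)
    if spine is None or spine.get("toc") is None:
        return None
    toc_id = spine.get("toc")
    # The manifest maps item ids to actual files.
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        if item.get("id") == toc_id and item.get("href") is not None:
            base = posixpath.dirname(opf_path)
            return posixpath.normpath(posixpath.join(base, item.get("href")))
    return None
```

Resolving the name through the manifest, rather than hard-coding `toc.ncx`, keeps this working for books that use a different file name for the table of contents.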

toc.ncx

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN"
"http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">

<ncx version="2005-1" xml:lang="en" xmlns="http://www.daisy.org/z3986/2005/ncx/">

  <head>
<!-- The following four metadata items are required for all NCX documents,
including those that conform to the relaxed constraints of OPS 2.0 -->

    <meta name="dtb:uid" content="123456789X"/> <!-- same as in .opf -->
    <meta name="dtb:depth" content="1"/> <!-- 1 or higher -->
    <meta name="dtb:totalPageCount" content="0"/> <!-- must be 0 -->
    <meta name="dtb:maxPageNumber" content="0"/> <!-- must be 0 -->
  </head>

  <docTitle>
    <text>Pride and Prejudice</text>
  </docTitle>

  <docAuthor>
    <text>Austen, Jane</text>
  </docAuthor>

  <navMap>
    <navPoint class="chapter" id="chapter1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="chapter1.xhtml"/>
    </navPoint>
  </navMap>

</ncx>

martinberlin commented 2 years ago

Yes, exactly. We need to parse that file and extract the navMap.navPoint.navLabel.text content to get the title and the src attribute of navMap.navPoint.content to get the destination. I haven't yet had a book list large enough to see how you deal with lists that are longer than the screen, but the same technique as in the book list should apply here. By the way, please note that some books, like "Dr. Jekyll and Mr. Hyde", have weird content src notations such as:

<content src="@public@vhost@g@gutenberg@html@files@43@43-h@43-h-2.htm.html#pgepubid00003"/>

I guess that does not matter as long as the file matches that name. It's just a pity that it is such an ugly construction for what should simply be a filename. I think the anchor pgepubid00003 can be ignored since it's just the id of the body tag in the HTML.
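
To make the extraction concrete, here is a rough Python sketch of reading the navMap. It is only an illustration under stated assumptions: `parse_nav_map` is a made-up name, nested navPoints are simply flattened in document order, and the `#anchor` part is split off and returned separately so the caller can ignore it.

```python
import xml.etree.ElementTree as ET

# Namespace of the NCX document, as declared in the toc.ncx sample.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def parse_nav_map(ncx_xml):
    """Return a list of (title, file, anchor) tuples in document order."""
    root = ET.fromstring(ncx_xml)
    entries = []
    # ".//" also matches nested navPoints; they are flattened here.
    for point in root.findall(".//ncx:navPoint", NCX_NS):
        label = point.find("ncx:navLabel/ncx:text", NCX_NS)
        content = point.find("ncx:content", NCX_NS)
        if content is None or content.get("src") is None:
            continue  # no destination, nothing to open
        title = (label.text or "").strip() if label is not None else ""
        # Split "file#anchor"; anchor is "" when there is no "#".
        file, _, anchor = content.get("src").partition("#")
        entries.append((title, file, anchor))
    return entries
```

The file part is kept exactly as written in the NCX, so an odd name like the Gutenberg one above still matches as long as the archive entry has the same name.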

cgreening commented 2 years ago

Sounds good. Agreed, should be similar to how we do the book list.
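
For a contents list longer than the screen, one simple option is to page the list and draw only the page that holds the selected entry. The sketch below is in Python and is an assumption for illustration only: how the existing book list paginates is not shown in this thread, and `visible_window`, `move_selection`, and the wrap-around behaviour are all hypothetical.

```python
def visible_window(selected, total, per_page):
    """Return the range of entry indices on the page containing `selected`."""
    if total <= 0 or per_page <= 0:
        return range(0)
    # Clamp the selection so a stale index cannot point past the list.
    selected = max(0, min(selected, total - 1))
    start = (selected // per_page) * per_page
    return range(start, min(start + per_page, total))


def move_selection(selected, delta, total):
    """Move the selection by delta (+1 down, -1 up), wrapping at the ends."""
    if total <= 0:
        return 0
    return (selected + delta) % total
```

With 12 entries and 5 per page, selecting entry 7 would draw entries 5 to 9, and pressing down on the last entry would wrap back to the first.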

martinberlin commented 2 years ago

I will try to start on this even though I’ve no idea how to do it, but it will be a nice introduction to XML parsing. I guess we should read the location of this file from the manifest, although it seems to be mostly the same: /OEBPS/toc.ncx

<navMap>
  <navPoint id="navPoint-1" playOrder="1">
    <navLabel>
      <text>Cubierta</text>
    </navLabel>
    <content src="Text/cubierta.xhtml"/>
  </navPoint>
</navMap>

UPDATE: It needs to be read from the manifest, since this file is not always called toc, e.g. in the Future Noir book: 9780062852892_toc.ncx