coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Trying to Convert Generated HTML to a Fixed Layout EPUB 3 #624

Open YenForYang opened 8 years ago

YenForYang commented 8 years ago

I realize that pdf2htmlex can't make reflowable text from PDFs atm, but I've been wanting to generate ebooks in some ereader format from PDFs with the help of pdf2htmlex and/or some other tools. For now, a fixed-layout EPUB looks like the only semi-viable option that can hopefully satisfy my needs temporarily.

Basically, I've been trying to figure out a way to configure pdf2htmlex to generate xhtml documents that can be imported easily into an EPUB3 (at the moment I've been trying to get it to work with Sigil). I've been looking at Eric Dodémont's book "Fixed Layout ePub: A Practical Guide to Publish eBooks from PDF Files". (I actually have a PDF of it, so if @coolwanglu or anyone else wants to take a look, I can perhaps send a copy; some basic info/introduction about the topic is here.) The book (and the webpage, I guess) has some useful but rather basic information on getting pdf2htmlex to generate the files. The methods are essentially outlined as:

Method 2.1: one XHTML file per page with: • one JPG or PNG bitmap image including the bitmap and vector images; • one layer of visible text on top of the image (accessible text); • the font files.

Method 2.2: one XHTML file per page with: • one SVG file including the bitmap and vector images; • one layer of visible text on top of the image (accessible text); • the font files.

Method 2.3 (1): one XHTML file per page with: • one SVG file including the bitmap images, the vector images, and the text and fonts.

(note that this is considered one method)

The book mentions, however, that the pages generated by pdf2htmlex require a good deal of further editing. (The editing is not really automated at all; I have found workarounds, but perhaps pdf2htmlex can be configured to not require it?) Basically, the split *.page files generated by pdf2htmlex have to be prepended and appended with some html code so that each .page can act as a separate xhtml page (pretty much like the main html page, but with each page having its own page-container etc.). Also, the HTML viewer that pdf2htmlEX produces and integrates into the result has to be removed. I'm not quite sure how to do this: do I edit the manifest file in the share directory in some way? (I'm not sure what I can and cannot touch in that file, or what to edit out.) Other things that won't be used include:

  1. the pdf2htmlex.min.js and compatibility.min.js (I think they aren't needed, but need confirmation as I am not entirely sure what these do)
  2. fancy.min.css (if possible, this might be useful to keep, but the book says not to keep it for some reason--what is in this file that makes it "fancy"?)
  3. the logo png file
  4. the main html file (some code in it is used to wrap the individual xhtml pages, and some code in it may be useful for the content.opf and table of contents perhaps)
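The wrapping step described above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions, not the book's or pdf2htmlEX's method: the function name `wrap_page`, the default stylesheet names (taken from the manifest example in this thread) and the pixel sizes are hypothetical, and the .page fragment is assumed to already be well-formed XML. The viewport meta element is what fixed-layout EPUB 3 pages use to declare their size.

```python
from xml.sax.saxutils import escape

# Skeleton for one standalone fixed-layout page. The element names are
# standard XHTML; the "page-container" id mirrors the wrapper mentioned above.
XHTML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width={width}, height={height}"/>
<title>{title}</title>
{links}
</head>
<body>
<div id="page-container">
{fragment}
</div>
</body>
</html>
"""


def wrap_page(fragment, title, width, height,
              stylesheets=("base.min.css", "mybook.css")):
    """Wrap one .page fragment so it can stand alone as an XHTML file.

    `stylesheets` are hypothetical file names; width/height are the page
    size in CSS pixels.
    """
    links = "\n".join(
        '<link rel="stylesheet" type="text/css" href="{}"/>'.format(
            escape(href, {'"': "&quot;"}))
        for href in stylesheets
    )
    return XHTML_TEMPLATE.format(
        width=int(width),
        height=int(height),
        title=escape(title),
        links=links,
        fragment=fragment.strip(),
    )
```

Calling `wrap_page(open("mybook1.page").read(), "Page 1", 800, 1200)` would return the text to save as `mybook1.xhtml` (both file names are illustrative). The viewer script and fancy.min.css are simply not referenced, which is one way to leave them out.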

One thing that I really, really want to figure out is how to integrate the .outline file into the epub. I cannot figure out how to get the links from the outline to work (I renamed .outline to .html and clicked the links but, sadly, they did not work). If I could use the .outline to create a linked table of contents (in other words, to create the nav.xhtml file), it would save SO much time. There has to be some way to modify it to get it working with epubs, right? Perhaps the page ids/numbers need to be changed to work with separate xhtml pages? The nav.xhtml file basically has to end up looking like this (which is different from the .outline file):

<?xml version='1.0' encoding='UTF-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">

<head>
    <title>The Best Book in the World</title>
</head>

<body>
    <nav epub:type="toc" id="toc">
        <ol>
            <li>
                <a href="mybook3.xhtml">Title page</a>
            </li>
            <li>
                <a href="mybook4.xhtml">Copyright page</a>
            </li>
            <li>
                <a href="mybook5.xhtml">Chapter 1</a>
            </li>
            <li>
                <a href="mybook7.xhtml">Chapter 2</a>
            </li>
        </ol>
    </nav>
    <nav epub:type="landmarks">
        <ol>
            <li>
                <a epub:type="cover" href="mybook1.xhtml">Cover</a>
            </li>
            <li>
                <a epub:type="bodymatter" href="mybook2.xhtml">Bodymatter</a>
            </li>
        </ol>
    </nav>
    <nav epub:type="page-list" hidden="">
        <ol>
            <li>
                <a href="mybook1.xhtml">1</a>
            </li>
            <li>
                <a href="mybook2.xhtml">2</a>
            </li>
            <li>
                <a href="mybook3.xhtml">3</a>
            </li>
            <li>
                <a href="mybook4.xhtml">4</a>
            </li>
            <li>
                <a href="mybook5.xhtml">5</a>
            </li>
            <li>
                <a href="mybook6.xhtml">6</a>
            </li>
            <li>
                <a href="mybook7.xhtml">7</a>
            </li>
            <li>
                <a href="mybook8.xhtml">8</a>
            </li>
            <li>
                <a href="mybook9.xhtml">9</a>
            </li>
            <li>
                <a href="mybook10.xhtml">10</a>
            </li>
        </ol>
    </nav>
</body>

</html>

It's probably important to note here that the hrefs all point to a specific .page/.xhtml file, whereas the outline file has #pf... hrefs. Also, the code doesn't have to be unpacked/decompressed/beautified like I have above (this is true for pretty much all of the code).
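The href difference could be bridged with a small rewrite pass over the outline. The sketch below is an assumption-laden illustration, not something pdf2htmlEX provides: it assumes the outline links look like `href="#pf<N>"` with the page number in lowercase hexadecimal (the same numbering that would make `bga.jpg` the background of page 10), and the `mybook<N>.xhtml` naming is just the one used in the example above.

```python
import re

# Matches outline links of the assumed form href="#pf1", href="#pfa", ...
_PF_HREF = re.compile(r'href="#pf([0-9a-f]+)"')


def outline_to_page_links(outline_html, prefix="mybook"):
    """Point each in-document #pf link at a per-page XHTML file instead."""
    def repl(match):
        page_number = int(match.group(1), 16)  # hex page id -> decimal page
        return 'href="{}{}.xhtml"'.format(prefix, page_number)

    return _PF_HREF.sub(repl, outline_html)
```

Any other attributes on the links are left untouched, so the result would still need to be placed inside a `<nav epub:type="toc">` element by hand or by another script.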

Also, a question about using svg images: I think the pdf2htmlex generator creates v1.2 svg, but epub3 requires v1.1. I don't know for sure, but according to the book I can safely change the 1.2 value to 1.1 with a sed command without any problems. Are there any problems with declaring 1.1 instead of 1.2? If there are, let me know if there are any workarounds.
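For reference, the substitution the book does with sed could look roughly like this in Python. It is a sketch under the assumption that the version is declared as `version="1.2"` on the root `<svg>` element; it only relabels the file and does nothing about any SVG 1.2 features that might actually be used inside it.

```python
import re

# Only the version attribute on the first <svg ...> start tag is targeted.
_SVG_VERSION = re.compile(r'(<svg\b[^>]*?\bversion=")1\.2(")')


def downgrade_svg_version(svg_text):
    """Relabel an SVG 1.2 root element as 1.1; other content is unchanged."""
    return _SVG_VERSION.sub(r"\g<1>1.1\g<2>", svg_text, count=1)
```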

I'm also having trouble getting the content.opf file to work (it's the file containing the references to all the files with their types (manifest) and the page order (spine)). I believe it is this file that is causing the outline/nav.xhtml/table of contents to not work. The overall opf file has to look something like this:

<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="pub-id" version="3.0">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">12345 6789</dc:identifier>
    <dc:title>The Best Book in the World</dc:title>
    <dc:creator>John Doe</dc:creator>
    <dc:publisher>Big Bang Editions</dc:publisher>
    <dc:language>en</dc:language>
    <dc:description>This book is about ebooks.</dc:description>
    <meta content="cover_image" name="cover" />
    <meta property="dcterms:modified">2014-05-22T12:00:00Z</meta>
    <meta property="rendition:layout">pre-paginated</meta>
    <meta property="rendition:orientation">auto</meta>
    <meta property="rendition:spread">auto</meta>
</metadata>
<manifest>
    <item id="page1" href="mybook1.xhtml" media-type="application/xhtml+xml" />
    <item id="page2" href="mybook2.xhtml" media-type="application/xhtml+xml" />
    <item id="page3" href="mybook3.xhtml" media-type="application/xhtml+xml" />
    <item id="page4" href="mybook4.xhtml" media-type="application/xhtml+xml" />
    <item id="page5" href="mybook5.xhtml" media-type="application/xhtml+xml" />
    <item id="page6" href="mybook6.xhtml" media-type="application/xhtml+xml" />
    <item id="page7" href="mybook7.xhtml" media-type="application/xhtml+xml" />
    <item id="page8" href="mybook8.xhtml" media-type="application/xhtml+xml" />
    <item id="page9" href="mybook9.xhtml" media-type="application/xhtml+xml" />
    <item id="page10" href="mybook10.xhtml" media-type="application/xhtml+xml" />
    <item id="image-page1" href="bg1.jpg" media-type="image/jpeg" />
    <item id="image-page5" href="bg5.jpg" media-type="image/jpeg" />
    <item id="image-page8" href="bg8.jpg" media-type="image/jpeg" />
    <item id="image-page9" href="bg9.jpg" media-type="image/jpeg" />
    <item id="image-page10" href="bga.jpg" media-type="image/jpeg" />
    <item id="font1" href="f1.woff" media-type="application/font-woff" />
    <item id="font2" href="f2.woff" media-type="application/font-woff" />
    <item id="font3" href="f3.woff" media-type="application/font-woff" />
    <item id="font4" href="f4.woff" media-type="application/font-woff" />
    <item id="font5" href="f5.woff" media-type="application/font-woff" />
    <item id="font6" href="f6.woff" media-type="application/font-woff" />
    <item id="font7" href="f7.woff" media-type="application/font-woff" />
    <item id="font8" href="f8.woff" media-type="application/font-woff" />
    <item id="base-min-css" href="base.min.css" media-type="text/css" />
    <item id="mybook-css" href="mybook.css" media-type="text/css" />
    <item id="cover_image" href="cover.jpg" media-type="image/jpeg" properties="cover-image" />
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav" />
</manifest>
<spine>
    <itemref idref="page1" properties="page-spread-right" />
    <itemref idref="page2" properties="page-spread-left" />
    <itemref idref="page3" properties="page-spread-right" />
    <itemref idref="page4" properties="page-spread-left" />
    <itemref idref="page5" properties="page-spread-right" />
    <itemref idref="page6" properties="page-spread-left" />
    <itemref idref="page7" properties="page-spread-right" />
    <itemref idref="page8" properties="page-spread-left" />
    <itemref idref="page9" properties="page-spread-right" />
    <itemref idref="page10" properties="page-spread-left" />
</spine>
<guide>
    <reference type="cover" title="Cover" href="p1.xhtml" />
    <reference type="text" title="Text" href="p2.xhtml" />
</guide>
</package>
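Since the page entries in the manifest and spine are completely regular, they are a natural thing to generate instead of typing. A minimal Python sketch follows; the function name, the `mybook<N>.xhtml` naming and the default of starting on a right-hand page are assumptions copied from the example above, and images, fonts, CSS and the nav item would still have to be added separately.

```python
from xml.sax.saxutils import quoteattr


def build_manifest_and_spine(page_count, prefix="mybook", first_side="right"):
    """Return (manifest_xml, spine_xml) covering only the page documents."""
    if first_side == "right":
        sides = ("right", "left")
    else:
        sides = ("left", "right")
    items, itemrefs = [], []
    for n in range(1, page_count + 1):
        page_id = "page{}".format(n)
        href = "{}{}.xhtml".format(prefix, n)
        items.append(
            '<item id={} href={} media-type="application/xhtml+xml" />'.format(
                quoteattr(page_id), quoteattr(href)))
        # Alternate spread sides page by page, as in the example spine.
        itemrefs.append(
            '<itemref idref={} properties="page-spread-{}" />'.format(
                quoteattr(page_id), sides[(n - 1) % 2]))
    manifest = "<manifest>\n    " + "\n    ".join(items) + "\n</manifest>"
    spine = "<spine>\n    " + "\n    ".join(itemrefs) + "\n</spine>"
    return manifest, spine
```

Generating both lists from the same loop keeps every `idref` in the spine matched to an `id` in the manifest, which is one of the things a validator checks.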

Also, I believe I have to edit base.min.css so that it does not include this at all:

;unicode-bidi:bidi-override

and I believe it actually appears twice in the base.min.css file. Is there any problem with doing this? I think not, but I just want to make sure...
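Removing the declaration is easy to script. Here is a minimal Python sketch, assuming the declaration appears in minified CSS as `unicode-bidi:bidi-override` with an optional neighbouring semicolon; it removes every occurrence and does not check whether any text in the book depends on it.

```python
import re

# First alternative: declaration preceded by ';' (middle or end of a rule).
# Second alternative: declaration at the start of a rule, with its own ';'.
_BIDI = re.compile(
    r";\s*unicode-bidi\s*:\s*bidi-override"
    r"|unicode-bidi\s*:\s*bidi-override\s*;?"
)


def strip_unicode_bidi(css_text):
    """Drop every unicode-bidi:bidi-override declaration from a stylesheet."""
    return _BIDI.sub("", css_text)
```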

Luckily, that is all, haha. Those are all the issues, I believe... I'm hoping that there are files I can modify so that the conversion/generation process automatically resolves the things above.

duanyao commented 8 years ago

Converting the output of pdf2htmlex to epub3 is definitely doable (we actually did that). Of course, you have to do some programming to generate the navigation document and the OPF.

Which epub readers are you targeting? I believe most epub3 readers are based on modern browser engines, and SVG produced by pdf2htmlex (via cairo) should be viewable in browsers. Modern browsers support a mixture of SVG 1.1 and SVG 1.2 Tiny, by the way.

I think unicode-bidi:bidi-override is related to right-to-left scripts (see #595). Why must it be removed?

YenForYang commented 8 years ago

I've read that (this might be outdated by now, as the book was published more than a year ago)

...This is because the official ePub version 3 specifications indicate that "the direction and unicode-bidi properties must not be included in an EPUB Stylesheet. Authors should use appropriate HTML5 markup to express directionality information instead." If you do not remove it, an error will appear during the ePub file validation step (epubcheck).

So basically I think it isn't much of a deal, other than the fact that it won't be "validated" as a real epub, I guess.

And I guess I'm not really targeting any specific device, but I am definitely trying to get from the generated EPUB (once I get there) to an AZW3 file (which is basically just an Amazon-wrapped EPUB) that I can use at least on my Kindle Voyage. I think Kindlegen/Kindle Previewer does accept html files, but I've already tried feeding it the pdf2htmlex-generated html files directly, and I basically got errors and failure when trying to generate the basic MOBI.

duanyao commented 8 years ago

It seems the latest EPUB3 spec still forbids unicode-bidi. However, if major epub3 readers accept unicode-bidi, I think it is not a big problem to violate the spec. If the generated EPUB must pass a validator, you may dynamically insert the unicode-bidi rule via JS.

anteprandium commented 8 years ago

@duanyao Would you care to comment on how you converted to fixed layout EPUB3? Would you share code and/or scripts?

I am very interested, for making PDF notes available to my students as EPUBs.

duanyao commented 8 years ago

@anteprandium I did convert the output of pdf2htmlex to epub3; however, it is not fixed layout epub3. In my opinion, fixed layout epub3 has poor page transition performance because each page is an xhtml file, and it cannot be easily viewed in browsers. Instead, I put all pages into one xhtml file and implemented custom page transition and fullscreen in JS, so these features are independent of epub readers and the result can be viewed directly in browsers.

The converting tool I wrote is JS code running in the browser. It parses the output xhtml of pdf2htmlex, generates the OPF, navigation and content files, and packages them into an epub using zip.js. The output files of pdf2htmlex are also packaged as a zip before converting. This tool is part of a commercial software product and is not publicly available yet, sorry. If you don't have special requirements, maybe using sigil to import the output of pdf2htmlEX is easier, as mentioned in the original post.
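For anyone wanting to reproduce only the packaging step outside the browser, here is a rough Python sketch using the standard `zipfile` module in place of zip.js. It is not the tool described above. The `OEBPS/content.opf` location is an assumed convention; what the EPUB container format does require is that the `mimetype` entry comes first and is stored uncompressed, and that `META-INF/container.xml` points at the OPF.

```python
import io
import zipfile

# Points the reading system at the package document (path is an assumption).
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(files):
    """Return EPUB bytes; `files` maps archive paths to bytes or str content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry must be first and must not be compressed.
        archive.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for name in sorted(files):
            archive.writestr(name, files[name],
                             compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()
```

The returned bytes can be written to a `.epub` file and then checked with a validator such as epubcheck.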

The pdf2htmlEX params I used look like this (hope it helps):

#!/bin/sh
pdf2htmlEX --fit-width 800 --hdpi 200 --vdpi 200 --font-size-multiplier 10 --heps 0.5 --veps 0.5 \
  --css-filename content.css --embed cfijO --bg-format svg --svg-node-count-limit 1000 \
  --svg-embed-bitmap 0 --correct-text-visibility 1 --dest-dir "$1.sout" "$1" content.xhtml

anteprandium commented 8 years ago

@duanyao Thanks a lot for the answer. I agree that fixed layout is far from optimal. However, I have a very special use-case in mind: publishing university-level mathematical courses. We do make PDFs available (CC), and in principle epub3 should be able to deal with maths via MathML. Unfortunately, most epub readers simply don't. It's also worth mentioning that every produced file should use the same source, so as to avoid duplication of effort. That's why LaTeX source files (with customised styles) + pdf2htmlex would make me happy.

So, coming back to the topic, I'll try to import into Sigil and go from there. I take it the real problem is making the pdf2htmlex output acceptable to Sigil.

Thanks for the command line tips, it's a good starting point.

RNCTX commented 7 years ago

I have been putting together some shell scripts for just this purpose over the last couple of days.

As written, mine takes any PDF input, prompts the user for the desired viewport size based on their reader's resolution, and gets the other input flags from the source file via ImageMagick math (scan resolution, original image size, etc). It then converts with SVG output and modifies the HTML/.page files into the required XHTML files. Finally, it prompts the user for a title and searches ISBNdb via its API to see if there is existing metadata to be used for the content.opf, and verifies a match with the user based on year of publication. ISBNdb has free access to their API for up to 500 hits a day, so anyone can use it. I'm doing this on OSX, which, by virtue of having an 8-year-old version of bash and BSD sed (lacking most modern regex options), should result in a very portable script. I have a FreeBSD NAS and an Ubuntu virtualbox to test with, so I will confirm usability on those as well.

Left to do: a loop through the generated files for the rest of the content.opf generation, and a separate script with a loop that prompts the user for successive chapter names and page numbers for nav.xhtml TOC generation.

As you can probably guess based on the above my use case is also academic. For literature study purposes we have some very unwieldy texts based on their size (literature/criticism anthologies commonly run upwards of 3000 pages). The ones that are available in PDF format are all fine and good, but a PDF that size tends to crash tablet readers due to memory requirements. While commercial converters such as Abbyy Finereader do an admirable job of producing epubs from most pdfs, they also process the input in RAM for the most part and will crash on a 3000 page document. A shell script that generates the files as it goes, page by page, is therefore preferable.

I may also integrate the latest version of tesseract in this for OCR. You should check it out if you haven't: it does a fantastic job of identifying old/dirty scans and can output a PDF with a hidden text layer. I think the combination of tesseract v4 and pdf2htmlEX could be the "holy grail" of page-accurate ebook generation; it's just a matter of scripting the process so that non-programmers can use the tools. Unfortunately, the makers of other utilities for doing this stuff are quick to get into political debates about the merits (or lack thereof) of book formatting. The needs of entertainment reading and academic reading are quite different, and since the academic world will not be giving up page-based citations any time soon, "just let it reflow" is not viable.

anteprandium commented 7 years ago

@RNCTX That sounds great! I, for one, would be interested in your scripts, if you would share.

RNCTX commented 7 years ago

I will do so when I get a working example, maybe a week or two more. The process will be three parts...

1) user viewport prompt, pdf2htmlEX processing, pdf2htmlEX output organization/conversion

2) metadata scraping based on user input

3) toc generation based on user input of successive chapter titles/pages, or a pre-formatted list with equivalent data.
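As an illustration of part 3 only, this is roughly what TOC generation from a list of chapter titles and page numbers could look like in Python. It is a hypothetical sketch, not the script being described (which is a shell script); the function name and the `mybook<N>.xhtml` file naming are assumptions borrowed from the example earlier in the thread.

```python
from xml.sax.saxutils import escape


def build_toc_nav(chapters, prefix="mybook"):
    """Build the <nav epub:type="toc"> element from (title, page) pairs."""
    lines = ['<nav epub:type="toc" id="toc">', "    <ol>"]
    for title, page in chapters:
        lines.append("        <li>")
        lines.append('            <a href="{}{}.xhtml">{}</a>'.format(
            prefix, int(page), escape(title)))
        lines.append("        </li>")
    lines.extend(["    </ol>", "</nav>"])
    return "\n".join(lines)
```

For example, `build_toc_nav([("Title page", 3), ("Chapter 1", 5)])` yields two list items linking to `mybook3.xhtml` and `mybook5.xhtml`; the result still has to be placed inside the body of nav.xhtml.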

RNCTX commented 7 years ago

https://github.com/RNCTX/PDF2HTMLEX-EPUB3FIXED