coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

All text missing from page #401

Open davidhedley opened 10 years ago

davidhedley commented 10 years ago

All text is missing from the output when converted to HTML.

Test case here: http://download.vistair.com/pdf2htmlEX/Page-1fromTCX-B757-AFM-28A_nowatermark.pdf

Version info:

Copyright 2012-2014 Lu Wang coolwanglu@gmail.com and other contributors
Libraries:
  poppler 0.26.3
  libfontforge 20140801
  cairo 1.13.1
Default data-dir: /usr/local/share/pdf2htmlEX
Supported image format: png jpg svg

coolwanglu commented 10 years ago

This file uses Type 3 fonts, so you need the --process-type3 option. However, it seems that something is wrong with the fonts, and cairo refuses to convert them.

davidhedley commented 10 years ago

As an update to this, I pre-processed that page with Ghostscript first to embed the fonts and clean it up. The test case is here: http://download.vistair.com/pdf2htmlEX/Page-1fromTCX-B757-AFM-28A_nowatermark.processed.pdf
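The exact Ghostscript invocation is not given in this thread. As a rough sketch only, a font-embedding pass through the pdfwrite device could be assembled like this in Python. The helper name gs_embed_fonts_argv and the file names are hypothetical, and the flag selection is an assumption about what "embed fonts and clean it up" meant:

```python
import subprocess


def gs_embed_fonts_argv(src_pdf, dst_pdf):
    """Build a Ghostscript command line that rewrites src_pdf via pdfwrite.

    Assumption: this is one plausible way to embed fonts, not necessarily
    the command used to produce the "processed" test case.
    """
    return [
        "gs",
        "-o", dst_pdf,             # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",       # re-emit the document as a new PDF
        "-dEmbedAllFonts=true",    # ask pdfwrite to embed every font it can
        src_pdf,
    ]


if __name__ == "__main__":
    # Hypothetical file names; requires Ghostscript on PATH.
    subprocess.run(gs_embed_fonts_argv("input.pdf", "input.processed.pdf"), check=True)
```

Building the argument list separately from running it makes it easy to log the exact command alongside a bug report.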

On the original file:

pdf2htmlEX --process-type3 1 Page-1fromTCX-B757-AFM-28A_nowatermark.pdf
Preprocessing: 1/1
Error: Cairo error: invalid matrix (not invertible)

Result: no text on the page.
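For context on the message itself: cairo reports "invalid matrix (not invertible)" when it is given an affine transform whose determinant is zero or not finite. Which matrix in this file triggers it has not been established here; a degenerate Type 3 font matrix is one possibility, and that is only a guess. A minimal Python sketch of the condition, using a hypothetical helper and the PDF-style matrix order [a b c d e f]:

```python
import math


def is_invertible(a, b, c, d, e=0.0, f=0.0):
    """Return True if the affine matrix [a b c d e f] can be inverted.

    The transform maps (x, y) to (a*x + c*y + e, b*x + d*y + f).
    Only the linear part matters: the translation (e, f) never affects
    invertibility, so it is accepted but ignored.
    """
    det = a * d - b * c
    return math.isfinite(det) and det != 0.0


if __name__ == "__main__":
    # A common Type 3 font matrix: scales 1000 glyph units to 1 text unit.
    print(is_invertible(0.001, 0, 0, 0.001))
    # An all-zero matrix collapses every point to the origin.
    print(is_invertible(0, 0, 0, 0))
    # Linearly dependent columns collapse the plane onto a line.
    print(is_invertible(1, 2, 2, 4))
```

A font whose matrix fails this check cannot be mapped back from device space to glyph space, which would be consistent with cairo refusing to render it.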

When I try the "cleaned" PDF, I get:

pdf2htmlEX --process-type3 1 Page-1fromTCX-B757-AFM-28A_nowatermark.processed.pdf
Preprocessing: 1/1
zsh: segmentation fault (core dumped)  pdf2htmlEX --process-type3 1 Page-1fromTCX-B757-AFM-28A_nowatermark.processed.pdf

On a different Linux system, pdf2htmlEX hangs forever instead of dumping core.
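Since the same input crashes on one machine and hangs on another, a small wrapper that enforces a timeout and reports how the process ended can make the two cases easier to compare. The following is a sketch in Python; the helper name classify_run and the 60-second default are arbitrary choices, not part of pdf2htmlEX:

```python
import signal
import subprocess


def classify_run(argv, timeout_s=60):
    """Run argv and describe how it ended: ok, exit code, signal, or hang."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return "hang"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from that signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    if proc.returncode == 0:
        return "ok"
    return "exit %d" % proc.returncode


if __name__ == "__main__":
    # Command line taken from the report above; requires pdf2htmlEX on PATH.
    cmd = ["pdf2htmlEX", "--process-type3", "1",
           "Page-1fromTCX-B757-AFM-28A_nowatermark.processed.pdf"]
    print(classify_run(cmd))
```

On the first system described above this would be expected to report a SIGSEGV, and on the second a hang once the timeout expires.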