coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Segmentation fault caused by fontforge - how to fix #718

Open strum3nt opened 7 years ago

strum3nt commented 7 years ago

I built this on an AWS EC2 instance using (effectively) the bash script here https://gist.github.com/rajeevkannav/d07f822e209a22d07176

When I tried running pdf2htmlEX on a PDF file, I got a segmentation fault.

I rebuilt fontforge from coolwanglu's fork https://github.com/coolwanglu/fontforge/tree/pdf2htmlEX. It now works fine.

brho commented 6 years ago

It seems that newer versions of fontforge don't support the old API that pdf2htmlEX is using. My system has fontforge-20170731 (Gentoo, via the overlay ebuild). When I build, I get a bunch of implicit function declaration warnings, e.g.:

src/pdf2htmlEX/src/util/ffw.c:57:5: warning: implicit declaration of function 'InitSimpleStuff' [-Wimplicit-function-declaration]
     InitSimpleStuff();
     ^

One of these warnings leads to

src/pdf2htmlEX/src/util/ffw.c:303:17: warning: passing argument 1 of 'strcopy' makes pointer from integer without a cast [-Wint-conversion]
src/pdf2htmlEX/src/util/ffw.c:41:15: note: expected 'const char *' but argument is of type 'int'

which is the source of the segfault. The backtrace of the segfault I had was:

#0 0x00007ffff60f0e36 in strlen () from /lib64/libc.so.6
#1 0x00007ffff60f0abe in strdup () from /lib64/libc.so.6
#2 0x000000000044cebe in strcopy ()
#3 0x000000000044d693 in ffw_add_empty_char ()
#4 0x000000000044059e in pdf2htmlEX::HTMLRenderer::embed_font(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, GfxFont*, pdf2htmlEX::FontInfo&, bool) ()
#5 0x0000000000441707 in pdf2htmlEX::HTMLRenderer::install_embedded_font(GfxFont*, pdf2htmlEX::FontInfo&) ()
#6 0x0000000000442198 in pdf2htmlEX::HTMLRenderer::install_font(GfxFont*) ()
#7 0x0000000000445995 in pdf2htmlEX::HTMLRenderer::check_state_change(GfxState*) ()
#8 0x000000000044689c in pdf2htmlEX::HTMLRenderer::drawString(GfxState*, GooString*) ()
#9 0x00007ffff7a68020 in Gfx::doShowText(GooString*) () from /usr/lib64/libpoppler.so.68
#10 0x00007ffff7a6870d in Gfx::opShowSpaceText(Object*, int) () from /usr/lib64/libpoppler.so.68
#11 0x00007ffff7a60459 in Gfx::go(bool) () from /usr/lib64/libpoppler.so.68
#12 0x00007ffff7a60930 in Gfx::display(Object*, bool) () from /usr/lib64/libpoppler.so.68
#13 0x00007ffff7aa8d55 in Page::displaySlice(OutputDev*, double, double, int, bool, bool, int, int, int, int, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*, bool) () from /usr/lib64/libpoppler.so.68
#14 0x00007ffff7aa8fb8 in Page::display(OutputDev*, double, double, int, bool, bool, bool, bool (*)(void*), void*, bool (*)(Annot*, void*), void*, bool) () from /usr/lib64/libpoppler.so.68
#15 0x000000000043994d in pdf2htmlEX::HTMLRenderer::process(PDFDoc*) ()
#16 0x000000000042056f in main ()

Anyway, I'm not familiar with fontforge, its history, or its APIs. Maybe the right fix is to convert pdf2htmlEX to use the more recent interfaces? If not, I'd either bundle the old fontforge version with your package or otherwise make it clear that newer versions of fontforge won't work. I get this problem both with the Gentoo ebuild and when building from source (as you'd expect).
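For readers wondering how a mere warning turns into a crash in strlen: when a function has no visible prototype, older C compilers assume it returns int, so on a 64-bit system a returned pointer is cut down to 32 bits and then widened again before being passed on (here, to strcopy). The sketch below only simulates that round trip with plain integers; the helper name and the sample addresses are made up for illustration and are not taken from pdf2htmlEX or fontforge. It also assumes the usual GCC/Clang behavior of wrapping on the unsigned-to-signed conversion.

```c
#include <stdint.h>

/*
 * Simulate what the caller ends up with when a pointer-returning function
 * was implicitly declared as returning int on a 64-bit platform:
 * only the low 32 bits survive, and they are sign-extended when the int
 * is converted back to a pointer-sized value.
 * (Hypothetical helper, for illustration only.)
 */
static uint64_t pointer_seen_through_int(uint64_t real_pointer)
{
    /* Truncate to 32 bits, as an int return value would. The conversion of
       an out-of-range value to int32_t is implementation-defined; GCC and
       Clang wrap modulo 2^32, which is what this sketch relies on. */
    int32_t as_int = (int32_t)(uint32_t)real_pointer;

    /* Widen back to 64 bits the way an int-to-pointer conversion does. */
    return (uint64_t)(int64_t)as_int;
}
```

With a made-up address such as 0x00007ffff7a12340, the value that comes back no longer matches the original, so the callee dereferences an address that was never allocated. Small addresses that fit in 31 bits survive unchanged, which is why this kind of bug can appear to work on some systems and crash on others.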

akhuettel commented 6 years ago

Well, the fix is as simple as it is evil.

https://github.com/akhuettel/pdf2htmlEX/commit/46160f0c83b02a6fd7f23c40b7fab612f913f8c4

pdf2htmlEX requires an interface that fontforge no longer exports in its headers. That's why the messages about implicit function declarations pop up. Also, something in this interface changed, so the function call (in pdf2htmlEX) and the implementation (in fontforge) no longer match. That's why it crashes at runtime.

The "fix" is to copy fontforge internal headers into pdf2htmlEX and include them. However, the resulting binary will likely only work with the version of fontforge that provided the headers...
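One way to at least catch this class of problem at build time rather than at runtime, sketched here as an assumption and not as part of the linked commit, is to promote the implicit-declaration warning to an error in the affected source file. GCC and Clang both accept the diagnostic pragma used below. The functions make_glyph_name and glyph_name_length are hypothetical stand-ins, not fontforge or pdf2htmlEX APIs; they only show that a call compiles cleanly once a real prototype is in scope.

```c
/* Promote the warning from this thread to a hard error for this file.
   GCC and Clang both honor this pragma. */
#pragma GCC diagnostic error "-Wimplicit-function-declaration"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a library function that returns a heap string.
   Its definition doubles as a prototype, so callers see the char * return
   type instead of an assumed int. */
static char *make_glyph_name(unsigned code)
{
    char buf[32];
    snprintf(buf, sizeof buf, "uni%04X", code);

    size_t len = strlen(buf) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, buf, len);
    return copy; /* caller frees */
}

/* Hypothetical caller: because the prototype above is visible, the returned
   pointer is passed to strlen intact. Calling an undeclared function here
   would now stop the build instead of producing a warning. */
static size_t glyph_name_length(unsigned code)
{
    char *name = make_glyph_name(code);
    size_t n = (name != NULL) ? strlen(name) : 0;
    free(name);
    return n;
}
```

This does not make a mismatched fontforge version work; it only turns the silent pointer truncation into a compile failure, which makes the incompatibility visible before anything is installed.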