coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Thin lines not being rendered #430

Open davidhedley opened 10 years ago

davidhedley commented 10 years ago

The poppler-splash backend seems to have problems rendering thin lines in PDFs. See http://download.vistair.com/pdf2htmlEX/thinlines.pdf for an example PDF. The output from pdftoppm -png is shown in http://download.vistair.com/pdf2htmlEX/thinlines-splash.png

Interestingly, the poppler-cairo backend does not have this issue (pdftocairo -png output shown in http://download.vistair.com/pdf2htmlEX/thinlines-cairo.png).

Therefore, to fix this issue (apart from reporting it to poppler), there are 2 options:

  1. Switch pdf2htmlEX to use the Cairo backend. This will obviously have other implications, but might be worth investigating.
  2. Fix pdf2htmlEX to ensure a minimum line width before drawing the line.

I have implemented a patch for (2.) which works in this case. In SplashBackgroundRenderer.cc, I have done the following:

void SplashBackgroundRenderer::updateLineWidth(GfxState *state) {
        if (state->getTransformedLineWidth() < 1.0) {
                state->setLineWidth(1.0 / state->transformWidth(1.0));
        }
        SplashOutputDev::updateLineWidth(state);
}

This ensures that the transformed line width is at least 1 unit wide (in device space). This is obviously a bit of a hack, and the fix should really be done at the rasterization stage, but it works well for me and successfully renders the example page.
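The arithmetic behind the patch can be sketched without poppler. This is a minimal illustration, assuming a single uniform scale factor between user space and device space; `clampUserLineWidth` and its parameters are hypothetical names, not part of poppler or pdf2htmlEX.

```cpp
#include <cassert>

// Hypothetical helper, not part of poppler or pdf2htmlEX.
// scale: device pixels per user-space unit (assumed uniform and positive).
// Returns a user-space width whose device-space width is at least minDevicePx.
double clampUserLineWidth(double userWidth, double scale, double minDevicePx = 1.0) {
    assert(scale > 0.0);
    const double deviceWidth = userWidth * scale;  // width after the transform
    if (deviceWidth < minDevicePx) {
        // Same idea as 1.0 / state->transformWidth(1.0) in the patch:
        // pick the user-space width that maps to exactly minDevicePx.
        return minDevicePx / scale;
    }
    return userWidth;  // already wide enough, leave untouched
}
```

With a scale of 2 device pixels per user unit, a 0.1-unit line is widened to 0.5 units (exactly 1 device pixel), while a 3-unit line is returned unchanged.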

duanyao commented 10 years ago

Interesting.

My tests show that pdftocairo works a little better than pdftoppm (without other parameters), but it can still lose thin lines in your PDF.

Actually, pdftoppm has an option -thinlinemode whose value can be none | solid | shape, and solid is exactly what your patch is trying to do. shape seems to use gray levels to represent lines thinner than 1px. However, none is the default, which can lose thin lines. Maybe we want to expose this option to pdf2htmlEX users. @davidhedley, would you like to do this?

The Cairo backend is already used by pdf2htmlEX for SVG background output, so I think it is not hard to do (1). However, pdftocairo doesn't have -thinlinemode or anything similar, so pdftoppm and the splash backend still look superior as far as bitmap output is concerned.

davidhedley commented 10 years ago

I tried pdftocairo 0.24.5 and it lost some lines on the example PDF, but testing with pdftocairo 0.26.3, they were all present so I assumed this issue had been fixed.

I wasn't aware of the thinlinemode option in pdftoppm. That seems like it would be the best solution. Does it really need to be an option, or should we just set it to "shape" by default?

davidhedley commented 10 years ago

And in fact the thinlinemode produces a much better result. So in SplashBackgroundRenderer we just change:

SplashBackgroundRenderer::SplashBackgroundRenderer(const string & imgFormat, HTMLRenderer * html_renderer, const Param & param)
    : SplashOutputDev(splashModeRGB8, 4, gFalse, (SplashColorPtr)(&white), gTrue, gTrue)
    , html_renderer(html_renderer)
    , param(param)
    , format(imgFormat)

to

SplashBackgroundRenderer::SplashBackgroundRenderer(const string & imgFormat, HTMLRenderer * html_renderer, const Param & param)
    : SplashOutputDev(splashModeRGB8, 4, gFalse, (SplashColorPtr)(&white), gTrue, gTrue, splashThinLineShape)
    , html_renderer(html_renderer)
    , param(param)
    , format(imgFormat)

Is there any reason why you would not want to enable thinlinemode?

duanyao commented 10 years ago

I'm OK with having -thinlinemode shape by default, but others may prefer solid. I don't know why pdftoppm makes none the default; maybe it is faster?

My poppler is 0.26.1, which seems a little outdated.

davidhedley commented 10 years ago

Actually the default for Splash is not strictly "none". From SplashTypes.h:

enum SplashThinLineMode {
  splashThinLineDefault,  // if SA on: draw solid if requested line width, transformed into
                          // device space, is less than half a pixel and a shaped line else
  splashThinLineSolid,     // draw line solid at least with 1 pixel
  splashThinLineShape     // draw line shaped at least with 1 pixel
};

So the default behaviour depends on the Stroke Adjustment (SA) setting in the PDF. However, I guess that if SA is off, nothing is done for thin lines and they get dropped, which is not good.

I'm doing some testing now, but it would seem that splashThinLineShape produces good results - much more uniform line weights than the "solid" setting.

duanyao commented 10 years ago

According to the PDF spec:

10.6.4 Scan Conversion Rules ... A shape shall be scan-converted by painting any pixel whose square region intersects the shape, no matter how small the intersection is. This ensures that no shape ever disappears as a result of unfavourable placement relative to the device pixel grid, as might happen with other possible scan conversion rules. The area covered by painted pixels shall always be at least as large as the area of the original shape. This rule applies both to fill operations and to strokes with nonzero width. Zero-width strokes may be done in an implementation-defined manner that may include fewer pixels than the rule implies. ...

It seems the spec doesn't allow any shape to be dropped, regardless of whether "Stroke Adjustment" is on. So it is still a bug in the splash backend.
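To make the quoted rule concrete, here is a one-dimensional sketch of which pixel rows get painted for a horizontal line with a given vertical extent. It contrasts the spec's "any intersection" rule with center sampling, a common alternative used here only for contrast. Both functions and the `RowRange` type are hypothetical, and nothing here is a claim about how Splash actually rasterizes.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration only; not Splash's rasterizer.
struct RowRange {
    int first;  // index of the first painted pixel row
    int count;  // number of painted rows (0 means the line disappears)
};

// "Any intersection" rule: paint every row whose unit square overlaps
// the vertical extent [top, bottom), where bottom > top.
RowRange rowsByIntersection(double top, double bottom) {
    assert(bottom > top);
    const int first = static_cast<int>(std::floor(top));
    const int last = static_cast<int>(std::ceil(bottom)) - 1;
    return {first, last - first + 1};
}

// Center sampling: paint row i only if its center i + 0.5 lies in [top, bottom).
RowRange rowsByCenterSampling(double top, double bottom) {
    assert(bottom > top);
    const int first = static_cast<int>(std::ceil(top - 0.5));
    const int last = static_cast<int>(std::ceil(bottom - 0.5)) - 1;
    const int count = last - first + 1;
    return {first, count > 0 ? count : 0};
}
```

A line covering 3.6 to 3.7 paints one row under the intersection rule but zero rows under center sampling, which is the kind of unfavourable placement the spec text is guarding against. A line covering 3.95 to 4.05 paints two rows under the intersection rule, consistent with the painted area being at least as large as the original shape.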

coolwanglu commented 10 years ago

If there's a line of width 0.5px, and we zoom the PDF file by 2x when converting it to html, will it be 1px or 2px in the output?
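The answer depends on whether the minimum width is enforced before or after the zoom is applied. The sketch below compares the two orders under the simplifying assumption that zoom is a single uniform scale factor; both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helpers for comparing the two orders of operations.

// Enforce the minimum on the unzoomed width, then scale.
double clampThenZoom(double width, double zoom, double minWidth = 1.0) {
    assert(zoom > 0.0);
    return std::max(width, minWidth) * zoom;
}

// Scale first, then enforce the minimum on the final device width.
double zoomThenClamp(double width, double zoom, double minWidth = 1.0) {
    assert(zoom > 0.0);
    return std::max(width * zoom, minWidth);
}
```

For a 0.5px line at 2x zoom, `clampThenZoom` gives 2px and `zoomThenClamp` gives 1px. The updateLineWidth patch earlier in this thread tests the transformed width, so it would behave like `zoomThenClamp` if the zoom factor is part of the transformation applied by getTransformedLineWidth, which is an assumption here rather than something verified.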

I think the best solution would be to create a new option, and set the default value to shape maybe.
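A minimal sketch of how such an option value could be mapped to a mode. The `ThinLineMode` enum is a local stand-in that mirrors the SplashThinLineMode values quoted above, `parseThinLineMode` is a hypothetical helper, and the accepted strings simply follow the none | solid | shape values mentioned for pdftoppm; none of this is existing pdf2htmlEX code.

```cpp
#include <string>

// Local stand-in mirroring the SplashThinLineMode values quoted in this thread.
enum class ThinLineMode { Default, Solid, Shape };

// Hypothetical parser for a thin-line-mode option value.
// Returns true and writes to `out` on success; returns false and leaves
// `out` unchanged if the value is not recognized.
bool parseThinLineMode(const std::string& value, ThinLineMode& out) {
    if (value == "none") {  // assumption: "none" selects Splash's default handling
        out = ThinLineMode::Default;
        return true;
    }
    if (value == "solid") {
        out = ThinLineMode::Solid;
        return true;
    }
    if (value == "shape") {
        out = ThinLineMode::Shape;
        return true;
    }
    return false;
}
```

Initializing the mode to `ThinLineMode::Shape` before parsing would give the suggested default when the user does not pass the option, and the parsed value could then be forwarded as the last SplashOutputDev constructor argument shown earlier.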