coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

HTML optimization #104

Closed coolwanglu closed 11 years ago

coolwanglu commented 11 years ago

Crocodoc is (once again) a good one to learn from.

jahewson commented 11 years ago

+1, reducing the number of <div>s and <span>s will be a huge boost to performance.

jahewson commented 11 years ago

Could sub/superscripts use CSS vertical-align with a length?

coolwanglu commented 11 years ago

Oh, I didn't know it could take a length as the value. I've just checked the CSS standard, and it seems better than relative positioning. Issue updated. Thanks!
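
For illustration, here is a rough C++ sketch (a hypothetical helper, not the actual pdf2htmlEX code) of emitting a baseline shift as a `vertical-align` length instead of wrapping the text in a relatively positioned element:

```cpp
// Minimal sketch: emit a sub/superscript as a span whose baseline shift is
// expressed directly with vertical-align, so no extra positioned <div> is
// needed. make_shift_span and the pixel values are made up for the example.
#include <iostream>
#include <sstream>
#include <string>

std::string make_shift_span(double shift_px, const std::string &text) {
    std::ostringstream out;
    // A positive length raises the baseline (superscript-like),
    // a negative one lowers it (subscript-like).
    out << "<span style=\"vertical-align:" << shift_px << "px\">"
        << text << "</span>";
    return out.str();
}

int main() {
    std::cout << make_shift_span(3.5, "2") << "\n";
    std::cout << make_shift_span(-2.0, "i") << "\n";
}
```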

Hengjie commented 11 years ago

Yes, I agree; reducing the number of divs will make browser reflows faster.

iclems commented 11 years ago

I just discovered the project and would love to get involved, as I worked on similar stuff a year ago. Just a quick hint (maybe you already know this): to work around WebKit ignoring decimal values, for letter-spacing for instance, you can multiply all your values by a factor X and then use a CSS transform to scale back down by X; the decimals then take effect.
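
To make the trick concrete, a small sketch of the CSS it produces (hypothetical values; written as C++ only because the converter emits its CSS from C++):

```cpp
// Rough sketch of the scale-down trick described above (hypothetical helper,
// not actual project code): multiply the font size and the fractional
// letter-spacing by a factor X, then shrink the element back with a CSS
// transform so the fraction survives WebKit's rounding.
#include <cstdio>

int main() {
    const double factor = 20.0;          // assumed up-scaling factor X
    const double font_size = 12.0;       // px
    const double letter_spacing = 0.37;  // px, would otherwise be rounded away
    std::printf(".t { font-size:%.2fpx; letter-spacing:%.2fpx; "
                "transform:scale(%.4f); transform-origin:0 0; "
                "-webkit-transform:scale(%.4f); "
                "-webkit-transform-origin:0 0; }\n",
                font_size * factor, letter_spacing * factor,
                1.0 / factor, 1.0 / factor);
}
```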

coolwanglu commented 11 years ago

@iclems Thanks for the message.

Actually, the scaling trick has been in there since a very early version.

There are still some issues marked as 'need solution' for which I have not been able to figure out solutions. Perhaps you could share some of your thoughts?

iclems commented 11 years ago

Thanks! I've been taking a look at the project today and I'm getting familiar with the way things are done. Meeting my old friend Poppler again... I remember thinking about how to properly optimize the background image, keep the conversion fast enough, etc. Here is a good example of a small PDF that is very slow to convert and very large once converted (though only 1.7 MB as a PDF): http://clement.wehrung.free.fr/scaling.pdf

By the way, I'll probably be able to start focusing on some specific issues next week; do you have a priority list?

coolwanglu commented 11 years ago

OK, thanks for the PDF. I'll take a look tomorrow.

I'm now trying to reduce the number of <div>s used for positional shifts, by filling in space characters or adjusting word-spacing.
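
As an illustration of that idea (hypothetical names and values, not the actual implementation): when the horizontal gap between two pieces of text is close to the width of a space character, a plain space plus a small `word-spacing` adjustment can absorb the offset instead of opening a new positioned element.

```cpp
// Sketch only: absorb a small horizontal gap with a space character whose
// word-spacing makes up the residual offset, rather than emitting a
// separately positioned <div>. join_with_space is a made-up helper.
#include <iostream>
#include <sstream>
#include <string>

std::string join_with_space(double gap_px, double space_width_px) {
    std::ostringstream out;
    double residual = gap_px - space_width_px;
    // The residual offset is expressed as word-spacing on the space itself.
    out << "<span style=\"word-spacing:" << residual << "px\"> </span>";
    return out.str();
}

int main() {
    // A 5.2px gap with a 4px-wide space becomes a space plus 1.2px of
    // word-spacing instead of an extra offset element.
    std::cout << join_with_space(5.2, 4.0) << "\n";
}
```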

I think you can just pick any issue you find interesting. I'd recommend #39, which is serious and doable for now. I'm not sure whether you are familiar with clipping paths; I have no experience with them at all.

I'd be happy to explain the codebase and discuss possible solutions with you. Thanks!

jahewson commented 11 years ago

Hi @iclems, I have a similar background - familiarity with Poppler, and now starting to make some small contributions to pdf2htmlEX. I'm actually working on #39 at the moment, rather slowly.

This issue - reducing the number of divs - is in my opinion one of the most important because of the impact on performance. I'd recommend trying out some of your typical PDFs and seeing if any features you care about are missing - that's how I ended up adding stroked text.

coolwanglu commented 11 years ago

I've finished the word-spacing optimization, and letter-spacing will follow soon. In my tests in Chrome, this optimization brings about a 10% performance gain.

jahewson commented 11 years ago

10% performance gain

Would that be DOM memory, HTML file size, or frame rate?

coolwanglu commented 11 years ago

Oh, it was the time for parsing and rendering the entire document (with lazy rendering disabled).

jahewson commented 11 years ago

OK. By the way, I think you should keep the un-optimized text generation mode for debugging, behind a flag --optimize-text which is 1 by default.

jahewson commented 11 years ago

Have you tried looking at the DOM memory in Chrome's Task Manager?

coolwanglu commented 11 years ago

Right, I'll add it.

coolwanglu commented 11 years ago

No, let me do a comparison of the optimized and un-optimized versions.

iclems commented 11 years ago

Hi @jahewson

I have a few concerns for now, and will try to start thinking today about how I could contribute:

coolwanglu commented 11 years ago

Here's a comparison using demo.pdf. It is a scientific paper, which should benefit from the optimization the most.

_yes is with optimization and _no is without.

[Screenshot omitted: Chrome Task Manager memory comparison of the _yes and _no versions]

loading time: about 2s for _yes and about 2.7s for _no

coolwanglu commented 11 years ago

@jahewson what does proportional memory (the last column) mean?

coolwanglu commented 11 years ago

@iclems

Indeed, pdf2htmlEX is very slow at converting your sample PDF; there are too many pages for it. I've just checked pdftohtml from Poppler, which is able to process the same file very quickly. I'll try to find the cause.

One possible solution is to use multiple threads, since rendering the background image of each page is independent of the others. Fortunately, Poppler has become thread-safe as of a recent version.
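
A rough sketch of that idea (assuming a thread-safe Poppler build; render_page_background is a hypothetical stand-in for the real rasterization call):

```cpp
// Split the pages across a few worker threads, since each page's background
// is independent of the others.
#include <thread>
#include <vector>

// Hypothetical: rasterize the background of one page. In real code this
// would call into Poppler with a per-thread output device.
void render_page_background(int page_no) {
    (void)page_no;  // stub for the sketch
}

void render_all_backgrounds(int page_count, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        // Striped partition: thread t renders pages t, t + n_threads, ...
        workers.emplace_back([=]() {
            for (int p = t; p < page_count; p += n_threads)
                render_page_background(p);
        });
    }
    for (auto &w : workers)
        w.join();
}

int main() {
    render_all_backgrounds(/*page_count=*/200, /*n_threads=*/4);
}
```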

The visibility test is indeed even harder than #39, where we can simply approximate the clipping path as a rectangle. I've been thinking about this but have no good idea so far. Maybe we could approximate each object by its bounding box and test visibility in the preprocessor.

About cutting up the background image: that sounds intuitive and useful; how did you do it?

Actually, I've tried dumping every image object in the PDF and putting them directly into the HTML, but it did not work due to clipping paths, and there may be other drawing objects as well. I also tried to at least detect whether there is anything on the background (there is a bg_integrate branch, which has not been maintained for a while), but that did not work well either, since a simple header/footer makes the background non-empty.

In the bg_integrate branch, I also attempted to use SVG for the visibility issue, but it turned out to be too complicated for me. Crocodoc seems to support rendering to SVG now, though I never succeeded in viewing the results; they always froze my browsers.

iclems commented 11 years ago

@coolwanglu

Thanks for the long reply :) Could I have your email so I can send you a link to some source code?

I think the visibility test is not the #1 priority. The priorities are most probably:

1. fixing the background issue, which both increases generation time and makes the page weight much bigger than required (best would be to put the background color in CSS and use "per image" absolute positioning; otherwise, a quick comparison would help reduce the file size, since a lot of the background images will probably be identical if they only contain the background color);
2. improving generation speed (which may be largely improved by item 1);
3. testing fonts (at the time, I ran into a lot of pain with specific font issues).

jahewson commented 11 years ago

@jahewson what does proportional memory (the last column) mean?

@coolwanglu the columns should be:

The most important value is Resident, which is the first column. So you're seeing a 23% reduction in RAM with your optimizations (93 MB -> 72 MB) - great!

coolwanglu commented 11 years ago

@iclems My email address is available in the README.

jahewson commented 11 years ago

@iclems, yep, these are tricky issues:

  • The "z-index" issue: the eternal problem with the approach of placing everything that is not text in the background arises once you have elements hiding text (a lot of designers do this in InDesign: they just hide some text elements behind a white square, or put an image on top and never remove the text underneath). It's quite complicated to find a solution, as it would involve checking the "visibility" of each object.

It could be done by sending all the drawing commands to a polygon clipper, and pruning any text which gets drawn over (where the text rectangle intersects the drawing polygon). It's a very big job.

Alternatively, if each drawing command was rendered to a separate transparent PNG image, then the problem goes away, as does the problem below.

  • Reducing the background-image issue: part of the Poppler speed issue is, IMO, due to rasterizing the big background image. In some cases, I have noticed that a non-transparent background color can lead to generating one big image for each page. Do you know anything about this? Do you consider it an issue? A year ago I was also working on a custom approach that cut the background image into non-empty smaller images, which were simply positioned absolutely. What do you think?

[...] best would be to be able to put the background color in CSS and have a "per image" absolute positioning

"Per image" absolute positioning is fine for image objects, but what about paths? These would need to be rendered into separate images; it could be done.

The simplest approach might be to keep track of the min/max x and y values used for drawing, and crop the background to that size.
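
Sketched out (with made-up structure names), that min/max tracking could look like this; the page's background <img> would then be positioned at (xmin, ymin) instead of the page origin:

```cpp
// While replaying the page's drawing commands, accumulate the min/max x/y
// actually touched, then crop the rasterized background to that rectangle.
#include <algorithm>
#include <limits>

struct BBox {
    double xmin = std::numeric_limits<double>::max();
    double ymin = std::numeric_limits<double>::max();
    double xmax = std::numeric_limits<double>::lowest();
    double ymax = std::numeric_limits<double>::lowest();

    // Called for every point touched by a non-text drawing operation.
    void extend(double x, double y) {
        xmin = std::min(xmin, x); ymin = std::min(ymin, y);
        xmax = std::max(xmax, x); ymax = std::max(ymax, y);
    }
    // True when the page background contains no drawing at all.
    bool empty() const { return xmin > xmax || ymin > ymax; }
};
```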

coolwanglu commented 11 years ago

@jahewson I wonder whether per-path images would introduce too much overhead. For example, why do people use CSS sprites? I think maybe we need some clustering algorithm.

About the polygon clipper: do you know of any lightweight geometry libraries, CGAL for example?

Also, image objects can be clipped too, and thus cannot be directly dumped and inserted into the HTML.

jahewson commented 11 years ago

I wonder if per-path images would introduce too many overhead.

There's only one way to find out...

For example, why people use CSS sprites?

Because they look good on retina displays, and scale well with zoom. I don't think that size or overhead are the reasons people choose CSS sprites.

jahewson commented 11 years ago

About polygon clipper, do you know any light-weight geometry libraries, for example CGAL?

http://www.angusj.com/delphi/clipper.php
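
A hedged sketch of how Clipper could serve the visibility test discussed above, assuming the 6.x C++ API (clipper.hpp): subtract the opaque drawing polygons from a text rectangle and prune the text if nothing remains.

```cpp
#include "clipper.hpp"

using namespace ClipperLib;

// Clipper works on integer coordinates, so the caller is assumed to have
// scaled PDF units up by a fixed factor before building these paths.
bool text_rect_fully_covered(const Path &text_rect, const Paths &opaque_shapes) {
    Clipper c;
    c.AddPath(text_rect, ptSubject, true);
    c.AddPaths(opaque_shapes, ptClip, true);

    Paths visible_remainder;
    // Difference = parts of the text rectangle not covered by any shape.
    c.Execute(ctDifference, visible_remainder, pftNonZero, pftNonZero);
    return visible_remainder.empty();  // nothing of the rect peeks through
}
```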

coolwanglu commented 11 years ago

Looks great, but it seems that Bézier curves are not supported.

Bézier curves might be used in clipping paths and drawing objects.

hmm..
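
One common workaround (just a sketch, not project code) is to flatten each cubic Bézier into short line segments before handing the path to a polygon-only clipper:

```cpp
#include <vector>

struct Pt { double x, y; };

// Sample a cubic Bézier (p0..p3) at n evenly spaced parameter values and
// append the resulting polyline points (excluding p0) to `out`.
void flatten_cubic(const Pt &p0, const Pt &p1, const Pt &p2, const Pt &p3,
                   int n, std::vector<Pt> &out) {
    for (int i = 1; i <= n; ++i) {
        double t = static_cast<double>(i) / n, u = 1.0 - t;
        double b0 = u * u * u, b1 = 3 * u * u * t;
        double b2 = 3 * u * t * t, b3 = t * t * t;
        out.push_back({b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
                       b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y});
    }
}
```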

coolwanglu commented 11 years ago

@iclems Further tests suggest that pdftohtml is not so fast after all.

Previously I was not using the -c parameter, so images were not processed carefully. With the -c parameter, the speed of pdftohtml is similar to that of pdf2htmlEX (with the same scaling).

I guess this is the best Poppler can do (with the current parameters).

coolwanglu commented 11 years ago

Just realized that #64 is about the visibility test.

coolwanglu commented 11 years ago

The first item does not seem to bring any performance improvement. Probably the only good thing about it is that it might prevent vertical overlapping caused by browsers rounding font sizes, which has never happened to me.

I've created HTMLTextPage, which allows future optimizations, but the remaining part seems dull to me.

The last two items have been implemented and indeed improve performance.