Closed coolwanglu closed 11 years ago
+1, reducing the number of <div>s and <span>s will be a huge boost to performance.
Could sub/superscripts use CSS vertical-align with a length?
Oh I didn't know it can take a length as the value. I've just checked the CSS standard, seems to be better than relative positioning. Issue updated. Thanks!
Yes, I agree. Reducing the number of divs means browser reflow will be faster.
I just discovered the project and would love to get involved as I worked on similar stuff a year ago. Just a quick hint (maybe you already know this): to fix the issue of WebKit and decimals not being taken into account for letter-spacing for instance => you can multiply all your values by X then use a CSS transform to scale down by a factor of X and then the decimals do work
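A minimal sketch of the scaling trick described above, assuming pixel units and a hypothetical helper name `scaled_css` (this is not pdf2htmlEX's actual code): every length is multiplied by a factor so fractional values become whole numbers, and a CSS transform scales the element back down by the same factor.

```python
def scaled_css(letter_spacing_px: float, font_size_px: float, factor: int = 4) -> str:
    """Build an inline style whose lengths are multiplied by `factor`.

    The transform scales the element back down by 1/factor, so the
    rendered size is unchanged while sub-pixel spacing survives rounding.
    `factor` and the property set are illustrative assumptions.
    """
    inverse = 1 / factor
    return (
        f"font-size:{font_size_px * factor:g}px;"
        f"letter-spacing:{letter_spacing_px * factor:g}px;"
        f"transform:scale({inverse:g});"
        "transform-origin:0 0;"
    )


# A 0.25px letter-spacing becomes a whole pixel at 4x scale.
print(scaled_css(0.25, 12.0))
```

Anchoring `transform-origin` at the top-left corner keeps the element's position stable when it is scaled down.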
@iclems Thanks for the message.
Actually the scaling trick has been in there since a very early version.
There are still some issues marked as 'need solution' that I have not been able to solve. Maybe you could share some of your thoughts?
Thanks! I've been having a look at the project today and I'm now getting familiar with the way things are done. Meeting my old friend Poppler again... I remember having thought about how to properly optimize the background image, how to get a fast enough conversion, etc. Here is a good example of a small PDF (just 1.7 MB) that is very slow to convert and very big once converted: http://clement.wehrung.free.fr/scaling.pdf
I'll probably be able to start focusing on some specific issues next week by the way, do you have any priority list ?
OK, Thanks for the PDF. I'll take a look tomorrow.
I'm now trying to reduce the number of <div> for positional shifts, by filling in space characters or adjusting word-space.
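As a rough illustration of the word-space idea (a sketch under assumptions, not the converter's actual algorithm): if every gap between consecutive words on a line is the same, one CSS word-spacing value can stand in for all the per-word shift elements. The function name, the `(x_start, width)` input shape, and the tolerance are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def word_spacing_for_line(
    words: Sequence[Tuple[float, float]],
    space_width: float,
    tol: float = 0.01,
) -> Optional[float]:
    """Return one word-spacing value that reproduces every gap, or None.

    `words` holds (x_start, width) per word, left to right.
    `space_width` is the natural advance of the space glyph.
    None means the gaps differ, so explicit shifts are still needed.
    """
    gaps = [nxt[0] - (cur[0] + cur[1]) for cur, nxt in zip(words, words[1:])]
    if not gaps:
        return 0.0
    if max(gaps) - min(gaps) > tol:
        return None
    # CSS word-spacing is added on top of the normal space advance.
    return gaps[0] - space_width


print(word_spacing_for_line([(0, 10), (13, 10), (26, 5)], space_width=2.5))
```

Lines with uneven gaps fall back to `None`, which is where filling in space characters or explicit shift elements would still be required.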
I think you can just pick any one you find interesting. I'd recommend #39, which is serious and doable for now. I'm not sure if you are familiar with clipping paths; I have no experience with them at all.
I'd like to explain the codebase and discuss possible solutions with you. Thanks!
Hi @iclems, I have a similar background - familiarity with Poppler, and now starting to make some small contributions to pdf2htmlEX. I'm actually working on #39 at the moment, rather slowly.
This issue - reducing the number of divs - is in my opinion one of the most important because of the impact on performance. I'd recommend trying out some of your typical PDFs and seeing if any features you care about are missing - that's how I ended up adding stroked text.
I've finished the optimization of word-space, and letter-space will follow soon.
As I tested in Chrome, this optimization would bring about 10% performance gain.
10% performance gain
Would that be DOM memory, HTML file size, or frame rate?
Oh, it was the time for parsing and rendering the entire document (with lazy rendering disabled)
Ok. Btw - I think you should keep the un-optimized text generation mode, and have a flag --optimize-text which is 1 by default, for debugging.
Have you tried looking at the DOM memory in Chrome's Task Manager?
Right, I'll add it.
No, let me do a comparison of the optimized and not-optimized versions
Hi @jahewson
I have a few concerns for now, and will try to start thinking on how I could contribute today :
Comparison with demo.pdf. It is a scientific paper, which should benefit from the optimization most.
_yes is with optimization and _no is without.
Loading time: about 2s for _yes and about 2.7s for _no
@jahewson what does proportional memory (the last column) mean?
@iclems
Indeed pdf2htmlEX is very slow converting your sample PDF. There are too many pages for it. I've just checked pdftohtml from poppler, which is able to process the same file very fast. I'll try to find the cause.
One possible solution is to use multiple threads, since rendering the background image of each page is independent of the others. Fortunately, poppler has become thread-safe as of a recent version.
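The per-page independence could be exploited roughly like this. It is only a sketch: `render_page` is a hypothetical stand-in for rasterizing one page's background and makes no real poppler call, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def render_page(page_no: int) -> str:
    """Hypothetical stand-in for rasterizing one page's background image."""
    return f"bg{page_no}.png"


def render_all(page_count: int, workers: int = 4) -> List[str]:
    """Render every page on a thread pool and return results in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even if pages finish out of order.
        return list(pool.map(render_page, range(1, page_count + 1)))


print(render_all(3))
```

Because pages do not depend on each other, no locking is needed beyond whatever the rendering library itself requires.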
The visibility test is indeed even harder than #39, where we may simply estimate the clipping path as a rectangle. I've been thinking about this, but have no good idea so far. Maybe we could estimate each object by its bounding box, and test the visibility in the preprocessor.
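The bounding-box estimate could look something like the sketch below. All names are hypothetical, and it assumes the caller already knows which opaque objects are drawn after a given piece of text. It is deliberately conservative: text is only treated as hidden when a single later box covers it entirely.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def covers(top: Box, below: Box) -> bool:
    """True if `top` fully contains `below`."""
    return (
        top.x0 <= below.x0
        and top.y0 <= below.y0
        and top.x1 >= below.x1
        and top.y1 >= below.y1
    )


def is_visible(text: Box, later_opaque: Sequence[Box]) -> bool:
    """Text stays visible unless one later opaque box covers it completely.

    Partial overlap, or coverage by a union of several boxes, is not
    detected, so this can only under-report hidden text.
    """
    return not any(covers(obj, text) for obj in later_opaque)


print(is_visible(Box(10, 10, 50, 20), [Box(0, 0, 100, 100)]))
```

Erring on the side of keeping text avoids dropping content when the bounding box overestimates the real shape of the covering object.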
About cutting the background image. That should be intuitive and useful, how did you do that?
Actually I've tried to dump every image object in PDF and put them directly into HTML. But it did not work due to clipping paths, also there may be other drawing objects. I also tried to at least detect "if there is anything on the background", (there is a bg_integrate branch, which has not been maintained for a while), which did not work well either, since a simple header/footer will make the background nonempty.
In the bg_integrated path, I also attempted to employ SVG for the visibility issue, but it turned out to be too complicated for me. Crocdoc seems to support rendering in SVG now. I never succeeded in viewing those though, as they always froze my browsers.
@coolwanglu
Thanks for the long reply :) Could I have your mail to send you a link to some source ?
I think visibility test is not the #1 priority. Most probably :
1) fixing the background issue which both increases the generation time and makes the page weight much bigger than required (best would be to be able to put the background color in CSS and have a "per image" absolute positioning / otherwise, a quick compare would help to reduce the file size as most probably a lot of background images will just be the same if it's only about the background color...)
2) improving generation speed (which may be improved a lot by item 1),
3) testing fonts (at that time, I had a lot of pain with specific font issues),
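The "quick compare" in item 1 could be as simple as hashing each page's background and storing identical ones once. This is a sketch with hypothetical names; it only catches byte-identical files, so it assumes the image encoder produces the same bytes for the same pixels.

```python
import hashlib
from typing import Dict, List, Tuple


def dedupe_backgrounds(pages: List[bytes]) -> Tuple[List[bytes], List[int]]:
    """Return (unique_images, mapping), where mapping[i] indexes page i's image."""
    index_by_digest: Dict[str, int] = {}
    unique: List[bytes] = []
    mapping: List[int] = []
    for data in pages:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in index_by_digest:
            # First time this image is seen: store it once.
            index_by_digest[digest] = len(unique)
            unique.append(data)
        mapping.append(index_by_digest[digest])
    return unique, mapping


unique, mapping = dedupe_backgrounds([b"plain", b"photo", b"plain"])
print(len(unique), mapping)
```

Each page then references a shared image by index, so a document whose pages share one plain background stores that background once.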
@jahewson what does proportional memory (the last column) mean?
@coolwanglu the columns should be:
The most important value is Resident, which is the first column. So you're seeing a 23% reduction in RAM with your optimizations - great! (93MB -> 72MB)
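For reference, the reduction quoted above follows from the two Resident values; the helper name below is just for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Resident memory: 93MB without the optimization, 72MB with it.
print(round(percent_reduction(93, 72), 1))
```

The result rounds to the 23% figure given in the comment.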
@iclems My email is available in README
@iclems, yep these are tricky issues:
- "z-index" issue : the eternal issue with the approach of placing everything that is not text in the background appears once you have "elements" hiding text (a lot of designers do it in InDesign => they just hide some text elements with a white square, or put an image on top of it and never remove the text behind) = it's quite complicated to find a solution to this issue, as it would involve checking the "visibility" of each object
It could be done by sending all the drawing commands to a polygon clipper, and pruning any text which gets drawn over (where the text rectangle intersects the drawing polygon). It's a very big job.
Alternatively, if each drawing command was rendered to a separate transparent PNG image, then the problem goes away, as does the problem below.
- reducing the background-image issue : part of the poppler speed issue is due IMO to the rasterizing of the big background-image. In some cases, I have noticed that a non transparent background color can lead to generating one big image for each page. Do you know anything about this ? Do you consider it an issue ? I had as well been working a year ago on a custom approach trying to cut the background image in non-empty smaller images which would just be positioned absolutely. What do you think ?
[...] best would be to be able to put the background color in CSS and have a "per image" absolute positioning
"per image" absolute positioning, for image objects that's fine, but what about paths? These would need to be rendered into separate images, it could be done.
The simplest approach might be to keep track of the min/max x and y values used for drawing, and crop the background to that size.
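A minimal sketch of that min/max tracking, with a hypothetical class name and assuming each drawing command can report the points it touches:

```python
from math import inf
from typing import Optional, Tuple


class DrawBounds:
    """Accumulate the extent of everything drawn on a page."""

    def __init__(self) -> None:
        self.x0 = self.y0 = inf
        self.x1 = self.y1 = -inf

    def add(self, x: float, y: float) -> None:
        """Extend the bounds to include one drawn point."""
        self.x0, self.x1 = min(self.x0, x), max(self.x1, x)
        self.y0, self.y1 = min(self.y0, y), max(self.y1, y)

    def crop_box(self) -> Optional[Tuple[float, float, float, float]]:
        """Return (x0, y0, x1, y1), or None if nothing was drawn."""
        if self.x0 > self.x1:
            return None
        return (self.x0, self.y0, self.x1, self.y1)


bounds = DrawBounds()
for point in [(10, 20), (5, 40), (30, 25)]:
    bounds.add(*point)
print(bounds.crop_box())
```

A `None` result means the page has no background drawing at all, so no image needs to be emitted for it. Stroke width and curve control points would need extra care in a real implementation.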
@jahewson I wonder if per-path images would introduce too much overhead. For example, why do people use CSS sprites? I think maybe we need some clustering algorithms.
About the polygon clipper, do you know any light-weight geometry libraries, for example CGAL?
Also, image objects can be clipped, and thus cannot be directly dumped and inserted into HTML.
I wonder if per-path images would introduce too much overhead.
There's only one way to find out...
For example, why do people use CSS sprites?
Because they look good on retina displays, and scale well with zoom. I don't think that size or overhead are the reasons people choose CSS sprites.
About the polygon clipper, do you know any light-weight geometry libraries, for example CGAL?
Looks great, but seems that bezier curves are not supported.
Bezier curves might be used in clipping paths and drawing objects.
hmm..
@iclems Further tests suggest that pdftohtml was not so fast.
Previously I was not using the -c parameter, so images were not processed carefully.
With the -c parameter, the speed of pdftohtml is similar to that of pdf2htmlEX (with the same scaling).
I guess this is the best poppler can do (with current parameters).
Just realized that #64 is about visibility test
The first item seems not to be able to bring performance improvements. Probably the only good thing about it is that it would possibly prevent vertical overlapping caused by rounded font sizes by the browsers, which never happened to me.
I've created HTMLTextPage which allows future optimizations, but the rest part seems to be dull to me.
The last 2 items have been implemented and indeed improve the performance.
Crocdoc is (once again) a good one to learn from
display:block and proper margin-top values
margin-top classes than y axis
top and relative positioning
vertical-align seems to be better