Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License

Performance issue on long documents #578

Open fcaldas opened 6 years ago

fcaldas commented 6 years ago

Hi, I am working with long reports (~100 pages) and in some cases the rendering time is a bit overwhelming. At first I thought the issue was related to loading images (about 4 images per page), but after some debugging I noticed the code is extremely CPU intensive.

I did some profiling of the report using Python's profile module on the latest version, and I got this:

```
 ncalls tottime percall cumtime percall  filename:lineno(function)
1517881  67.733       0  69.799       0  text.py:667(iter_lines)
 146796  13.871       0 176.298   0.001  text.py:912(split_first_line)
 258624  10.651       0  58.431       0  text.py:828(create_layout)
 258630   8.032       0  28.841       0  text.py:623(__init__)
 696704   7.967       0  16.909       0  text.py:676(set_text)
1771847   6.502       0  12.326       0  api.py:225(new)
 518259   5.534       0   5.534       0  __init__.py:50(__call__)
1732555   5.067       0    11.3       0  :1()
2605233   4.607       0   4.607       0  text.py:864(add_attr)
1221166   4.506       0  15.512       0  text.py:559(unicode_to_char_p)
2608573   3.498       0   3.498       0  :0(find)
 258720   2.875       0   6.886       0  surfaces.py:785(__init__)
2036614   2.584       0   2.721       0  api.py:163(_typeof)
 146802   2.476       0   9.528       0  text.py:585(first_line_metrics)
1366910   2.303       0   3.689       0  api.py:396(gc)
 547080   2.278       0   6.154       0  text.py:569(get_size)
1771847   2.148       0   2.148       0  :0(newp)
1618829   2.041       0   2.044       0  :0(encode)
2302764    1.87       0    1.87       0  :0(get)
1594209   1.821       0   1.821       0  :0(replace)
  27424   1.549       0  12.643       0  __init__.py:105()
 258857   1.412       0   3.148       0  surfaces.py:130(__init__)
1366910   1.386       0   1.386       0  :0(gcp)
 477599   1.226       0   2.115       0  percentages.py:32(resolve_one_percentage)
 258624   1.189       0   1.189       0  text.py:726(get_font_features)
 258831   1.113       0   2.507       0  context.py:103(_init_pointer)
  39037   1.092       0   2.858       0  __init__.py:477(computed_from_cascaded)
 264768   1.018       0   1.935       0  api.py:276(cast)
  36148   1.015       0   3.573       0  percentages.py:53(resolve_percentages)
 675758   0.945       0   0.945       0  :0(append)
```

It looks like over half the time is spent in text.py. Since I am not familiar with the code, I was wondering if someone could explain a bit more what is happening and whether there is something I can do to improve the performance for this specific report.

Best,

liZe commented 6 years ago

> It looks like over half the time is spent in text.py. Since I am not familiar with the code, I was wondering if someone could explain a bit more what is happening and whether there is something I can do to improve the performance for this specific report.

Most of the time is taken by split_first_line, a function that splits text lines (wow) according to the layout, CSS attributes, fonts, etc.

Our mission is:

  1. not to call this function when it's not needed, and
  2. to make this function faster.

I think that we'll get better results with 1. than with 2., but it depends on what's in your document. If you have a lot of text in blocks whose size depends on their content (for example floats, tables…), then you generally have to split the text twice per block and per page (once to get the minimum size and once to get the real layout), and thus have to call split_first_line many, many times. If you have very long paragraphs and use less common features like hyphenation, overflow-wrap or white-space, then optimizing split_first_line for your use cases may be helpful.
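
To make the "split the text twice" point concrete, here is a rough, simplified sketch of the two-pass behaviour; the function and attribute names are made up, and this is not WeasyPrint's actual layout code:

```python
def layout_content_sized_block(text, available_width, split_lines):
    # Pass 1 (measurement): split at the narrowest possible width to get
    # the minimum (min-content) width, i.e. the widest unbreakable chunk.
    minimum_width = max(
        line.width for line in split_lines(text, max_width=0))
    # Simplification of the real sizing rules: the block takes the
    # available width, but never less than its minimum width.
    used_width = max(minimum_width, available_width)
    # Pass 2 (layout): split the same text again at the final width to get
    # the line boxes that actually end up on the page.
    return split_lines(text, max_width=used_width)
```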

Is it possible to get the HTML+CSS of one of your reports? I'll give you hints about where to start.

fcaldas commented 6 years ago

Thanks @liZe, I am sending the document and CSS/HTML to you by email. The document is composed of a lot of tables, and some of them might wrap over more than one page. It takes about 2 minutes to render on my notebook.

fcaldas commented 6 years ago

Hi @liZe,

It seems like the slow performance is related to the following CSS I am using:

```css
td{ word-wrap: break-word; }
```

Removing it from the report and from the style.css makes it render 4 times faster (under 30s now for 89 pages).

liZe commented 6 years ago

> It seems like the slow performance is related to the following CSS I am using:
>
> td{ word-wrap: break-word; }

Thanks a lot. Using word-wrap/overflow-wrap may lead to specific code paths that are not optimized, as mentioned in my previous comment. I'll find out what the problem is.

> (under 30s now for 89 pages)

That's much better :smile:.

liZe commented 6 years ago

Here's what's happening:

Moreover, there's something strange in Pango that makes WeasyPrint render the Pango layout twice.

The part of the code that can be optimized is here.

Tontyna commented 6 years ago

Just an idea.

Quoting the docs on pango_cairo_create_layout()

> Creates a layout object set up to match the current transformation and target surface of the Cairo context. [...] This function is the most convenient way to use Cairo with Pango, however it is slightly inefficient since it creates a separate PangoContext object for each layout. This might matter in an application that was laying out large amounts of text.

Bold applied by me.

This function is called in split_first_line for each Layout() it requires. Each time I stumbled over Layout.__init__ in the last few days, I wondered why (or whether) a new PangoLayout is necessary for each text snippet, sometimes multiple times.

The relevant code:

```python
# text.py, line 626
cairo_dummy_context = (
    cairo.Context(cairo.ImageSurface(cairo.FORMAT_ARGB32, 1, 1))
    if hinting else cairo.Context(cairo.PDFSurface(None, 1, 1)))
self.layout = ffi.gc(
    pangocairo.pango_cairo_create_layout(ffi.cast(
        'cairo_t *', cairo_dummy_context._pointer)),
    gobject.g_object_unref)
```

looks to me like it's always the same PangoLayout, created from the same cairo_dummy_context. Looks like it could/should be reused.

liZe commented 6 years ago

> Looks like it could/should be reused.

It's hard (that hard?) to use the same layout, as we currently use a different layout for each box and don't rely on the powerful but limited options provided by Pango to split lines and use different font sizes / weights.

But as a first step, it's easy to create the dummy context used to create this layout only once. I've done this in db401ac, and it saves a lot of time. It doesn't break tests and looks harmless.
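
For reference, the idea boils down to something like the following sketch (not the exact db401ac patch; it assumes cairocffi is imported as cairo, as in text.py at the time):

```python
import cairocffi as cairo

# Build the two dummy cairo contexts once, at import time, and reuse them
# for every Layout instead of creating a fresh surface + context each time.
CAIRO_DUMMY_CONTEXTS = {
    # Hinted text is measured against a raster (image) surface...
    True: cairo.Context(cairo.ImageSurface(cairo.FORMAT_ARGB32, 1, 1)),
    # ...while unhinted text is measured against a vector (PDF) surface.
    False: cairo.Context(cairo.PDFSurface(None, 1, 1)),
}

# In Layout.__init__, the per-instance creation then becomes a lookup:
#     cairo_dummy_context = CAIRO_DUMMY_CONTEXTS[hinting]
```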

> Just an idea.

Please share your ideas whenever you want :wink:.

kleptog commented 5 years ago

Hoi, we had a similar performance issue with documents that accidentally contained lines consisting of "words" of about 100 kB each. WeasyPrint spent a lot of time rendering these (more than 1.5 hours; I didn't wait for it to finish).

I agree that split_first_line is the place that needs to be optimised, but I can't quite follow the algorithm, so I'm not sure which bit to tweak. It is clear that it expects words to be short.

Anyway, to work around this we added the following shortcut:

```diff
--- /usr/lib/python3.5/site-packages/weasyprint/text.py.orig    2019-05-08 15:27:39.059923974 +0200
+++ /usr/lib/python3.5/site-packages/weasyprint/text.py 2019-05-08 15:27:07.155805266 +0200
@@ -959,6 +959,12 @@
     if not text_wrap:
         max_width = None
 
+    # Cheat to handle the extreme case of super long words
+    # Basically, prevents weasyprint from running for hours
+    if style['size'][0] and style['font_size']:
+        if len(text) > int(style['size'][0] / style['font_size'] * 2.5):
+            text = text[:int(style['size'][0] / style['font_size'] * 2.5)]
+
     # Step #1: Get a draft layout with the first line
     layout = None
     if (max_width is not None and max_width != float('inf') and
```

I'm not sure this is a good fix for the general case, but at least in our case it reduces the rendering time for these problematic documents to seconds.

liZe commented 5 years ago

> Hoi, we had a similar performance issue with documents that accidentally contained lines consisting of "words" of about 100 kB each.

That's a case that's not handled by our little shortcut here: https://github.com/Kozea/WeasyPrint/blob/840c4a6d923b5c9c97088f7acd1d30e1551cd789/weasyprint/text.py#L989-L998

This optimisation tries to render only the beginning of the text and checks whether the line has been cut. If it has been cut, it means that we're OK with this first line and can keep it. If it's not cut, we cancel the optimisation and render the whole text instead.

It doesn't work for you because long words are never broken, so it renders the whole text (minus the already rendered text) for each line.
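
For readers who don't want to open the linked code, the shortcut works roughly like this (a simplified, hypothetical sketch; the names below are made up):

```python
def first_line_with_shortcut(text, max_width, expected_length, create_layout):
    # Lay out only a short prefix of the text first.
    prefix = text[:expected_length]
    layout = create_layout(prefix, max_width)
    if len(layout.lines) > 1:
        # The prefix was broken: the first line is complete and can be kept.
        return layout.lines[0]
    # The prefix was not broken, so the real first line may need more text:
    # cancel the optimisation and lay out the whole text instead.
    return create_layout(text, max_width).lines[0]
```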

> I'm not sure this is a good fix for the general case, but at least in our case it reduces the rendering time for these problematic documents to seconds.

Unfortunately, it's not a good fix for everyone, for one major reason: you don't render the whole text, meaning that some text can be lost in the operation. For example, if your first line can hold more than 2.5 × (width / font-size) characters, you'll lose the characters after this offset. And even if this lost text is displayed outside of the page boundaries, it has to be rendered, as it is at least included in the PDF metadata.

> I agree that split_first_line is the place that needs to be optimised, but I can't quite follow the algorithm, so I'm not sure which bit to tweak. It is clear that it expects words to be short.

The problem is that we're using Pango's high-level functions to find line cuts. These functions are known to be pretty slow, but they're useful to render whole paragraphs. What's ironic is that we never use them to render paragraphs, as they're too limited for our use case: Pango doesn't handle all the CSS features using its markup language.

A very good explanation about this can be found on Behdad Esfahbod's blog. Using Pango's low-level functions is probably the next step (before we only use HarfBuzz as other browsers do :innocent:). If anyone is interested (and also very, very patient), I can help!

grewn0uille commented 3 years ago

Hello!

If you’re interested in better performance, we created a short survey where you can give a boost to this feature and help us improve WeasyPrint 😉

Vote for it!

(The survey is now closed. Thanks for all your answers! We’ll share the results soon 😉)

aschmitz commented 2 years ago

Other observations from digging through the profiling I did for #1587 for a bit:

I suspect I'd mostly be looking to improve things somewhere inside remake_page, and possibly by seeing if there's anything we can memoize on backgrounds or margin boxes. draw_text is 14% of our weight, and get_first_line takes about 10% total, so if there are obvious things to do on either of those, they could be impactful as well. (I have fantasized about replacing Pango's line breaking with something more purpose-built on HarfBuzz, but I doubt I'll have the time. That said, simply caching get_first_line calls might be of some use.)
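
A rough sketch of the "simply caching get_first_line calls" idea mentioned above (the names and the cache key are hypothetical; a real key would have to capture everything the line layout depends on, such as the style and the available width):

```python
_first_line_cache = {}

def cached_first_line(text, style_key, max_width, get_first_line):
    # Memoize: identical text laid out with the same style and the same
    # available width produces the same first line, so reuse the result.
    key = (text, style_key, max_width)
    if key not in _first_line_cache:
        _first_line_cache[key] = get_first_line(text, style_key, max_width)
    return _first_line_cache[key]
```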

But again, it looks like there won't be many obvious quick wins in a "simple" document. I think there are a few gotchas around multi-column lists that are worth digging into, and they'll probably exist in some other edge cases, but that's more an effort of finding documents that hit them than "general" optimization.

liZe commented 2 years ago

@aschmitz Thanks a lot for finding and sharing these numbers.

Before discussing the details, here is some general information about performance in WeasyPrint, for everybody who is interested in this topic.

WeasyPrint is quite slow by choice and by design, but it doesn’t mean at all that we can’t improve its speed. That’s the major point we’re working on for the next release, as it was one of the important topics for WeasyPrint’s users in the survey last year.

I personally use the samples of WeasyPerf to profile both speed and memory use. I render a document with cProfile, and then use gprof2dot and Graphviz to get a nice SVG graph. There are lots of other tools for that, but that’s my favorite stack (so far)!
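
For anyone who wants to reproduce this kind of profile, here is a minimal sketch of that workflow (the file names are placeholders):

```python
import cProfile

from weasyprint import HTML

# Profile one rendering run and dump the stats to a file.
cProfile.run(
    "HTML('sample.html').write_pdf('sample.pdf')", 'weasyprint.prof')

# The stats file can then be turned into an SVG call graph, for example:
#   gprof2dot -f pstats weasyprint.prof | dot -Tsvg -o weasyprint.svg
```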

The samples are different enough to get very different results. It’s often very upsetting to get a nice 10% improvement with one sample, and then discover that it doesn’t change anything (or worse: you get slower results) for the other samples.

It’s really nice to have these numbers, probably measured with a large document (like in #1587) and a quite small stylesheet.

  • We spend a surprising amount of time (~10% after that optimization) in the __missing__ methods from the CSS module. I know we saved a bunch of time by not pushing down all calculations, but I wonder if there's a more efficient way of handling things. This includes about 1% of total time just in the self[key] = self.parent_style[key] line on AnonymousStyle (not even counting the subsequent time in ComputedStyle). I don't know if there's a good way to optimize this access, but it feels like there ought to be.

That’s a place where we can definitely improve performance. And that’s actually already the result of a large speed improvement, as we now use lazy dictionaries for CSS attributes that calculate the values (often using the values of their parents) on demand. Before that, we were calculating all the CSS properties’ values, even the ones that we don’t use. 10% is huge, but we call this function almost each time a CSS value has to be calculated. One solution would be to call it less often, maybe by pre-calculating some values that will for sure be useful. And of course, we can find tiny optimizations in the function code.
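
To illustrate the lazy-dictionary idea, here is a small sketch with made-up classes and default values, not WeasyPrint's real style objects:

```python
INITIAL_VALUES = {'color': 'black', 'font_size': 16}  # hypothetical defaults

class LazyStyle(dict):
    """A style dict that computes a property only when first requested."""
    def __init__(self, parent=None):
        super().__init__()
        self.parent = parent

    def __missing__(self, key):
        # Inherit from the parent style when there is one, otherwise fall
        # back to the initial value; cache so this runs only once per key.
        if self.parent is not None:
            value = self.parent[key]
        else:
            value = INITIAL_VALUES[key]
        self[key] = value
        return value

root = LazyStyle()
child = LazyStyle(parent=root)
child['font_size']  # computed (and cached) only on first access
```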

  • 5% of time is spent in get_all_computed_styles.

This value can be much higher (+30%) with large stylesheets (for example with Bootstrap, but even with some Sphinx documentation such as WeasyPrint’s own docs). And we could totally improve the styling part when Python is able to provide "real" multithreading :).

The latest CSSSelect2 improvements also give nice results.

  • 40% of time is spent in layout_document -> make_all_pages -> remake_page, which feels a bit heavy.

Contrary to what the function name suggests, remake_page is used to "make" pages, not only to "remake" them. That’s often where most of the time is spent.

  • 8% is in make_margin_boxes, which for this document mostly only changes with the page number. Perhaps there's some optimization / memoization that can be done for other parts?

I haven’t seen that with the samples; there are definitely some improvements waiting to be found.

  • 3% is in layout_backgrounds, which are mostly blank (or one of a small number of other colors / images) and feel like a possible target for some optimization.

I tried to work on that a few weeks ago, but the speed improvements were not worth the code complexity. I’ll probably try again later :).

  • 6% is in adding images (all PNGs), which might be able to be optimized.

#1481 is already really nice, but we could find a better way.

  • 10% is in draw_text, most of that is in draw_first_line.

That’s a function I’ve tried to improve so many times. I regularly find little optimizations, but I’m never happy, as I feel I’ve not found THE solution. I have to admit that currently most of the time is actually spent in Pango, that we don’t call Pango functions too often given the number of lines we have in a document, and…

Well. Creating a layout and using it to render only one line is so stupid. But that’s also why it’s "simple" (@Tontyna wouldn’t agree) and at least maintainable.

  • 1% is pydyf's write, which isn't bad. (Half of that is in deflate.)

Cairo didn’t help for that :).

  • < 1% of time is spent in subsetting + writing out the TTF, which feels practically reasonable.

fonttools/fonttools#2467 helped. We have larger percentages when a lot of fonts are embedded.

> I suspect I'd mostly be looking to improve things somewhere inside remake_page, and possibly by seeing if there's anything we can memoize on backgrounds or margin boxes. draw_text is 14% of our weight, and get_first_line takes about 10% total, so if there are obvious things to do on either of those, they could be impactful as well. (I have fantasized about replacing Pango's line breaking with something more purpose-built on HarfBuzz, but I doubt I'll have the time. That said, simply caching get_first_line calls might be of some use.)

That’s a paragraph I could have written (well, I actually wrote bea6cef).

> But again, it looks like there won't be many obvious quick wins in a "simple" document. I think there are a few gotchas around multi-column lists that are worth digging into, and they'll probably exist in some other edge cases, but that's more an effort of finding documents that hit them than "general" optimization.

Multi-column and flex are sometimes terribly broken and terribly slow. But let’s keep some work for version 56 :).