Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License

Memory consumption increases continuously when generating multiple PDFs in a Loop. #2130

Closed masahh closed 4 months ago

masahh commented 4 months ago

First of all, thank you very much for developing and maintaining this incredibly useful library!

We are encountering the following issues regarding memory consumption when converting HTML to PDF:

We have created a minimal setup to reproduce this problem: https://github.com/yamap55/weasyprint_memory_check (using Python==3.9.7 and 3.12.3, WeasyPrint==61.2, memory_profiler==0.61.0)

In the above repository, when running the container and executing python main.py:

liZe commented 4 months ago

Hi!

Thanks for your report.

WeasyPrint can take a lot of memory, that’s a known behavior and we’re open to solutions to improve this. But memory leaks are a different problem.

  • Approximately tens of MiBs increase in memory consumption per PDF generation when the HTML includes multibyte characters.

Many bug reports like this have already been opened, and we have to be sure that it’s a real memory leak. A few (~20) generations are not enough to detect this, because Python’s interpreter can do what it wants with memory. There’s an interesting issue about this, showing that what may appear to be a memory leak is not necessarily one: https://github.com/Kozea/WeasyPrint/issues/1977

So, you can try with 200+ generations and see if you’ve found a "real" memory leak.

That being said, your problem seems to be related to fonts, just as is #1977. Even if it’s not a memory leak, maybe there’s something we can do about this.
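A minimal sketch of such a long-running check, using only the standard library on Unix. The helper names (`peak_rss_mib`, `sample_memory`, `has_plateaued`) and the thresholds are assumptions made for illustration, not part of WeasyPrint or memory_profiler. The idea is to call the generation function a few hundred times, record the process's peak memory after each call, and check whether the high-water mark stops rising.

```python
import gc
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def sample_memory(generate, iterations=200, read_mib=peak_rss_mib):
    """Call generate() repeatedly and record memory after each call."""
    samples = []
    for _ in range(iterations):
        generate()
        gc.collect()  # drop unreachable objects before measuring
        samples.append(read_mib())
    return samples


def has_plateaued(samples, tail=50, tolerance_mib=1.0):
    """True if the high-water mark did not grow over the last `tail` samples."""
    if len(samples) <= tail:
        return False  # not enough data to compare head and tail
    return max(samples[-tail:]) - max(samples[:-tail]) <= tolerance_mib
```

Here `generate` would be whatever function renders one PDF. A sawtooth that keeps returning to the same ceiling counts as a plateau, while steady growth with no ceiling does not.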

masahh commented 4 months ago

Thank you for your reply!

So, you can try with 200+ generations and see if you’ve found a "real" memory leak.

I tried generating 200 PDFs, and similar to #1977, the memory usage is stable from around the 80th iteration. (At that point, the usage was approximately 2.8 GiB.)

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     42.9 MiB     42.9 MiB           1   @profile()
     8                                         def create_pdf(file_path: str):
     9    116.0 MiB     73.1 MiB           1       generate(file_path)
    10    155.2 MiB     39.1 MiB           1       generate(file_path)
    11    195.4 MiB     40.2 MiB           1       generate(file_path)
    12    251.5 MiB     56.2 MiB           1       generate(file_path)
   [...]
    21    583.0 MiB     56.1 MiB           1       generate(file_path)
    22    639.2 MiB     56.2 MiB           1       generate(file_path)
    23    636.2 MiB     -3.0 MiB           1       generate(file_path)
    24    691.3 MiB     55.0 MiB           1       generate(file_path)
    25    749.5 MiB     58.2 MiB           1       generate(file_path)
    26    743.8 MiB     -5.6 MiB           1       generate(file_path)
    27    799.1 MiB     55.3 MiB           1       generate(file_path)
    28    857.0 MiB     57.8 MiB           1       generate(file_path)
    29    910.1 MiB     53.1 MiB           1       generate(file_path)
    30    889.1 MiB    -21.0 MiB           1       generate(file_path)
    31    944.3 MiB     55.2 MiB           1       generate(file_path)
    32    999.9 MiB     55.6 MiB           1       generate(file_path)
    33   1053.9 MiB     54.0 MiB           1       generate(file_path)
    34   1038.1 MiB    -15.9 MiB           1       generate(file_path)
    35   1093.4 MiB     55.3 MiB           1       generate(file_path)
    36   1146.3 MiB     53.0 MiB           1       generate(file_path)
   [...]
    72   2574.5 MiB     56.1 MiB           1       generate(file_path)
    73   2421.6 MiB   -152.8 MiB           1       generate(file_path)
    74   2491.7 MiB     70.1 MiB           1       generate(file_path)
    75   2547.2 MiB     55.6 MiB           1       generate(file_path)
    76   2603.2 MiB     55.9 MiB           1       generate(file_path)
    77   2658.5 MiB     55.3 MiB           1       generate(file_path)
    78   2714.4 MiB     55.9 MiB           1       generate(file_path)
    79   2770.6 MiB     56.3 MiB           1       generate(file_path)
    80   2825.3 MiB     54.6 MiB           1       generate(file_path)
    81   2880.6 MiB     55.3 MiB           1       generate(file_path)
    82   2936.4 MiB     55.8 MiB           1       generate(file_path)
    83   2456.2 MiB   -480.2 MiB           1       generate(file_path)
    84   2508.1 MiB     51.9 MiB           1       generate(file_path)
    85   2561.7 MiB     53.6 MiB           1       generate(file_path)
    86   2615.5 MiB     53.8 MiB           1       generate(file_path)
    87   2669.1 MiB     53.6 MiB           1       generate(file_path)
    88   2722.8 MiB     53.8 MiB           1       generate(file_path)
    89   2776.6 MiB     53.8 MiB           1       generate(file_path)
    90   2830.7 MiB     54.1 MiB           1       generate(file_path)
    91   2885.8 MiB     55.1 MiB           1       generate(file_path)
    92   2940.7 MiB     54.9 MiB           1       generate(file_path)
    93   2996.3 MiB     55.6 MiB           1       generate(file_path)
    94   2460.7 MiB   -535.6 MiB           1       generate(file_path)
    95   2511.5 MiB     50.7 MiB           1       generate(file_path)
    96   2565.1 MiB     53.6 MiB           1       generate(file_path)
   [...]
   188   2866.4 MiB     19.5 MiB           1       generate(file_path)
   189   2885.8 MiB     19.4 MiB           1       generate(file_path)
   190   2905.2 MiB     19.4 MiB           1       generate(file_path)
   191   2924.7 MiB     19.5 MiB           1       generate(file_path)
   192   2944.0 MiB     19.4 MiB           1       generate(file_path)
   193   2963.5 MiB     19.5 MiB           1       generate(file_path)
   194   2982.9 MiB     19.4 MiB           1       generate(file_path)
   195   2808.2 MiB   -174.7 MiB           1       generate(file_path)
   196   2827.6 MiB     19.4 MiB           1       generate(file_path)
   197   2846.9 MiB     19.4 MiB           1       generate(file_path)
   198   2866.4 MiB     19.5 MiB           1       generate(file_path)
   199   2885.8 MiB     19.4 MiB           1       generate(file_path)
   200   2905.2 MiB     19.4 MiB           1       generate(file_path)
   201   2924.7 MiB     19.5 MiB           1       generate(file_path)
   202   2944.1 MiB     19.4 MiB           1       generate(file_path)
   203   2963.6 MiB     19.5 MiB           1       generate(file_path)
   204   2983.1 MiB     19.5 MiB           1       generate(file_path)
   205   2808.3 MiB   -174.8 MiB           1       generate(file_path)
   206   2827.6 MiB     19.4 MiB           1       generate(file_path)
   207   2847.0 MiB     19.4 MiB           1       generate(file_path)
   208   2866.5 MiB     19.5 MiB           1       generate(file_path)
   209   2866.5 MiB      0.0 MiB           1       return True

Actually, we encountered memory errors in a container with a memory limit of 1-2GB. Therefore, it might be necessary to increase the memory limit for the container.

That being said, your problem seems to be related to fonts, just as is https://github.com/Kozea/WeasyPrint/issues/1977. Even if it’s not a memory leak, maybe there’s something we can do about this.

Regarding the font-related problem, I tried the suggested method below but it resulted in an error. It seems that it doesn't work with the type of font we are using. https://github.com/Kozea/WeasyPrint/issues/1977#issuecomment-1987002917

Code:

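import weasyprint


class PdfWriter:
    # Class-level dict, shared by every PdfWriter instance and every call.
    _fonts = {}

    def write_pdf(self, html_str):
        doc = weasyprint.HTML(string=html_str).render()
        # Hack from #1977: reuse the same fonts dict across documents.
        doc.fonts = self._fonts
        doc.write_pdf(None)


PdfWriter().write_pdf(f"<div>{'<div>あ</div>' * 20}</div>")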

Output:

Traceback (most recent call last):
  File "/app/main.py", line 36, in <module>
    create_pdf(html_str)
  File "/usr/local/lib/python3.9/site-packages/memory_profiler.py", line 1188, in wrapper
    val = prof(func)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/memory_profiler.py", line 761, in f
    return func(*args, **kwds)
  File "/app/main.py", line 10, in create_pdf
    generate(file_path)
  File "/app/main.py", line 28, in generate
    PdfWriter().write_pdf(html_str)
  File "/app/main.py", line 21, in write_pdf
    doc.write_pdf()
  File "/usr/local/lib/python3.9/site-packages/weasyprint/document.py", line 399, in write_pdf
    pdf = generate_pdf(self, target, zoom, **options)
  File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/__init__.py", line 268, in generate_pdf
    pdf_fonts = build_fonts_dictionary(
  File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/fonts.py", line 27, in build_fonts_dictionary
    font.clean(cmap, hinting)
  File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/stream.py", line 116, in clean
    subsetter.subset(self.ttfont)
  File "/usr/local/lib/python3.9/site-packages/fontTools/subset/__init__.py", line 3499, in subset
    self._subset_glyphs(font)
  File "/usr/local/lib/python3.9/site-packages/fontTools/subset/__init__.py", line 3423, in _subset_glyphs
    retain = table.subset_glyphs(self)
  File "/usr/local/lib/python3.9/site-packages/fontTools/subset/cff.py", line 111, in subset_glyphs
    del csi.file, csi.offsets
AttributeError: file

liZe commented 4 months ago

Regarding the font-related problem, I tried the suggested method below but it resulted in an error. It seems that it doesn't work with the type of font we are using.

The snippet is just a hack that could help in specific cases. We have yet to find a reliable way to fix this problem.

liZe commented 4 months ago

OK, I’ve found where the problem comes from:

https://github.com/Kozea/WeasyPrint/blob/3a208fe36b21b6dd3ee70ac55a0e0bb261686687/weasyprint/pdf/stream.py#L329-L340

This method is cached, meaning that the font is stored in memory once for each Font object. Of course, it’s important to have a cache to avoid calculating the key multiple times for the same Pango font. But storing the generated font is stupid, because it is way too big to be stored in memory.

Let’s just store the (Pango font + key) couple instead!

liZe commented 4 months ago

Before:

pmem(rss=49848320, vms=69750784, shared=16351232, text=4096, lib=0, data=34238464, dirty=0)
pmem(rss=142221312, vms=242700288, shared=23134208, text=4096, lib=0, data=136609792, dirty=0)
pmem(rss=208916480, vms=309448704, shared=22990848, text=4096, lib=0, data=203358208, dirty=0)
pmem(rss=282812416, vms=384045056, shared=22990848, text=4096, lib=0, data=277954560, dirty=0)
pmem(rss=344322048, vms=445583360, shared=22990848, text=4096, lib=0, data=339492864, dirty=0)
pmem(rss=407535616, vms=508080128, shared=22990848, text=4096, lib=0, data=401989632, dirty=0)
pmem(rss=478806016, vms=579670016, shared=22908928, text=4096, lib=0, data=473579520, dirty=0)
pmem(rss=538480640, vms=639197184, shared=22908928, text=4096, lib=0, data=533106688, dirty=0)
pmem(rss=598171648, vms=698810368, shared=22990848, text=4096, lib=0, data=592719872, dirty=0)
pmem(rss=662573056, vms=829816832, shared=22908928, text=4096, lib=0, data=657289216, dirty=0)
pmem(rss=734072832, vms=901210112, shared=22908928, text=4096, lib=0, data=728682496, dirty=0)
pmem(rss=794628096, vms=961871872, shared=22908928, text=4096, lib=0, data=789344256, dirty=0)
pmem(rss=858517504, vms=1025773568, shared=22908928, text=4096, lib=0, data=853245952, dirty=0)
pmem(rss=927875072, vms=1095065600, shared=22908928, text=4096, lib=0, data=922537984, dirty=0)
pmem(rss=986566656, vms=1154682880, shared=22908928, text=4096, lib=0, data=982155264, dirty=0)
pmem(rss=1058230272, vms=1225568256, shared=22908928, text=4096, lib=0, data=1053040640, dirty=0)
pmem(rss=1121529856, vms=1289117696, shared=22908928, text=4096, lib=0, data=1116590080, dirty=0)
pmem(rss=1191395328, vms=1359392768, shared=22908928, text=4096, lib=0, data=1186865152, dirty=0)
pmem(rss=1249349632, vms=1417478144, shared=22908928, text=4096, lib=0, data=1244950528, dirty=0)
pmem(rss=1311813632, vms=1479188480, shared=22908928, text=4096, lib=0, data=1306660864, dirty=0)
pmem(rss=1380007936, vms=1547632640, shared=22908928, text=4096, lib=0, data=1375105024, dirty=0)

After:

pmem(rss=49188864, vms=69419008, shared=16166912, text=4096, lib=0, data=33906688, dirty=0)
pmem(rss=102486016, vms=269824000, shared=22962176, text=4096, lib=0, data=96722944, dirty=0)
pmem(rss=106692608, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106500096, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106704896, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106528768, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106418176, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106651648, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107048960, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107053056, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107094016, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107057152, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107024384, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107048960, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107008000, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107200512, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107200512, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107204608, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
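To compare the two logs numerically, a small sketch can pull the `rss` field out of each `pmem(...)` line and compute the average change per sample. The helper names are made up for this example, and skipping the first sample as a warm-up is an assumption about how the log was produced.

```python
import re

# Matches the rss field at the start of a line such as "pmem(rss=49848320, ...)".
RSS_RE = re.compile(r"pmem\(rss=(\d+)")


def rss_values(log_text):
    """Extract rss values (in bytes) from pmem(...) log lines."""
    return [int(m.group(1)) for m in RSS_RE.finditer(log_text)]


def mean_growth_mib(rss, skip=1):
    """Average RSS change per sample in MiB, ignoring the first `skip` samples."""
    kept = rss[skip:]
    if len(kept) < 2:
        raise ValueError("need at least two samples after skipping")
    return (kept[-1] - kept[0]) / (len(kept) - 1) / 2**20
```

Applied to the logs above with `skip=1`, this gives roughly 62 MiB of growth per sample before the change and well under 1 MiB after it.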

iqbalhusen commented 4 months ago

OK, I’ve found where the problem comes from:

https://github.com/Kozea/WeasyPrint/blob/3a208fe36b21b6dd3ee70ac55a0e0bb261686687/weasyprint/pdf/stream.py#L329-L340

This method is cached, meaning that the font is stored in memory once for each Font object. Of course, it’s important to have a cache to avoid calculating the key multiple times for the same Pango font. But storing the generated font is stupid, because it is way too big to be stored in memory.

Let’s just store the (Pango font + key) couple instead!

This pointed me in the right direction after I had struggled for a whole day with a font-caching issue. When I generated PDFs from a static Japanese webpage in a loop, the first 2-3 PDFs were generated correctly, but subsequent PDFs contained garbage text, and I could not figure out why, because the input was the same webpage every time. Once I started invalidating the cache after every PDF generation, the results were as expected. This is not mentioned in the documentation.

liZe commented 4 months ago

This is not mentioned in the documentation.

That’s because it’s a bug, and it’s fixed in the latest release. Update WeasyPrint and the problem will be gone.

See #2144 and #1977.