Closed by masahh 4 months ago
Hi!
Thanks for your report.
WeasyPrint can take a lot of memory; that’s a known behavior and we’re open to solutions to improve this. But a memory leak is a different problem.
> - Approximately tens of MiBs increase in memory consumption per PDF generation when the HTML includes multibyte characters.
Many bug reports like this have already been opened, and we have to be sure that it’s a real memory leak. A few (~20) generations are not enough to detect this, because Python’s interpreter can do what it wants with memory. There’s an interesting issue about this, showing that what may appear to be a memory leak is not necessarily one: https://github.com/Kozea/WeasyPrint/issues/1977
So, you can try with 200+ generations and see if you find a "real" memory leak.
That being said, your problem seems to be related to fonts, just as is #1977. Even if it’s not a memory leak, maybe there’s something we can do about this.
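A minimal sketch of such a long-run check, for readers who want to try it. The helper names `sample_memory` and `has_plateaued` are hypothetical and are not part of WeasyPrint; the memory reader is passed in as a callable, and the plateau rule (compare the peak of the first and second half after a warm-up period) is only one possible heuristic.

```python
import gc


def sample_memory(task, read_memory, iterations=200):
    """Run task() repeatedly and record read_memory() after each run."""
    samples = []
    for _ in range(iterations):
        task()
        gc.collect()  # release garbage that is only waiting for collection
        samples.append(read_memory())
    return samples


def has_plateaued(samples, warmup=100, tolerance=0.05):
    """Return True if the peak after warm-up stops growing.

    The samples after the warm-up period are split in two halves. Usage is
    considered stable if the peak of the second half is at most `tolerance`
    (as a fraction) above the peak of the first half.
    """
    steady = samples[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two samples after the warm-up period")
    half = len(steady) // 2
    first_peak = max(steady[:half])
    second_peak = max(steady[half:])
    return second_peak <= first_peak * (1 + tolerance)


# Possible usage (assumes psutil and weasyprint are installed):
#   import psutil, weasyprint
#   proc = psutil.Process()
#   samples = sample_memory(
#       lambda: weasyprint.HTML(string=html_str).write_pdf(),
#       lambda: proc.memory_info().rss,
#   )
#   print(has_plateaued(samples))
```

Comparing peaks rather than consecutive samples makes the check tolerant of the sawtooth pattern that garbage collection produces.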
Thank you for your reply!
> So, you can try with 200+ generations and see if you find a "real" memory leak.
I tried generating 200 PDFs and, similar to #1977, the memory usage stabilizes from around the 80th iteration. (At that point, the usage was approximately 2.8 GiB.)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
7 42.9 MiB 42.9 MiB 1 @profile()
8 def create_pdf(file_path: str):
9 116.0 MiB 73.1 MiB 1 generate(file_path)
10 155.2 MiB 39.1 MiB 1 generate(file_path)
11 195.4 MiB 40.2 MiB 1 generate(file_path)
12 251.5 MiB 56.2 MiB 1 generate(file_path)
[...]
21 583.0 MiB 56.1 MiB 1 generate(file_path)
22 639.2 MiB 56.2 MiB 1 generate(file_path)
23 636.2 MiB -3.0 MiB 1 generate(file_path)
24 691.3 MiB 55.0 MiB 1 generate(file_path)
25 749.5 MiB 58.2 MiB 1 generate(file_path)
26 743.8 MiB -5.6 MiB 1 generate(file_path)
27 799.1 MiB 55.3 MiB 1 generate(file_path)
28 857.0 MiB 57.8 MiB 1 generate(file_path)
29 910.1 MiB 53.1 MiB 1 generate(file_path)
30 889.1 MiB -21.0 MiB 1 generate(file_path)
31 944.3 MiB 55.2 MiB 1 generate(file_path)
32 999.9 MiB 55.6 MiB 1 generate(file_path)
33 1053.9 MiB 54.0 MiB 1 generate(file_path)
34 1038.1 MiB -15.9 MiB 1 generate(file_path)
35 1093.4 MiB 55.3 MiB 1 generate(file_path)
36 1146.3 MiB 53.0 MiB 1 generate(file_path)
[...]
72 2574.5 MiB 56.1 MiB 1 generate(file_path)
73 2421.6 MiB -152.8 MiB 1 generate(file_path)
74 2491.7 MiB 70.1 MiB 1 generate(file_path)
75 2547.2 MiB 55.6 MiB 1 generate(file_path)
76 2603.2 MiB 55.9 MiB 1 generate(file_path)
77 2658.5 MiB 55.3 MiB 1 generate(file_path)
78 2714.4 MiB 55.9 MiB 1 generate(file_path)
79 2770.6 MiB 56.3 MiB 1 generate(file_path)
80 2825.3 MiB 54.6 MiB 1 generate(file_path)
81 2880.6 MiB 55.3 MiB 1 generate(file_path)
82 2936.4 MiB 55.8 MiB 1 generate(file_path)
83 2456.2 MiB -480.2 MiB 1 generate(file_path)
84 2508.1 MiB 51.9 MiB 1 generate(file_path)
85 2561.7 MiB 53.6 MiB 1 generate(file_path)
86 2615.5 MiB 53.8 MiB 1 generate(file_path)
87 2669.1 MiB 53.6 MiB 1 generate(file_path)
88 2722.8 MiB 53.8 MiB 1 generate(file_path)
89 2776.6 MiB 53.8 MiB 1 generate(file_path)
90 2830.7 MiB 54.1 MiB 1 generate(file_path)
91 2885.8 MiB 55.1 MiB 1 generate(file_path)
92 2940.7 MiB 54.9 MiB 1 generate(file_path)
93 2996.3 MiB 55.6 MiB 1 generate(file_path)
94 2460.7 MiB -535.6 MiB 1 generate(file_path)
95 2511.5 MiB 50.7 MiB 1 generate(file_path)
96 2565.1 MiB 53.6 MiB 1 generate(file_path)
[...]
188 2866.4 MiB 19.5 MiB 1 generate(file_path)
189 2885.8 MiB 19.4 MiB 1 generate(file_path)
190 2905.2 MiB 19.4 MiB 1 generate(file_path)
191 2924.7 MiB 19.5 MiB 1 generate(file_path)
192 2944.0 MiB 19.4 MiB 1 generate(file_path)
193 2963.5 MiB 19.5 MiB 1 generate(file_path)
194 2982.9 MiB 19.4 MiB 1 generate(file_path)
195 2808.2 MiB -174.7 MiB 1 generate(file_path)
196 2827.6 MiB 19.4 MiB 1 generate(file_path)
197 2846.9 MiB 19.4 MiB 1 generate(file_path)
198 2866.4 MiB 19.5 MiB 1 generate(file_path)
199 2885.8 MiB 19.4 MiB 1 generate(file_path)
200 2905.2 MiB 19.4 MiB 1 generate(file_path)
201 2924.7 MiB 19.5 MiB 1 generate(file_path)
202 2944.1 MiB 19.4 MiB 1 generate(file_path)
203 2963.6 MiB 19.5 MiB 1 generate(file_path)
204 2983.1 MiB 19.5 MiB 1 generate(file_path)
205 2808.3 MiB -174.8 MiB 1 generate(file_path)
206 2827.6 MiB 19.4 MiB 1 generate(file_path)
207 2847.0 MiB 19.4 MiB 1 generate(file_path)
208 2866.5 MiB 19.5 MiB 1 generate(file_path)
209 2866.5 MiB 0.0 MiB 1 return True
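The tail of this output shows a sawtooth: usage climbs by about 19.4 MiB per call, then drops by about 175 MiB roughly every tenth call, while the peak stays near 2983 MiB. A small sketch for extracting the peak and the drops from memory_profiler rows follows; `parse_rows` and `summarize` are hypothetical helper names, and the regular expression assumes the column layout shown above.

```python
import re

# line number, memory usage, increment, occurrences, source text
ROW = re.compile(
    r"^\s*(\d+)\s+(-?[\d.]+)\s*MiB\s+(-?[\d.]+)\s*MiB\s+(\d+)\s+(.*)$"
)


def parse_rows(text):
    """Return (line_no, usage_mib, increment_mib) for each data row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # headers, "[...]" and rows without numbers are skipped
            rows.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return rows


def summarize(rows):
    """Return the peak usage and the list of (line_no, increment) drops."""
    peak = max(usage for _, usage, _ in rows)
    drops = [(line_no, inc) for line_no, _, inc in rows if inc < 0]
    return peak, drops
```

If the peak is the same across several sawtooth cycles, the process has most likely reached a steady state rather than leaking.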
Actually, we encountered memory errors in a container with a memory limit of 1-2 GB. Therefore, it might be necessary to increase the memory limit for the container.
> That being said, your problem seems to be related to fonts, just as is https://github.com/Kozea/WeasyPrint/issues/1977. Even if it’s not a memory leak, maybe there’s something we can do about this.
Regarding the font-related problem, I tried the suggested method below but it resulted in an error. It seems that it doesn't work with the type of font we are using. https://github.com/Kozea/WeasyPrint/issues/1977#issuecomment-1987002917
Code:
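    import weasyprint

    class PdfWriter:
        # Class-level dict, shared by every PdfWriter instance.
        _fonts = {}

        def write_pdf(self, html_str):
            doc = weasyprint.HTML(string=html_str).render()
            # Replace the document's font dict with the shared one (the hack from #1977).
            doc.fonts = self._fonts
            doc.write_pdf(None)

    PdfWriter().write_pdf(f"<div>{'<div>あ</div>' * 20}</div>")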
Output:
    Traceback (most recent call last):
      File "/app/main.py", line 36, in <module>
        create_pdf(html_str)
      File "/usr/local/lib/python3.9/site-packages/memory_profiler.py", line 1188, in wrapper
        val = prof(func)(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/memory_profiler.py", line 761, in f
        return func(*args, **kwds)
      File "/app/main.py", line 10, in create_pdf
        generate(file_path)
      File "/app/main.py", line 28, in generate
        PdfWriter().write_pdf(html_str)
      File "/app/main.py", line 21, in write_pdf
        doc.write_pdf()
      File "/usr/local/lib/python3.9/site-packages/weasyprint/document.py", line 399, in write_pdf
        pdf = generate_pdf(self, target, zoom, **options)
      File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/__init__.py", line 268, in generate_pdf
        pdf_fonts = build_fonts_dictionary(
      File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/fonts.py", line 27, in build_fonts_dictionary
        font.clean(cmap, hinting)
      File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/stream.py", line 116, in clean
        subsetter.subset(self.ttfont)
      File "/usr/local/lib/python3.9/site-packages/fontTools/subset/__init__.py", line 3499, in subset
        self._subset_glyphs(font)
      File "/usr/local/lib/python3.9/site-packages/fontTools/subset/__init__.py", line 3423, in _subset_glyphs
        retain = table.subset_glyphs(self)
      File "/usr/local/lib/python3.9/site-packages/fontTools/subset/cff.py", line 111, in subset_glyphs
        del csi.file, csi.offsets
    AttributeError: file
> Regarding the font-related problem, I tried the suggested method below but it resulted in an error. It seems that it doesn't work with the type of font we are using.
The snippet is just a hack that could help in specific cases. We have yet to find a reliable way to fix this problem.
OK, I’ve found where the problem comes from:
This method is cached, meaning that the font is stored in memory once for each Font object. Of course, it’s important to have a cache to avoid calculating the key multiple times for the same Pango font. But storing the generated font is stupid, because it is way too big to be stored in memory.
Let’s just store the (Pango font + key) couple instead!
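The idea of caching only a small key instead of the generated font can be illustrated with a toy sketch. This is not WeasyPrint's actual code: `KeyCache`, `load_font_data` and the SHA-256-based key are assumptions made purely for illustration.

```python
import hashlib


class KeyCache:
    """Cache a short key per font identifier; never retain the font bytes."""

    def __init__(self, load_font_data):
        # load_font_data is a hypothetical callable: font_id -> bytes
        self._load_font_data = load_font_data
        self._keys = {}

    def key_for(self, font_id):
        """Return the cached key, computing it once per font identifier."""
        if font_id not in self._keys:
            data = self._load_font_data(font_id)  # large, but only temporary
            self._keys[font_id] = hashlib.sha256(data).hexdigest()[:16]
        return self._keys[font_id]

    def retained_chars(self):
        """Total number of characters the cache keeps alive."""
        return sum(len(key) for key in self._keys.values())
```

The expensive computation still happens only once per font, but the cache holds 16 characters per entry instead of the whole font, so its size no longer grows with the size of the generated data.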
Before:
pmem(rss=49848320, vms=69750784, shared=16351232, text=4096, lib=0, data=34238464, dirty=0)
pmem(rss=142221312, vms=242700288, shared=23134208, text=4096, lib=0, data=136609792, dirty=0)
pmem(rss=208916480, vms=309448704, shared=22990848, text=4096, lib=0, data=203358208, dirty=0)
pmem(rss=282812416, vms=384045056, shared=22990848, text=4096, lib=0, data=277954560, dirty=0)
pmem(rss=344322048, vms=445583360, shared=22990848, text=4096, lib=0, data=339492864, dirty=0)
pmem(rss=407535616, vms=508080128, shared=22990848, text=4096, lib=0, data=401989632, dirty=0)
pmem(rss=478806016, vms=579670016, shared=22908928, text=4096, lib=0, data=473579520, dirty=0)
pmem(rss=538480640, vms=639197184, shared=22908928, text=4096, lib=0, data=533106688, dirty=0)
pmem(rss=598171648, vms=698810368, shared=22990848, text=4096, lib=0, data=592719872, dirty=0)
pmem(rss=662573056, vms=829816832, shared=22908928, text=4096, lib=0, data=657289216, dirty=0)
pmem(rss=734072832, vms=901210112, shared=22908928, text=4096, lib=0, data=728682496, dirty=0)
pmem(rss=794628096, vms=961871872, shared=22908928, text=4096, lib=0, data=789344256, dirty=0)
pmem(rss=858517504, vms=1025773568, shared=22908928, text=4096, lib=0, data=853245952, dirty=0)
pmem(rss=927875072, vms=1095065600, shared=22908928, text=4096, lib=0, data=922537984, dirty=0)
pmem(rss=986566656, vms=1154682880, shared=22908928, text=4096, lib=0, data=982155264, dirty=0)
pmem(rss=1058230272, vms=1225568256, shared=22908928, text=4096, lib=0, data=1053040640, dirty=0)
pmem(rss=1121529856, vms=1289117696, shared=22908928, text=4096, lib=0, data=1116590080, dirty=0)
pmem(rss=1191395328, vms=1359392768, shared=22908928, text=4096, lib=0, data=1186865152, dirty=0)
pmem(rss=1249349632, vms=1417478144, shared=22908928, text=4096, lib=0, data=1244950528, dirty=0)
pmem(rss=1311813632, vms=1479188480, shared=22908928, text=4096, lib=0, data=1306660864, dirty=0)
pmem(rss=1380007936, vms=1547632640, shared=22908928, text=4096, lib=0, data=1375105024, dirty=0)
After:
pmem(rss=49188864, vms=69419008, shared=16166912, text=4096, lib=0, data=33906688, dirty=0)
pmem(rss=102486016, vms=269824000, shared=22962176, text=4096, lib=0, data=96722944, dirty=0)
pmem(rss=106692608, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106500096, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106704896, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106528768, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106418176, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106651648, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107048960, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107053056, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107094016, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107057152, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107024384, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107048960, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107008000, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107200512, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107200512, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107204608, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
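To compare the two runs numerically, the `rss` field of each `pmem(...)` line can be converted into an average growth per generation. The sketch below assumes the lines have the format shown above; `tail_growth_mib` is a hypothetical helper name.

```python
import re

RSS = re.compile(r"rss=(\d+)")


def tail_growth_mib(lines, window=10):
    """Average RSS growth per generation, in MiB, over the last `window` steps."""
    values = [int(m.group(1)) for m in map(RSS.search, lines) if m]
    if len(values) <= window:
        raise ValueError("not enough samples for the requested window")
    growth_bytes = (values[-1] - values[-1 - window]) / window
    return growth_bytes / (1024 * 1024)
```

Applied to the last lines of each run, this gives tens of MiB per generation for "Before" and close to zero for "After", which matches the plateau visible in the raw numbers.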
> OK, I’ve found where the problem comes from:
> This method is cached, meaning that the font is stored in memory once for each Font object. Of course, it’s important to have a cache to avoid calculating the key multiple times for the same Pango font. But storing the generated font is stupid, because it is way too big to be stored in memory. Let’s just store the (Pango font + key) couple instead!
This pointed me in the right direction after I had struggled for a whole day with a font-caching issue. When I generated PDFs from a static Japanese webpage in a loop, the first 2-3 PDFs were generated correctly, but the subsequent PDFs contained garbled text. I could not figure out the reason, because every iteration used the same webpage. Once I started invalidating the cache after every PDF generation, the results were as expected. This is not mentioned in the documentation.
> This is not mentioned in the documentation.
That’s because it’s a bug, and it’s fixed in the latest release. Update WeasyPrint and the problem will be gone.
See #2144 and #1977.
First of all, thank you very much for developing and maintaining this incredibly useful library!
We are encountering the following issues regarding memory consumption when converting HTML to PDF:
We have created a minimal setup to reproduce this problem: https://github.com/yamap55/weasyprint_memory_check (using Python==3.9.7 and 3.12.3, WeasyPrint==61.2, memory_profiler==0.61.0)
In the above repository, the behavior can be observed by running the container and executing python main.py, which reports memory usage with memory_profiler.
Is this behavior expected? Or are there any methods to reduce memory usage in this scenario?