ezyang / htmlpurifier

Standards compliant HTML filter written in PHP
http://htmlpurifier.org
GNU Lesser General Public License v2.1

Cache memory leak when using PHP 7.4.x #270

Open eazrael opened 4 years ago

eazrael commented 4 years ago

I am currently investigating a memory leak with PHP 7.4.x. I upgraded PHP from 7.2.26 to 7.4.10 and subsequently htmlpurifier from 4.10 to 4.13, as 4.10 is not compatible with PHP 7.4. Since then I have had a serious problem with memory leaks: in my application, a couple of dozen calls can leak ~512 MB. I am still investigating the root cause, but I hope somebody has an idea of what might be happening. I will try to strip my application down to the minimum code required to reproduce the issue.

Things I found out so far:

Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 20480 bytes) in /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier/DefinitionCache/Serializer.php on line 73
...
   45.9637  516876080  16. HTMLPurifier->purify(string(7), ???) /home/someproject/Classes/Util/HTMLSanitizer.php:164
   45.9637  516879240  17. HTMLPurifier_Generator->__construct(class HTMLPurifier_HTML5Config, class HTMLPurifier_Context) /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier.php:158
   45.9637  516879240  18. HTMLPurifier_HTML5Config->getHTMLDefinition(???, ???) /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier/Generator.php:74
   45.9637  516879240  19. HTMLPurifier_HTML5Config->getDefinition(string(4), false, false) /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier/Config.php:415
   45.9637  516879240  20. HTMLPurifier_HTML5Config->getDefinition(string(4), true, true) /home/someproject/libs/htmlpurifier-html5-master/library/HTMLPurifier/HTML5Config.php:86
   45.9637  516879240  21. HTMLPurifier_DefinitionCache_Decorator_Cleanup->get(class HTMLPurifier_HTML5Config) /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier/Config.php:579
   45.9637  516879240  22. HTMLPurifier_DefinitionCache_Decorator_Cleanup->get(class HTMLPurifier_HTML5Config) /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier/DefinitionCache/Decorator/Cleanup.php:70
   45.9637  516879240  23. HTMLPurifier_DefinitionCache_Serializer->get(class HTMLPurifier_HTML5Config) /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier/DefinitionCache/Decorator.php:81
   45.9640  517026992  24. unserialize(string(132328)) /home/someproject/libs/htmlpurifier-4.12.0-lite/library/HTMLPurifier/DefinitionCache/Serializer.php:73

config:

"AutoFormat.AutoParagraph" => true,
"AutoFormat.Linkify" => true,
"AutoFormat.RemoveEmpty" => true, 
"AutoFormat.RemoveSpansWithoutAttributes" => true,
"Core.RemoveProcessingInstructions" => true,
"URI.AllowedSchemes" => array (
    'http' => true,
    'https' => true,
    'mailto' => true
),
"URI.DefaultScheme" => "https",
"Output.TidyFormat" => true,
"HTML.ForbiddenAttributes" => array("class", "@data-community-tooltip"),
"HTML.ForbiddenElements" => [ "iframe", "form", "button", "input", "body", "html", "frameset", "head", "meta", "script", "style" ],
"Attr.ForbiddenClasses" => array("bb_ul", "bb_tag"),
"Core.CollectErrors" => true,
"Cache.SerializerPath" => "/some/path"

More info will follow.
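
Until a stripped-down reproduction is available, the growth could be tracked with something like the following. This is only a sketch under assumptions: `sampleMemoryGrowth` is a hypothetical helper, the sample input is a placeholder, and building a fresh config on every call is a guess about what the application does (it is what forces the cached definition to be unserialized each time). The HTML Purifier part only runs if the library has already been loaded.

```php
<?php
/**
 * Call $step repeatedly and record memory_get_usage() every $every calls.
 * Returns an array of [number of calls => bytes in use].
 * (Hypothetical helper, not part of HTML Purifier.)
 */
function sampleMemoryGrowth(callable $step, int $iterations, int $every = 10): array
{
    $samples = [];
    for ($i = 1; $i <= $iterations; $i++) {
        $step($i);
        if ($i % $every === 0) {
            // Collect cycles first so that only memory still referenced is counted.
            gc_collect_cycles();
            $samples[$i] = memory_get_usage();
        }
    }
    return $samples;
}

// Runs only when HTML Purifier has been loaded (e.g. via HTMLPurifier.auto.php).
if (class_exists('HTMLPurifier')) {
    $samples = sampleMemoryGrowth(function () {
        // Assumption: a new config per call, so the definition cache is read every time.
        $config = HTMLPurifier_Config::createDefault();
        $config->set('Cache.SerializerPath', '/some/path'); // placeholder from the report
        $config->set('AutoFormat.AutoParagraph', true);
        $config->set('AutoFormat.Linkify', true);
        (new HTMLPurifier($config))->purify('<p>example</p>');
    }, 100);

    foreach ($samples as $calls => $bytes) {
        printf("%4d calls: %.2f MB\n", $calls, $bytes / 1048576);
    }
}
```

If the numbers keep climbing from sample to sample, the leak is reproducible outside the application; if they level off, the cause is more likely in the surrounding code.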

jahrralf commented 3 years ago

I have the same issue on 7.4. We send all emails through HTML Purifier, and the process normally stops at 2500 emails (by then it is using a few hundred MB of memory). With $config->set('Cache.DefinitionImpl', null);, memory consumption stays low (14 MB), but it is not as fast.
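
For context, the workaround mentioned above would sit in the setup code roughly as follows. This is a sketch, not the commenter's actual code; the require path is a placeholder.

```php
<?php
// Placeholder path: adjust to wherever HTML Purifier is installed.
require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Disable the definition cache: definitions are rebuilt in memory
// instead of being serialized to disk and unserialized again.
$config->set('Cache.DefinitionImpl', null);

$purifier = new HTMLPurifier($config);
echo $purifier->purify('<p>example</p>');
```

The trade-off is the one described in the comment: no cached definition is read back, so setup is slower.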

PHPGangsta commented 3 years ago

I'm in the process of switching to 7.4, and I was worried about your issue regarding memory consumption/leak. So I'm testing, and I cannot reproduce your problem.

I have 84 .eml files of HTML emails, 953 MB in total. The biggest file is 52 MB, the smallest is 3 KB.

4 test cases in total:

1. PHP 7.1, HTMLPurifier 4.9.3
2. PHP 7.1, HTMLPurifier 4.13.0
3. PHP 7.4, HTMLPurifier 4.9.3
4. PHP 7.4, HTMLPurifier 4.13.0

As you can see below, all 4 have roughly the same memory usage. PHP 7.4 uses about 1% more memory, but is about 10% faster. (The deprecation notices in case 3 are expected; HTMLPurifier 4.9.3 is not compatible with PHP 7.4.)

$ /usr/bin/php7.1 htmlpurify1.php old
PHP Version: 7.1.33-34+ubuntu18.04.1+deb.sury.org+1
HTMLPurifier Version: 4.9.3
Memory Usage: 195.29 MB
Memory Real Usage: 213.36 MB
Seconds: 40.261646032333
$ /usr/bin/php7.1 htmlpurify1.php new
PHP Version: 7.1.33-34+ubuntu18.04.1+deb.sury.org+1
HTMLPurifier Version: 4.13.0
Memory Usage: 195.35 MB
Memory Real Usage: 213.36 MB
Seconds: 41.45220208168
$ /usr/bin/php7.4 htmlpurify1.php old

Deprecated: Array and string offset access syntax with curly braces is deprecated in /htmlpurifier-4.9.3/library/HTMLPurifier/Encoder.php on line 162

Deprecated: Array and string offset access syntax with curly braces is deprecated in /htmlpurifier-4.9.3/library/HTMLPurifier/ChildDef/Custom.php on line 48

Deprecated: Array and string offset access syntax with curly braces is deprecated in /htmlpurifier-4.9.3/library/HTMLPurifier/TagTransform/Font.php on line 78

Deprecated: Array and string offset access syntax with curly braces is deprecated in /htmlpurifier-4.9.3/library/HTMLPurifier/TagTransform/Font.php on line 78

Deprecated: __autoload() is deprecated, use spl_autoload_register() instead in /htmlpurifier-4.9.3/library/HTMLPurifier.autoload.php on line 17
PHP Version: 7.4.21
HTMLPurifier Version: 4.9.3
Memory Usage: 196.16 MB
Memory Real Usage: 215.45 MB
Seconds: 35.772937059402
$ /usr/bin/php7.4 htmlpurify1.php new
PHP Version: 7.4.21
HTMLPurifier Version: 4.13.0
Memory Usage: 196.22 MB
Memory Real Usage: 215.45 MB
Seconds: 36.35814499855

If I just purify the largest 52 MB file, I get these numbers:

$ /usr/bin/php7.4 htmlpurify1.php new
PHP Version: 7.4.21
HTMLPurifier Version: 4.13.0
Memory Usage: 184.86 MB
Memory Real Usage: 186.09 MB
Seconds: 0.97783088684082

Purifying 953 MB instead of 52 MB increases memory usage a bit, but not by much.

If I disable the cache with $config->set('Cache.DefinitionImpl', null);, nothing changes: memory consumption and runtime are the same. If I enable the cache, it generates .ser files, but at least in my case it does not bring any performance improvement.
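
The htmlpurify1.php script itself is not included in this thread. For anyone who wants to produce comparable figures, a minimal sketch of such a measurement might look like this. It assumes that "Memory Usage" and "Memory Real Usage" correspond to `memory_get_peak_usage(false)` and `memory_get_peak_usage(true)`, which is a guess; `measure` and `formatMb` are hypothetical helpers, and the workload is a placeholder for the actual purify loop.

```php
<?php
/**
 * Run $work once and report wall-clock time and peak memory.
 * (Hypothetical helper; the original script may measure differently.)
 */
function measure(callable $work): array
{
    $start = microtime(true);
    $work();
    return [
        'seconds'   => microtime(true) - $start,
        'peak'      => memory_get_peak_usage(false), // memory used by PHP's allocator
        'peak_real' => memory_get_peak_usage(true),  // memory requested from the system
    ];
}

/** Format a byte count as megabytes with two decimals. */
function formatMb(int $bytes): string
{
    return sprintf('%.2f MB', $bytes / 1048576);
}

$stats = measure(function () {
    // Placeholder workload; replace with a loop that purifies each test file.
    $data = str_repeat('x', 1024 * 1024);
    strlen($data);
});

echo 'PHP Version: ', PHP_VERSION, "\n";
if (class_exists('HTMLPurifier')) {
    echo 'HTMLPurifier Version: ', HTMLPurifier::VERSION, "\n";
}
echo 'Memory Usage: ', formatMb($stats['peak']), "\n";
echo 'Memory Real Usage: ', formatMb($stats['peak_real']), "\n";
echo 'Seconds: ', $stats['seconds'], "\n";
```

Peak values hide whether memory grows steadily or jumps once for the largest file, so sampling the current usage per file as well would make a leak easier to spot.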

@eazrael @jahrralf Can you provide test files so I can reproduce your performance problems?

jahrralf commented 3 years ago

Sorry - I cannot provide test data.