Closed oliverbeck closed 8 years ago
@oliverbeck Thanks for opening this issue - performance is certainly a problem and there are a lot of inefficiencies in the code and also in the approach.
I know there are some improvements that could be made in the code - unfortunately, I haven't had much time to address them. I know, for example, diffing HTML tables is pretty expensive since it's now using DOMDocument.
There is one way to improve overall performance by using caching (as of v0.1.0). You can set a Cache Provider object on the HtmlDiffConfig
object, and it will cache the generated diffs.
Caching the generated diffs will really only improve performance if you're running the same content through HtmlDiff multiple times... so you will still likely have issues, especially if your content is changing frequently.
An exception to that would be if your content has a lot of HTML tables and lists. The way HTML tables and lists are diffed, the content of table cells or list items are passed to HtmlDiff
individually, so it would improve performance in situations where the oldText and newText of individual cells/ list items has been diffed before.
Basically... there are a lot of performance improvements that need to happen... and we just need to find the time to address them. We are totally open to taking contributions from the community as well, and I'm hoping we will start to generate more interest in that :smiley:
@jschroed91 Thank you for your reply!
Can you shortly discribe how I have to set a cache object?
Hi @oliverbeck can you try using the latest checkout?
Hi @SavageTiger
If I update to the latest version (0.1.2 or dev-master) I get this PHP error:
Fatal error: Unserialization of configuration schema failed, sha1 of file was da39a3ee5e6b4b0d3255bfef95601890afd80709 in /path/to/project/vendor/ezyang/htmlpurifier/library/HTMLPurifier/ConfigSchema.php on line 75
@oliverbeck That is weird, the problem seems to be that the HTMLPurifier cannot initialize its base configuration. I guess the best thing todo is delete the purifier dependency from the vendor folder and re-run composer install (or update). P.S. the performance inhancements are only present in the latest dev-master branch
@SavageTiger So you think the htmlpurifier isn't necessary at all?
@oliverbeck No it is, but if you re-run composer after deleting the ezlang/htmlpurifier folder it will reinstall the dependency.
ah ok now I understand :-) I'll try it out
Same error... and if I switch back to v0.1.1 it works well
That is really weird, can you paste your calling code, maybe a parameter you are using triggers this issue. (also what PHP version are you using?). Also the sha1sum for my HTMLPurifier scherma.ser is 496944b8892f2e4125c31e4f84d8211cb0cfc531
Here my code snippet:
$config = new HtmlDiffConfig();
$cacheProvider = new \Doctrine\Common\Cache\PhpFileCache(PATH_site . '/phpFileCache/');
$config->setCacheProvider($cacheProvider);
$htmlDiff = HtmlDiff::create($old, $new, $config);
$content = $htmlDiff->build();
as you can see, atm I use file cache. PHP version is 5.6.15
Ok i'll try to investigate locally. My HTMLPurifier is version 4.7.0
Just tried it out without CacheProvider, same error
You might also want to compare my schema.ser to yours.
still the same, v0.1.1 works, latest dev don't =/
@oliverbeck What happens if you replace your snipped for just
$htmlDiff = new HtmlDiff($old, $new, 'UTF-8', array());
$content = $htmlDiff->build();
unfortunately the same
Ok. I have not managed to replicate the issue locally. Thank you for taking the time and helping debug the issue.
Quick question: Did you replace the path in /path/to/project/vendor/ezyang for security reasons as I initially presumed? Or does it actually say /path/to/project? In case of the ladder, I might be on to something.
Yes, I replaced it for security reasons
Ok, bummer. I asked beceuse the sha1 da39a3ee5e6b4b0d3255bfef95601890afd80709 is the hash for an empty string. aka file not found or permission denied.
Bingo!
In ezyang/htmlpurifier/library/HTMLPurifier/ConfigSchema.php on line 75, it tries to get the file:
$contents = file_get_contents(HTMLPURIFIER_PREFIX . '/HTMLPurifier/ConfigSchema/schema.ser');
In my case the constant HTMLPURIFIER_PREFIX is empty (don't no why). If I set the correct path it works!
And for the toppic: The performance is much better, thank you!
@jschroed91 I think you can close this ticket :)
Hi,
First I would like to thank for your library. It seems to be the best solution for php at this moment. But I have one big problem with it: The HTML-Blocks I'm comparing have about 1000 words or 8000 chars (with tags). It works but it takes very long to process, sometimes about 2 minutes. I'm using PHP 5.6.
Are there any possibilities to enhance the performance?
thanks for any hint!
Kind regards, Oliver