caxy / php-htmldiff

A library for comparing two HTML files/snippets and highlighting the differences using simple HTML. Includes support for comparing complex lists and tables
http://php-htmldiff.caxy.com
GNU General Public License v2.0
202 stars 51 forks source link

Performance #38

Closed oliverbeck closed 8 years ago

oliverbeck commented 8 years ago

Hi,

First I would like to thank for your library. It seems to be the best solution for php at this moment. But I have one big problem with it: The HTML-Blocks I'm comparing have about 1000 words or 8000 chars (with tags). It works but it takes very long to process, sometimes about 2 minutes. I'm using PHP 5.6.

Are there any possibilities to enhance the performance?

thanks for any hint!

Kind regards, Oliver

jschroed91 commented 8 years ago

@oliverbeck Thanks for opening this issue - performance is certainly a problem and there are a lot of inefficiencies in the code and also in the approach.

I know there are some improvements that could be made in the code - unfortunately, I haven't had much time to address them. I know, for example, diffing HTML tables is pretty expensive since it's now using DOMDocument.

There is one way to improve overall performance by using caching (as of v0.1.0). You can set a Cache Provider object on the HtmlDiffConfig object, and it will cache the generated diffs.

Caching the generated diffs will really only improve performance if you're running the same content through HtmlDiff multiple times... so you will still likely have issues, especially if your content is changing frequently.

An exception to that would be if your content has a lot of HTML tables and lists. The way HTML tables and lists are diffed, the content of table cells or list items are passed to HtmlDiff individually, so it would improve performance in situations where the oldText and newText of individual cells/ list items has been diffed before.

Basically... there are a lot of performance improvements that need to happen... and we just need to find the time to address them. We are totally open to taking contributions from the community as well, and I'm hoping we will start to generate more interest in that :smiley:

oliverbeck commented 8 years ago

@jschroed91 Thank you for your reply!

Can you shortly discribe how I have to set a cache object?

SavageTiger commented 8 years ago

Hi @oliverbeck can you try using the latest checkout?

oliverbeck commented 8 years ago

Hi @SavageTiger

If I update to the latest version (0.1.2 or dev-master) I get this PHP error:

Fatal error: Unserialization of configuration schema failed, sha1 of file was da39a3ee5e6b4b0d3255bfef95601890afd80709 in /path/to/project/vendor/ezyang/htmlpurifier/library/HTMLPurifier/ConfigSchema.php on line 75

SavageTiger commented 8 years ago

@oliverbeck That is weird, the problem seems to be that the HTMLPurifier cannot initialize its base configuration. I guess the best thing todo is delete the purifier dependency from the vendor folder and re-run composer install (or update). P.S. the performance inhancements are only present in the latest dev-master branch

oliverbeck commented 8 years ago

@SavageTiger So you think the htmlpurifier isn't necessary at all?

SavageTiger commented 8 years ago

@oliverbeck No it is, but if you re-run composer after deleting the ezlang/htmlpurifier folder it will reinstall the dependency.

oliverbeck commented 8 years ago

ah ok now I understand :-) I'll try it out

oliverbeck commented 8 years ago

Same error... and if I switch back to v0.1.1 it works well

SavageTiger commented 8 years ago

That is really weird, can you paste your calling code, maybe a parameter you are using triggers this issue. (also what PHP version are you using?). Also the sha1sum for my HTMLPurifier scherma.ser is 496944b8892f2e4125c31e4f84d8211cb0cfc531

oliverbeck commented 8 years ago

Here my code snippet:

$config = new HtmlDiffConfig(); $cacheProvider = new \Doctrine\Common\Cache\PhpFileCache(PATH_site . '/phpFileCache/'); $config->setCacheProvider($cacheProvider); $htmlDiff = HtmlDiff::create($old, $new, $config); $content = $htmlDiff->build();

as you can see, atm I use file cache. PHP version is 5.6.15

SavageTiger commented 8 years ago

Ok i'll try to investigate locally. My HTMLPurifier is version 4.7.0

oliverbeck commented 8 years ago

Just tried it out without CacheProvider, same error

SavageTiger commented 8 years ago

You might also want to compare my schema.ser to yours.

schema.ser.zip

oliverbeck commented 8 years ago

still the same, v0.1.1 works, latest dev don't =/

SavageTiger commented 8 years ago

@oliverbeck What happens if you replace your snipped for just

$htmlDiff = new HtmlDiff($old, $new, 'UTF-8', array()); $content = $htmlDiff->build();

oliverbeck commented 8 years ago

unfortunately the same

SavageTiger commented 8 years ago

Ok. I have not managed to replicate the issue locally. Thank you for taking the time and helping debug the issue.

SavageTiger commented 8 years ago

Quick question: Did you replace the path in /path/to/project/vendor/ezyang for security reasons as I initially presumed? Or does it actually say /path/to/project? In case of the ladder, I might be on to something.

oliverbeck commented 8 years ago

Yes, I replaced it for security reasons

SavageTiger commented 8 years ago

Ok, bummer. I asked beceuse the sha1 da39a3ee5e6b4b0d3255bfef95601890afd80709 is the hash for an empty string. aka file not found or permission denied.

oliverbeck commented 8 years ago

Bingo!

In ezyang/htmlpurifier/library/HTMLPurifier/ConfigSchema.php on line 75, it tries to get the file:

$contents = file_get_contents(HTMLPURIFIER_PREFIX . '/HTMLPurifier/ConfigSchema/schema.ser');

In my case the constant HTMLPURIFIER_PREFIX is empty (don't no why). If I set the correct path it works!

oliverbeck commented 8 years ago

And for the toppic: The performance is much better, thank you!

SavageTiger commented 8 years ago

@jschroed91 I think you can close this ticket :)