Large Website "Rebuilding" Improvements

HOFB2015 commented 8 years ago

I have a large Grav site with 5000 pages. Once the caches are built the site is running fine without any issues... I deploy to my site daily with updates... and the caches always need to be cleared for pages to pickup the changes. This can take forever in some cases and the website shows timeout errors whilst this is rebuilding.

BarinsGhost commented 8 years ago

Would also like to know the answer to this question.

mewcrazy commented 8 years ago

What I noticed with one of my large websites (15000 pages), it showed me 404's for the first access on a page, then it was cached and all future accesses were fine. Not sure where the problem was, but my other small Grav sites are not acting the same.

BarinsGhost commented 8 years ago

mewcrazy, mind linking your sites here? I would like to take a look at them, as I am getting myself familiar with Grav and want to see what is possible and others are doing with it.

rhukster commented 8 years ago

Just a note that I asked @HOFB2015 to post this issue as I have some ideas that could help these large sites. They are quite far-reaching, and will involve some significant rework and testing, so I just wanted this issue created so I don't forget about it.

hwmaier commented 8 years ago

I am running a site with about 200 pages and fiddled also around to get page updates working faster. The first significant improvement was within the PHP settings. As Grav is scanning a lot of files, changing the php.ini setting realpath_cache_size from default 16K to 1M gave my site a big performance boost. The load time of grav (caching disabled) changes form 1.2s to 0.7s, so this is 40% faster.

The second significant improvement on the same site was changing the YAML parser from Symfony/Yaml to PECL Yaml which is a native parser. I installed this as an PHP PECL extension on my server and hacked the Symfony Yaml libray to use PECL instead. Now script execution time is only 0.35s another 50% improvement.

Result, those two changes slashed execution time from 1.2s down to 0.35s, a quarter!

This is the hack I applied to vendor\symfony\yaml\Yaml.php, just replace function parse() with the following code, changes are marked with HM:

    public static function parse($input, $exceptionOnInvalidType = false, $objectSupport = false, $objectForMap = false)
    {
        // if input is a file, process it
        $file = '';
        if (strpos($input, "\n") === false && is_file($input)) {
            @trigger_error('The ability to pass file names to the '.__METHOD__.' method is deprecated since version 2.2 and will be removed in 3.0. Pass the YAML contents of the file instead.', E_USER_DEPRECATED);

            if (false === is_readable($input)) {
                throw new ParseException(sprintf('Unable to parse "%s" as the file is not readable.', $input));
            }

            $file = $input;
            $input = file_get_contents($file);
//HM+            
            return yaml_parse_file($input);
//HM-            
         }
//HM+
// Check if PECL Yaml is installed and use that instead of Symfony Yaml
if (function_exists('yaml_parse')) {
    $data = preg_replace("/ (@[\w\.\-]*)/", " '\${1}'", $input); // Fix illegal @ start character issue
    $data = @yaml_parse("---\n" . $data . "\n...");     
    if ($data) {
        return $data;
    } else {
        // Otherwise fallback to Symfony parser
        // for the admin translations which have a leading % for some entries
    }
}
//HM-
        $yaml = new Parser();

        try {
            return $yaml->parse($input, $exceptionOnInvalidType, $objectSupport, $objectForMap);
        } catch (ParseException $e) {
            if ($file) {
                $e->setParsedFile($file);
            }

            throw $e;
        }
    }

In your php.ini file add those lines:

; YAML PECL extension
extension=php_yaml.dll

;;realpath_cache_size = 16k
realpath_cache_size =  1M

I would be curious to see the impact on your 5000+ site.

hwmaier commented 8 years ago

Further on this topic, using the Dipper YAML parser is also a great improvement over Symfony/YAML. Using the same test site as above, using Dipper parses the site tree in 43s. So much faster then Symfony but as fast a the native PECL Yaml parser. Compared to the PECL' Yaml solution Dipper does not bark at leading @ and % characters which would make it a nice choice for Grav I think.

hwmaier commented 8 years ago

Sorry, I must revise my initial excitement about Dipper. While it is fast, its not a robust parser and completely breaks Grav's admin plugin and quite a few other pages.

This Yaml

related:
    []

becomes with symfony/Yaml correctly:

[related] => Array
    (
    )

and with Dipper wrongly:

[related] => Array
    (
        [0] =>
    )

Dipper also fails miserably for all the form definitions (eg admin plugin):

fields:
    - name: name
      label: Name
      placeholder: Enter your name
      autofocus: on
      autocomplete: on
      type: text
      validate:
        required: true
        pattern: ".{5,10}"
        title: 5 to 10 characters

becomes:

[fields] => Array
    (
        [0] => Array
            (
                [name] => Array
                    (
                        [namelabel] => Array
                            (
                                [Nameplaceholder] => Array
                                    (
                                        [Enter your nameautofocus] => Array
                                            (
                                                [onautocomplete] => Array
                                                    (
                                                        [ontype] => textvalidate:
                                                    )

                                            )

                                    )

                            )

                    )

                [required] => 1
                [pattern] => .{5,10}
                [title] => 5 to 10 characters
            )

    )

instead of

[fields] => Array
    (
        [0] => Array
            (
                [name] => name
                [label] => Name
                [placeholder] => Enter your name
                [autofocus] => 1
                [autocomplete] => 1
                [type] => text
                [validate] => Array
                    (
                        [required] => 1
                        [pattern] => .{5,10}
                        [title] => 5 to 10 characters
                    )

            )

    )

While Grav's Yaml files could be re-factored to be parsable by Dipper, I wonder if this would be the correct approach. The carrot is the speed though...

hwmaier commented 8 years ago

The best speed improvement I gained is still the PECL Yaml parser which appears to be more popular than the PECL syck parser. syck has not been maintained since 2008 and is still only Yaml 1.0. PECL Yaml is compliant with Yaml 1.1 at least and was last updated in 2015, so its an active project.

I think that one would be a viable fall-back option for servers which have this extension installed. A pre-compiled binary for WAMP is available and even my shared web hosting account allows to install this PECL extension.

There are only two issues, the % and @ leading characters in Grav. The % character could be fixed by refactoring the language.yaml file and wrapping the concerned lines with quotes. The @ issue could be fixed by a simple preg_replace("/ (@[\w\.\-]*)/", " '\${1}'", $input); before parsing the yaml.

I added this to a pull request here: https://github.com/rockettheme/toolbox/pull/3

rhukster commented 8 years ago

Thanks for the research on this. Three questions for you:

1) How much of a performance slowdown is the preg_replace() adding? 2) It might be less of a hit if that check was done in the Page object on the header rather than in toolbox as that is the only place the @ syntax is used. 3) if #2 is quite a hit, maybe a simple strpos(': @') check could be done first to see if the regex is even required?

hwmaier commented 8 years ago

1) In my test environment I could not measure any noticeable difference between preg_replace and str_replace. Before using the regex I had something like this:

$data = str_replace(['@self.children', '@self.modular', '@page', '@taxonomy'],
                    ["'@self.children'", "'@self.modular'", "'@page'", "'@taxonomy'"], $input);

but then thought the regex is more flexible in case Grav comes up with more commands. 2) Toolbox uses the @ for @data-options: '\Grav\Common\Utils::timezones' but this could be changed in the core files with quotes to '@data-options': '\Grav\Common\Utils::timezones' 3) I think a simple simple strpos(' @') test is a good idea before applying the regex.

From a profiling perspective, the Yaml parsing and the directory recursion are the two biggest time-eaters, with Yaml parsing the biggest contributor.

hwmaier commented 8 years ago

I did some more profiling, just out of interest. Parsing of 10000 Yaml headers:

no str_replace/preg_replace: 0.35ms strpos guard and nothing to replace: 0.39ms preg_replace and nothing to replace: 0.40ms str_replace and nothing to replace: 0.41ms str_replace and 2 replacements: 0.42ms str_replace and strpos guard w/ 2 replacements: 0.44ms preg_replace and 2 replacements: 0.45ms preg_replace and strpos guard w/ 2 replacements: 0.47ms

I think the benefit of str_replace vs preg_replace is minimal. Even the benefit of the strpos guard would depend on the ratio of pages with @ in header and without.

mewcrazy commented 8 years ago

I've stopped using GravCMS for my site with 15000 products. But if you want I can zip it up and share it. But its basically not much more than a blog page with 15000 subdirectories and an item.md inside with a title, a slug, a publish date and some 'Lorem Ipsum' text.

mahagr commented 8 years ago

@mewcrazy Please share your site as it allows us to have "real" test data when making performance optimizations. I have some ideas on how to improve performance a lot in huge sites, but it has been really hard to test out those ideas without having the proper dataset.

@hwmaier I agree that there's not much point of having strpos guard. Though its really tempting to get that 30% speed increase during parsing by getting rid of the bad yaml input...

rhukster commented 8 years ago

Thanks for all your great research and information! We are going to definitely look at this in more detail.

mahagr commented 8 years ago

Alright, I think we have a good plan now. I'm going to merge the change to toolbox and add some options that allows us to get most out of it.

hwmaier commented 8 years ago

Is is correct to say that Grav has evolved already too far that changing from @ to something different is impossible? The @self.modular is almost a signature concept now of Grav isn't it?

hwmaier commented 8 years ago

One improvement over the always-rebuild-complete-tree-if-something-has-changed could be not to rebuild the whole tree if only page content was updated but the header remained untouched.

Assuming most changes are content edits rather structural changes, this would help already a lot for maintaining larger sites.

rhukster commented 8 years ago

Yes the @self, @modular, @taxonomy, etc concepts are pretty ingrained at this point, and it's not even something internal, it's something that users will have in their pages. This means that it's not easily changed.

We have some ideas though. We are going to have the regex be a toggle, so if you know your site is safe, you can disable the regex cleanup routine. We will also be able to fix this when saving pages in the admin. We have also talked about having a diagnostic page in the admin that could even 'fix' all the instances found.

Lots of options. Also we can of course update all teh skeletons and other things so that going forward, everything is wrapped in quotes and is safe. At some point in the future we could change the default behavior of the regex to 'off' and either log an error, and/or auto fallback to symfony and log that is happening.

rhukster commented 8 years ago

Regarding your idea of the complete tree rebuild, by idea was similar, but slightly different.

First there are three considerations:

1) The pages are checked for modifications. This happens based on latest timestamp, so the pages are recursed and the latest timestamp is found, if this timestamp is different from the one expected, page headers are re-processed. Note, we don't look in the files themselves at this point for performance reasons. We also have file level and folder level checking, as well as no checking.

2) if re-processing, all the pages are recursed and page headers are reprocessed.

3) Content is not cached until ->content() is called for that particular page. It is assumed that any change to the page will modify the timestamp and trigger a processing to clear the cache.

Now #1 is quite fast, but can slow down for bigger sites with lots of pages, and I recommend that people with those large sites set this check to none and manually clear cache when needed. If you can think of a faster more efficient way to do this, i'm all ears.

Regarding #2, this is clearly not very efficient, but is generally not a big deal on smaller sites, as the results are cached. Also this is very 'safe' because any little thing will cause a fresh cache so you don't get any weird states.

What I am thinking, and maybe as an optional mode (for larger sites). is to have a differential cache update. This would modify both #1 and #2 so that every page has a timestamp associated with it, and the #1 part where pages are checked, the list of modified pages is created. Then when it reprocesses, only the modified files would get reprocessed and recached. rather than the whole tree.

This would obviously reduce overhead as modifying one page would have minimal recaching overhead.

The other thing to consider is related to configuration. Configuration can have a direct impact for pages, as often it is taken into account in the page logic. I would need to really delve into that again to see if any of that is relevant at processing time or not. If it does, then any config page would need to trigger all the pages to be reprocessed again. if it's at runtime only, then only the config portions of the cache would need to be reprocessed.

As you can see this does make things quite a bit more complex and there are probably scenarios i've not even considered yet :)

hwmaier commented 8 years ago

Honestly, please do not focus on the regex issue. I don't think ts worth any effort.

It is a minimal overhead and beyond the measurement noise. For a site with 10000 pages the measured overhead is 50 milliseconds only for the hole site! In real world scenarios the impact is even less as not all pages will have the @ entry.

I also would not implement a regex toogle or diagnostic info about it and confuse the user with a total internal issue which can be hidden.

And be aware that probably 90% of users will not even install the PHP Yaml extension anyway because they are on shared hosts (some allow PECL installs, some don't) , so those will use symfony/Yaml anyway.

The optimising focus should be on deciding which pages need updates and on a scheme to do partial tree updates, the way you described it in previous post.

rhukster commented 8 years ago

Fair points, however toolbox could be used for other things that would never have a need for this regex so no point having the regex for those. We've already added a config option that you can use to toggle this as needed.

That said, if the difference really is negligable, Grav will probably just set this option to ensure the regex is enabled.

hwmaier commented 8 years ago

Or one factors out the Yaml::parse routine into a separate adapter class which can have a Grav specific incarnation and a generic incarnation so toolbox remains untainted with Grav specifics.

It is also sad that Dipper is not up to the job as it looked promising. I wonder if it would be worthwhile to submit a bugreport hoping that the maintainers would look into it.

rhukster commented 8 years ago

I would say if it looks like Dipper is still being updated/maintained, then yes, would be nice to get them to fix their bugs so we could use it by default.

hwmaier commented 8 years ago

I opened a ticket https://github.com/secondparty/dipper/issues/9. Let's see whats happening.

rhukster commented 8 years ago

Nice, i'm watching it too. Thanks!

rhukster commented 8 years ago

Doesn't bode well though that the author has not made a single commit in GitHub since February! https://github.com/fredleblanc

nazwa commented 8 years ago

@mewcrazy purely out of interest, what type of site do you run with so many pages? Are you happy with Grav over alternatives for such big project? Mainly in terms of management?

mewcrazy commented 8 years ago

Hey, Sorry for the late answer. The page I was talking about was a blog with 17.000 blog posts. Managing those via your GUI at that time was impossible. Especially things like setting a home page in the admin was a pain in the ass. I had to do it manually, because the GUI only gave me a select box/dropdown, but my browser had troubles rendering it. No wonder, since it has 17.000

And as far as I remember I noticed a big performance downfall with that much entries. Especially if I tried to visit a page that wasn't cached yet. But for small and medium sites GravCMS is super nice.

For me as a developer I didn't really like the fact that I literally could use no php at all in your template files. I ended up using a custom Twig-Plugin which allowed me to execute php functions like this: {php functionname()} or so. So every output of mine required it's own function. This was too much trouble and used too much time, so that I ended up coding my own CMS and for another site at that time I used WordPress.

But I'm sure I will try out again some day, hopefully you give developers some easy way to write their own php. I mean does anyone really like Twig? There is so much more that users need, starting with tables, floating images. lists, unordered lists etc. - All that is not possible with twig. - Or how about if users could chose to disable twig if they really don't wanna use it?

rhukster commented 8 years ago

I personally really like Twig. I think others do also as it's the template engine of choice for most new CMSes and PHP frameworks (Drupal 8, October CMS, Bolt CMS, Symfony,etc). It's widely supported, and more importantly a lot of people using Grav already know it!

PHP is all fine for power users, but it's a nightmare for front end developers to deal with and also it allow for more security issues.

The Great thing about Twig is that you can do as you did, and create a simple plugin to do whatever you need. I can create a plugin to provide any function (even pass throughs) that I could want. I even wrote a Twig extension that lets me output tabs and image sliders. It's really not that hard.

A lot of things you mention are not really intended for templates, they are content related and should be handled via that. We already have a 'shortcode' plugin in the GPM for that, and also with the next release you will be able to add plugins that extend Parsedown too. So that's going to add a lot of powerful functionality.

Thanks for your insights though!

mahagr commented 8 years ago

I have to agree that after messing with php in template files for years, twig feels much better once you have given it a change. With correct usage of twig you can easily get rid of all those XSS issues, which are so common (even if you get it right, there's always someone else who gets it wrong) and its cleaner looking than PHP. Its not meant to replace real code, but hey, you shouldn't be coding in template files anyway. :)

getgrav / grav

Large Website "Rebuilding" Improvements #343