cnj-digital / seotamic

A Statamic v4 SEO addon
21 stars 12 forks source link

memory allocation problem during the sitemap generation #52

Closed beliven-fabrizio-gortani closed 9 months ago

beliven-fabrizio-gortani commented 1 year ago

We purchased the SEOtamic PRO license to be able to generate the sitemap on our website.

When attempting to access the /sitemap.xml, PHP FastCGI throws a 500 error with the following message:

2023/10/24 13:23:52 [error] 4171854#4171854: *65937 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in 

/home/myuser/mydomain.com/vendor/symfony/yaml/Parser.php on line 107PHP message: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 32768 bytes) in 

/home/myuser/mydomain.com/vendor/symfony/error-handler/Error/FatalError.php on line 1" while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: mydomain.com, request: "GET /sitemap.xml HTTP/2.0", upstream: "fastcgi://unix:/var/run/php/php8.1-fpm-myuser.sock:", host: "mydomain.com"

We resolved this issue by increasing the memory limit. However, I also noticed that there isn't any caching logic to store the sitemap content.

Each request to sitemap.xml triggers the regeneration of the sitemap.

The final sitemap.xml size is 448Kb.

Is there a feature available to save the sitemap.xml into a file, thus avoiding regeneration with each request?

martink635 commented 1 year ago

Thanks for opening this.

is this website large? Lots of entries/collections? Since the output is almost half a megabyte it seems so 🙂

What would be your preferred option here?

Would it be enough to have a config value, where you can specify for how long to cache the sitemap? Clearing the Statamic cache manually would also clear the sitemap cache.

A better option would be to listen to entries events (create/delete/change), but that needs a bit more thought and time to execute perfectly.

beliven-fabrizio-gortani commented 1 year ago

Thanks for reply.

I fully agree with the second proposal you made, and I understand that a complete integration involving Statamic entities could be quite complex and time-consuming to test.

An initial alternative, if deemed simpler, could be:

During the generation phase, the system creates the sitemap and saves it to a file named sitemap.xml in the public folder. Requests to retrieve the sitemap.xml file are all handled by the reverse proxy (nginx) without passing through PHP. A button is added that allows the user to reload the sitemap. Pressing the button invokes the generation script. With this implementation, the user essentially delegates the sitemap generation. Clearly, this is not the most robust solution, as many content creators who create pages may not be familiar with the concept of a sitemap. To overcome this issue, you could expose an Artisan command that triggers sitemap generation through a scheduler.

martink635 commented 1 year ago

I think I found an elegant and simple solution, it's on the latest commit in the master branch (https://github.com/cnj-digital/seotamic/commit/57adb8db8ae5949c8d18b0b08e83cebff4e9adf6)

It caches the sitemap forever. I moved the loops into the cache part so it loads instantly when it's cached. Also the cache key is reset every time you create/delete/edit an entry. If we need to act on more Statamic events, we can simply add them in the subscriber.

This is still not the most performant solution, since on multi-sites it loops through the same content multiple times. But this is beyond this fix at this moment. If you want to dig deeper, this is probably the culprit: https://github.com/cnj-digital/seotamic/blob/57adb8db8ae5949c8d18b0b08e83cebff4e9adf6/src/Http/Controllers/SitemapController.php#L32 Feel free to open a PR to make it faster.

I will tag a new version in a couple of hours as I finish a different feature.

beliven-fabrizio-gortani commented 1 year ago

Thanks! Do you think this fix could also be released for version 3 of Seotamic? We are currently using that version with Statamic 3.3. Alternatively, do you think we can upgrade from version 3 to 4 without losing data?

martink635 commented 1 year ago

You can upgrade, but you also have to upgrade the Statamic version to 4.

I strongly suggest upgrading sites to keep up with the updates, but I understand this is not always possible.

I've backported the fix to the v3 version, currently in the v3 branch: https://github.com/cnj-digital/seotamic/tree/v3

I am unable to test this at the moment. Can you please check this out and confirm that it works. If so I will release a new version of v3 with the sitemap update.

martink635 commented 1 year ago

@beliven-fabrizio-gortani Did you manage to test this, so I can merge it in the v3 branch?

beliven-fabrizio-gortani commented 1 year ago

Hi. Sorry for the late reply. Today, I scheduled the task to one of my team members. As soon as they finish the test, I'll update you. I hope to give you feedback by the end of the week!

beliven-fabrizio-gortani commented 1 year ago

Hi @martink635 We have perform a first test: the results seems strange:

These are the stats after updates (V3 branch): total size of file xml => 80k total time to download => 0.2s

These are the stats before updates (latest v3 stable): total size of file xml => 432k total time to download => 11s

These stats are the cURL stats generated using the command curl -O https://...

It seems that download times have drastically reduced. However, it also seems that the sitemap has lost size. In the next few days, I will delve deeper to understand if it has removed/ignored some entries.

martink635 commented 1 year ago

Thanks for the feedback, let me know if there's anything missing from the output.

beliven-fabrizio-gortani commented 1 year ago

Ok @martink635 , I've compared the two files.

this is the old one.. (just the first url)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://domain.dev/benessere-donna</loc>
        <lastmod>2023-07-05T14:34:57+02:00</lastmod>

        <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="default" href="https://domain.dev/benessere-donna" />

        <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="english" href="https://domain.dev/en/woman-health" />

        <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="spanish" href="https://domain.dev/es/bienestar-mujer" />

        <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="german" href="https://domain.dev/de/frauenhilfe" />

        <xhtml:link xmlns:xhtml="http://www.w3.org/1999/xhtml" rel="alternate" hreflang="french" href="https://domain.dev/fr/bien-etre-des-femmes" />

    </url>

the new one instead

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

    <url>
        <loc>https://domain.dev/benessere-donna</loc>
        <lastmod>2023-07-05T14:34:57+02:00</lastmod>
    </url>

In the new sitemap the languages alternate urls seem missing.

martink635 commented 1 year ago

It works fine on v4. Not sure if it's a new Statamic API that was not supported (I think not). I do not have a v3 site to test this on. Would you mind sharing the repo access of a simpler reproduction (or the whole thing)?

Otherwise you can debug/inspect it yourself here: https://github.com/cnj-digital/seotamic/blob/01a30d8b41dc357b26157d4aa58726d77dbaf775/src/Http/Controllers/SitemapController.php#L32

The whole block is cached a the top, don't forget to change rememberForever to remember, and setting it to a 1 second or something similar.

martink635 commented 9 months ago

I am closing this, since there was no reply… Feel free to reopen it with new information.