getkirby / kirby

Kirby's core application folder
https://getkirby.com
Other
1.24k stars 165 forks source link

IM driver crashing php process when creating many thumbs for the first time #5328

Open iskrisis opened 1 year ago

iskrisis commented 1 year ago

Description

I can't pin point out why or when but somewhere around v3.8 something about thumb creation changed. I know it from upgrading some image heavy sites to 3.8+. I have 4gig ram VPS and use im driver for most stuff to create avif/webp/jpg thumbs. It takes some time first time but it never was an issue. But with recent versions one can crash PHP process. Happened on both php-fpm and also clients Apache. Firefox starts to throw NS_BINDING_ABORTED erros for the images which i suspect is because error messages get served instead of image.

After asking @bastianallgeier he suspects it's https://github.com/getkirby/kirby/blob/6b5dda600472b59a3a779f7e6b8e9e130cc414a1/src/Cms/Media.php#L56C4-L56C4 change that loads/serves original files through PHP.

I've tried to setup im security policies really thoroughly and i don't think it makes much difference. The im processes from logs seems to go fine and it never crashes so i don't think it's a problem with im driver.

Curiously it doesn't seem to be a problem with gd but maybe thats just because gd can't create .avif which is the format that takes a long time to create.

I am not sure why this change has been made but i've managed to crash two live sites already when they launched. It's a bit nerve wrecking and really hard to track down. It would be really great if it worked as it did in the past.

iskrisis commented 1 year ago

Can i help this somehow? Like would it help if i created test repo with example that was crashing?

distantnative commented 1 year ago

@iskrisis that would help indeed

tobimori commented 1 year ago

I think I had experienced similiar issues in the past, but I couldn't pinpoint it (as they're always gone after the first load)

bastianallgeier commented 1 year ago

I agree. A test repo would be great. So far, I didn't run into this issue with our sandbox environments.

iskrisis commented 1 year ago

Playing with it whole afternoon. I am starting to think this is not fixable unless some architectural change would happen.

What i think is happening

When you have list page with enough images each of the thumb requests will spawn convert process iTerm2 - ploi@helpful-farm- ~-kirbyimagetest floriankarsten dev-media 2023-07-06 at 4 06 06 PM

Even when imagemagick has properly configured security policies (it will comfortably convert thumbs individually) what happens is that linux out of memory killer will kick in and at some point kills php-fpm too. This is pronounced longer the convert runs so with bigger files and .avif it's easier for this to happen. But it happens even with webp (and most likely would with jpg if pushed enough) The image conversion will fill up ram and swap and then it's matter of time.

Of course when you have server with enough ram then you won't hit this. Also i think it's a linux thing and might work differently on win/mac. Plus kinda impossible to make happen on developers machine.

Observations

Test

Repo here https://github.com/iskrisis/test-kirby-image-crasher but it's just plainkit with loop of some hires thumbs. Tested on 2GB ram hetzner vps with ubuntu with what i think are the most common settings. Imagemagick security policies were setup and i could create each image individually with cli directly.

Logs

php-fpm logs

Jul 06 13:48:02 helpful-farm systemd[1]: Started The PHP 8.1 FastCGI Process Manager.
Jul 06 13:48:41 helpful-farm systemd[1]: php8.1-fpm.service: A process of this unit has been killed by the OOM killer.
Jul 06 13:48:42 helpful-farm systemd[1]: php8.1-fpm.service: Failed with result 'oom-kill'.
Jul 06 13:48:42 helpful-farm systemd[1]: php8.1-fpm.service: Consumed 22.965s CPU time.

digging why it was oom-killed - process convert (imagemagick)

[Thu Jul  6 13:13:54 2023] Out of memory: Killed process 46637 (convert) total-vm:1259896kB, anon-rss:327376kB, file-rss:1988kB, shmem-rss:0kB, UID:1000 pgtables:1924kB oom_score_adj:0
[Thu Jul  6 13:48:40 2023] Out of memory: Killed process 47353 (convert) total-vm:779280kB, anon-rss:381476kB, file-rss:1960kB, shmem-rss:0kB, UID:1000 pgtables:1028kB oom_score_adj:0

My takeaway

I didn't hit on this in past just because of luck we have 4gig ram server and i've never hit the right kind of load. This doesn't seem to happen with GD probably because it is extension not driver that runs in cli. I am thinking i should really try to write driver for vips php extension because this might always.

lukasbestle commented 1 year ago

We could solve this with a semaphore (PHP sem_acquire()). This function does not exist in all PHP installations, but we could check if it does and treat it as progressive enhancement.

It would work something like this:

bastianallgeier commented 1 year ago

@lukasbestle I never heard about it before, but if this could help to solve the problem, I'm all in. We could also think about adding an imagick driver to instead of im. The PHP extension would also probably be a safer option in case it exists.

iskrisis commented 1 year ago

I have no idea why this doesn't happen with php-gd but if its the case also for the extension version of imagemagick than that would probably be easier to make and more supported? Cpanel and most shared hostings i've seen have the extension either on by default or under option from gui. Ploi also has it as option. Seems like semaphore is not only that well supported but also pretty complex.

iskrisis commented 1 year ago

It's really pitty SimpleImage doesn't support more extensions. https://github.com/claviska/SimpleImage/issues/308 It seems to me like this wouldn't have to be that much different from php-gd driver.

For example https://github.com/Intervention/image wraps both gd and imagemagick (even vips with additional driver). I understand why you are hesitant to include these big dependencies... maybe Kirby core should keep it gd only and it should be plugin for Intervention that supports extensions.

lukasbestle commented 1 year ago

I think it would be worth trying the Imagick extension in a plugin. If it doesn't show this behavior, is more stable but still works well for everything we need, this could be a good option.

Semaphores are not that complex, but not many users would benefit from them.

bastianallgeier commented 1 year ago

TBH, I wouldn't mind to move away from SimpleImage. We use it forever, but I'm sure it's not the best image library out there.

iskrisis commented 7 months ago

@adamkiss has made prototype with Kirby Intervention driver but it doesn't seem to solve many issues. It's slow. I've also managed to crash php-gd and thus i asum even php-im crashes. Even Vips thats order of magnitude faster is most likely in same boat (if you push it hard enough).

It doesn't solve underlying issue that if you fire too many thumb tasks you might crash php-fpm. Especially on VPSes with smaller memory.

Not sure what's the solution here. Some kind of throttle. I am not sure about semaphore @lukasbestle mentioned before. Other solution would be some kind of optional queue for .jobs with cron - thats how i've seen other projects to handle. There must be some settings to protect the processes on linux side but i couldn't find the correct ones.

But for me this is currently quite a big painpoint. We stopped using .avif and i am thinking about basically solving it by throwing everything on one big server that will survive a lot bigger bursts.

lukasbestle commented 7 months ago

The issue with Semaphores would be that all additional calls beyond the "ImageMagick concurrency limit" would either block (= make the PHP-FPM worker and request hang) or fail (= don't return the requested thumb).

I like the idea of a job queue. Basically an independent process running on the server that processes the incoming jobs. Maybe that could be implemented as a plugin for the Kirby CLI? If it helps with handling large numbers of thumbs, we could then think about integrating it into the core somehow.

tobimori commented 7 months ago

A core-supported Job/queue system would be very well appreciated. I could definitely make use of this for other plugins.

iskrisis commented 6 months ago

Kirby already uses .jobs folder for whitelisting the thumb creation. Is kinda similar to synchronous queue.

Worker for these .jobs would be straighforward BUT the problem is i dont want every whitelisted thumb to be created i want only those that are actualy requested by clients. Srcsets etc are very broad and it would be very wasteful to create every possible thub. On the other hand being explicit about thumbs is very hard and restricting with all the new filedormats and resolutions.

Maybe some kind of lock/semaphore/process manager for tracking number of thumb processes currently working with optional limit setting is solution fitting Kirby better.

Imho i am ok with returning placeholder image (or something) if the process limit is reached. This is rare situation on first visits - people will refresh or whatever next time the images are there. One could probably even very crude way sleep() and retry in the request.

Its much much better solution than crashing webserver along with all websites on it.

lukasbestle commented 6 months ago

The advantage with semaphores would be that they would "just work" if they are available. Versus a queue that first needs to be set up with a cronjob or daemon process.

Maybe we could do something like "if there are no more than x running convert processes, process the thumb, otherwise wait for half a second and try again, otherwise return a placeholder image". Do you have an idea what the placeholder image could be or how it could work?

In any case, before we go for that route, we would first need to check how common the semaphore feature is in the wild. Wouldn't be worth it if only a fraction of PHP installations had it.

iskrisis commented 6 months ago

Placeholder - for me even 1px gray image would be fine? But i think first request would this be to have this customizable. Which is probably just matter of saying the image is always /media/_placeholder.jpg or something.

About support i've very quickly tried some hostings using

<?php

if (extension_loaded('sysvsem')) {
    echo "sysvsem/semaphores loaded";
} else {
    echo "sysvsem/semaphores unavailable ";
}
?>

To my surprise most everything i have access to supports it out of the box. Ploi includes them by default (no need to turn on the extension). Uberspace also have it on by default. Two cpanel based shared hostings have them available as extensions and were turned on by default. Out of the 5 hostings i've tried only wierd Czech one doesn't have it on by default (and they don't have gui to turn extensions on).

I thought i will have tu turn on the extension but it seems to be one of the more common ones. Laravel Valet/Herd also have it available by default.

lukasbestle commented 6 months ago

That's great, thanks for testing!

ovenum commented 1 month ago

Can confirm @iskrisis observations that on image heavy sites with many visitors oom-kill will be summoned when lots of convert process are spawned for source set generation. Even worse if the original image has a large file size.

This can eventually take a whole site offline if oom-kill decides it’s time to terminate the php-fpm process and it is not setup to automatically restart.

nilshoerrmann commented 1 month ago

We seem to encounter similar issue, but with GD.

@iskrisis: What have you done do work around this issue?

ovenum commented 3 weeks ago

We seem to encounter similar issue, but with GD.

@iskrisis: What have you done do work around this issue?

@nilshoerrmann you might be already aware of this, but leaving this here for others that require a quick fix.

You could use a hook to generate srcsets when images are uploaded to the panel. By this you can limit the amount of concurrent image creations taking place on the server.

The problem with this is that when the modified timestamp of the page the image(s) belong to changes Kirby will re-run the srcset creation on request. Just caused a couple of 503s when changing page numbers…