[4.x]: Images are deleted from s3 potentially related to section resaving fail

joepagan commented 1 year ago

What happened?

Description

We have found an issue where s3 source images are being deleted, despite no request to do so. The assets still exist in the DB, as do all their relations.

This happened to us again yesterday on an app, a deployment for a new release occurred, and we changed a pre-existing transform's dimensions. The source image was deleted some time ago on May 17th, however, the transforms still existed so it appeared like nothing was wrong until a new release was deployed June 7th. The entire app & server died because it thought the source image exists and it's trying to create a new transform.

We previously thought this may have been because UAT & PROD environments shared the same s3 bucket, though this is no longer the case on a new env-specific bucket with new secure keys.

It looks like this scenario the following occurred on 17/05/2023:

11:11:40 - A new release was triggered and a new ECS task recently took over from the prior release. I can see the old ECS task die at exactly this time.
11:12:16.857620 - app.ERROR: [188149] t9n:["app","Resaving {section} entries",{"section":"Image Gallery"}] (attempt: 1, pid: 34) - Error: The process "'/usr/local/bin/php' '/var/www/html/cms/craft' 'queue/exec' '188149' '300' '1' '34' '--color='" exceeded the timeout of 300 seconds. {"trace":[],"memory":32699648,"category":"craft\\queue\\QueueLogBehavior::afterError","timestamp":1684318336.818001} []
11:12:16.857725 - app.ERROR: Symfony\Component\Process\Exception\ProcessTimedOutException: The process "'/usr/local/bin/php' '/var/www/html/cms/craft' 'queue/exec' '188149' '300' '1' '34' '--color='" exceeded the timeout of 300 seconds. in /var/www/html/cms/vendor/symfony/process/Process.php:1154
11:12:16.858749 - dupe request as above

The in s3 I can see several delete markers against images in s3 at around the same time, where this specific image deleted at May 17, 2023, 11:12:05 (UTC+01:00):

Appreciate this does show the deletion before, the task on the server seems to start, but I cannot see any other activity in our logs which would be related. Maybe the server and s3 logs aren't in sync?

I am not sure why exactly the Image Gallery section is resaving, this is not part of any process we have, not part of the deployment. Not sure what caused the resave task, do you know what this could have been?

I cannot see any delete request in made to s3 from the app in datadog either, so I strongly suspect this timeout is the cause.

Thankfully we have versioning turned ON for this bucket so we could delete the "delete marker" to restore them. It appears that roughly 10 images had gone this time around. I can confirm 7/10 were assets from the same entry, however, from the remaining 3 maybe 1 or 2 were from another entry (hard to say as was trying to bring the app back in a panic and didn't log 100%). They all appear to have a delete marker at approximately the same time.

Let me know if I can provide more info

Steps to reproduce

Deploy a new release
Something triggers a section to be resaved
section resave task times out
some s3 source images have been deleted

Expected behavior

Maybe the section doesn't need to be resaved at all?
Timeout on task shouldn't delete source images.

Actual behavior

timeout on task appears to delete s3 source images

Craft CMS version

"craftcms/cms": "4.3.6.1",

PHP version

php:8.1-fpm-alpine

Operating system and version

php:8.1-fpm-alpine

Database type and version

5.7.mysql_aurora.2.11.0

Image driver and version

No response

Installed plugins and versions

-

i-just commented 1 year ago

Hi, thanks for reaching out.

I am not sure why exactly the Image Gallery section is resaving, this is not part of any process we have, not part of the deployment. Not sure what caused the resave task, do you know what this could have been?

The Resaving {section} entries job is triggered when the section changes.

The assets still exist in the DB, as do all their relations.

And you are sure that it hasn’t been trashed (the asset will still exist in the DB, and the relations will still be there if the asset has been trashed but not permanently deleted)? Are you able to view the asset via Control Panel > Assets?

So the image you’re referring to was marked as deleted on 17/05/2023, and the deployment that triggered the section resave happened on 07/06/2023 - is that correct? Do you still have the logs from 17/05/2023? Is there anything there to suggest what caused the deletion?

joepagan commented 1 year ago

And you are sure that it hasn’t been trashed (the asset will still exist in the DB, and the relations will still be there if the asset has been trashed but not permanently deleted)? Are you able to view the asset via Control Panel > Assets?

Yes that is correct, the asset still existed in the database, hence the templates trying to create new transforms from it, but one could not view the source image because it had been deleted.

So the image you’re referring to was marked as deleted on 17/05/2023, and the deployment that triggered the section resave happened on 07/06/2023 - is that correct?

No sorry, the source image was marked as deleted 17/05/2023, approx the same time as a release on 17/05/2023, the bullet list above was referring to timings on 17/05/2023, sorry that should have been clearer. Updating original comment to be clear

The release on 07/06/2023 revealed the deletion on 17/05/2023 because new transforms were required but there was no source image to make the new transforms with.

Do you still have the logs from 17/05/2023? Is there anything there to suggest what caused the deletion?

Yes, the logs provided on my prior comment's list are all from 17/05/2023. The only thing I can see at that time of the deletion appears to be this section resave which did not complete, that task times out.

i-just commented 1 year ago

We have discussed this internally and could not reproduce such behaviour nor think of a scenario where Craft would delete a file but not the asset element.

We previously thought this may have been because UAT & PROD environments shared the same s3 bucket, though this is no longer the case on a new env-specific bucket with new secure keys.

Sharing an S3 bucket between environments could be the cause of such behaviour.

When were the UAT & Prod S3 buckets separated? Is Prod sharing the S3 bucket with any other environment(s)?

craftcms / cms