MaxServ / t3ext-fal_s3

S3 driver for FAL
GNU General Public License v2.0

Using S3Client->copyObject leads to inconsistent stream wrapper cache #49

Open maechler opened 5 years ago

maechler commented 5 years ago

We recently observed an issue where some processed files would not show up (403 Forbidden) in the frontend because they no longer existed in the S3 bucket. Clearing the frontend cache resolved the issue; however, files kept disappearing afterwards.

We tracked the issue down to a cache inconsistency caused by $client->copyObject. When an object is copied using $client->copyObject, its modified time in the S3 bucket gets updated. However, the registered stream wrapper cache MaxServ\FalS3\Driver\Cache is not cleared for that object. Because the modified time is part of the checksum used for processed files, this can lead to the observed behaviour.

The underlying problem lies within the AWS SDK library. I already created an issue describing the details: https://github.com/aws/aws-sdk-php/issues/1758

However, I do not know how fast (if at all) this will be solved. Possible workarounds for the FAL S3 extension could be:

1) Replace $client->copyObject with copy(). I have not tested this, but I would hope that the stream wrapper cache gets cleared when using copy().
2) Store a reference to MaxServ\FalS3\Driver\Cache and clear the cache manually for the corresponding object after $client->copyObject (see the sketch below).
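A minimal sketch of what I mean by workaround 2 (names like $cache, $cacheKey and $copyConfiguration are hypothetical; the actual wiring in the driver may differ):

```php
// Hypothetical sketch of workaround 2: drop the stream wrapper cache
// entry for the copied object right after copyObject(), so the stale
// modified time is no longer served from cache.
$client->copyObject($copyConfiguration);
$cache->remove($cacheKey); // assumed reference to MaxServ\FalS3\Driver\Cache
```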

I would be happy to create a pull request to solve this issue. However, I do not know what exactly the call to $client->copyObject is used for. I would really appreciate it if you could have a look at this issue.

https://github.com/MaxServ/t3ext-fal_s3/blob/d30a59949163422240a44245c960067c0a0d4857/Classes/CacheControl/RemoteObjectUpdater.php#L152-L160

maechler commented 5 years ago

@arnoschoon Could you please have a look at this and https://github.com/aws/aws-sdk-php/issues/1758?

maechler commented 5 years ago

@helhum @rvollebregt @haassie Is there anybody out there?

macjohnny commented 5 years ago

any update here?

arnoschoon commented 5 years ago

Hi Markus and Esteban,

the reason we implemented copyObject in the first place is to update the object's metadata (CacheControl in this case). As far as I'm aware, the stream wrapper doesn't support passing metadata along, let alone modifying an object without actually touching the body (e.g. without downloading/uploading the whole thing).
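For context, such a metadata-updating self-copy looks roughly like this with the AWS SDK (the values here are illustrative; the extension's actual call is in the RemoteObjectUpdater lines linked above):

```php
// Copying an object onto itself with MetadataDirective=REPLACE is the
// documented S3 way to change metadata such as CacheControl, but it also
// resets the object's LastModified timestamp.
$client->copyObject([
    'Bucket'            => $bucket,
    'Key'               => $key,
    'CopySource'        => $bucket . '/' . $key,
    'CacheControl'      => 'max-age=86400', // illustrative directive
    'MetadataDirective' => 'REPLACE',
]);
```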

https://github.com/aws/aws-sdk-js/issues/1092 might be related, or at least explains a bit of the rationale for using this method.

I think your suggestion about clearing the cache manually is actually the only option, because this updating is done asynchronously: TYPO3 fails (or at least failed while we were developing this) to trigger an event when creating a processed file (after scaling it, for example). Processed files are one of the main reasons we want to set Cache-Control headers. Besides that, it is now possible to modify the settings through FAL indexing, so it doesn't hurt the user's experience when working with TYPO3.

We might consider setting patterns for cache-control behaviors in CloudFront, but that forces us to manage this outside of TYPO3 and possibly limits the fine-grained control (per file type) we have now.

That being said, I think we can state that this problem isn't caused by the underlying library, at least in my opinion.

Are the files on your site that fail to show up processed files? In that case I understand why the files are not created, due to a cache entry. If you're talking about "actual" files, it's more confusing why TYPO3 pages still hold any reference to them (FAL should have taken care of that already in that case).

arnoschoon commented 5 years ago

Hi @maechler,

just thinking this through: you might try clearstatcache() (https://www.php.net/manual/en/function.clearstatcache.php) instead of keeping a reference to MaxServ\FalS3\Driver\Cache.

It's been a while, but the full stream path seems to be consistent across requests/cron jobs (https://github.com/MaxServ/t3ext-fal_s3/blob/master/Classes/Driver/AmazonS3Driver.php#L168), even though a cryptic prefix is used.
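A minimal sketch of the idea (the path format here is an assumption; the real, prefixed path comes from the linked code):

```php
// Clear PHP's stat cache for the object's stream wrapper path after the
// metadata update, instead of keeping a reference to Driver\Cache.
$streamWrapperPath = $registeredProtocol . '://' . $bucket . '/' . $key; // assumed format
clearstatcache(true, $streamWrapperPath);
```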

maechler commented 5 years ago

@arnoschoon Thanks for your feedback on this!

Yes, the files that fail to show up are processed files. I have to admit I have not yet completely understood how to reproduce this error.

I have found out the following:

  1. Some processed images do not show up in the frontend. The images are also not present in the S3 bucket.
  2. Clearing the frontend cache solves the issue.
  3. The only thing that changed about an image that did not show up before clearing the cache is the checksum: [filename]_[checksum].[extension]
  4. The checksum is made up of the file uid, file type, file name, modified time and the processing configuration (see the sketch after this list).
  5. Images that do not show up in the frontend used a wrong modified time for the checksum calculation.
  6. The modified time of the file in the S3 bucket and in the fal_s3 cache are not the same. That is how I came across this cache inconsistency issue.
  7. Deactivating the scheduler task Update storage index has mostly solved the issue for us, probably because that task led to a lot of copyObject calls. However, the underlying cache inconsistency problem remains.
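To illustrate points 3 to 5, a simplified sketch of how such a checksum comes about (this is not TYPO3's exact implementation; see ProcessedFile::calculateChecksum() for that):

```php
// Simplified illustration: the checksum in the processed file name
// depends on the modified time, so a stale cached mtime produces a file
// name that differs from the file that actually exists in the bucket.
$checksum = substr(md5(implode('|', [
    $fileUid,
    $fileType,
    $fileName,
    $modificationTime, // the value that got out of sync
    serialize($processingConfiguration),
])), 0, 10);
$processedFileName = $fileBaseName . '_' . $checksum . '.' . $extension;
```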

We also use s3fs-fuse (https://github.com/s3fs-fuse/s3fs-fuse) in our setup as a fallback for extensions that do not fully support FAL; that might be why the cache inconsistency is a problem for us.

Amazon has confirmed that there is a cache inconsistency between the stream wrapper API and the client object API: https://github.com/aws/aws-sdk-php/issues/1758. However, they consider it a feature, not a bug. That is why I think it would be best to solve this in the fal_s3 extension by clearing the stream wrapper cache in some way after the copyObject call.

I will have a deeper look at this again in the upcoming weeks and test whether using clearstatcache would also be a viable option.

arnoschoon commented 5 years ago

Hi @maechler,

the fact that deactivating the Update storage index task reduces your problem (partly), along with the number of copyObject calls, is interesting, because this seems like another issue (possibly causing the cache inconsistency and the files not showing up).

First of all, I'm missing actual hands-on experience with s3fs-fuse, so I'm making assumptions here. If the folder fileadmin/ is synced between S3 and your local file system, the modified time of an object is probably changed. A changed modified time causes FAL to re-index the file, which in turn causes copyObject calls.

Could you try a second FAL storage that just uses S3 (and another bucket, for that matter)? That should probably give us an idea whether a tool like s3fs-fuse is causing this.

As for caching: with the default configuration, the cache is stored in a database table (cf_tx_fal_s3, if I recall correctly) and keyed by the path to a file. So if the missing image can't be found in S3, it shouldn't appear in that table either; otherwise file_exists() should return FALSE and the thumbnail should be created.

maechler commented 5 years ago

Hi @arnoschoon

I am sorry I did not respond earlier; I thought I would get time to work on this soon, but it was not until today that I could check this again. At least I think I got to the core of the problem today. Unfortunately, clearing the cache with clearstatcache or by calling MaxServ\FalS3\Driver\Cache->remove does not solve the issue.

Our problem

  1. Initially, the modification time of a file is perfectly in sync between the database and the file system.
  2. For some reason the modified time in sys_file.modification_date gets out of sync with the true value on the file system. This may happen, for instance, if files are synced from one environment to another. Because this is a common issue, TYPO3 offers a scheduler task to deal with it.
  3. The scheduler task Update storage index runs periodically, detects that the modified time is out of sync and updates the value in the database in TYPO3\CMS\Core\Resource\Index\Indexer->updateIndexEntry.
  4. Through a signal slot the method MaxServ\FalS3\CacheControl\RemoteObjectUpdater->onLocalMetadataRecordUpdatedOrCreated is called. It updates the remote metadata using $client->copyObject, which in turn changes the modified time of the file. This leaves the file system modified time and sys_file.modification_date out of sync again.
  5. The next day the task Update storage index will again notice the difference in the modified time and start the whole process over, effectively leaving the database and file system out of sync forever.
| Step | sys_file.modification_date | Actual modified time |
|------|----------------------------|----------------------|
| 0    | 0                          | 0                    |
| 1    | 0                          | 10                   |
| 2    | 10                         | 10                   |
| 3    | 10                         | 11                   |

Further problems

Things get even worse if we additionally take MaxServ\FalS3\Driver\Cache into account. The value of this cache will also be out of sync; depending on the order of reading the modified time and clearing the cache, it will hold either the actual modified time or some wrong, old value. This is described in detail here: https://github.com/aws/aws-sdk-php/issues/1758

Even without the scheduler task Update storage index, I think the database will get out of sync with the true value as soon as MaxServ\FalS3\CacheControl\RemoteObjectUpdater->onLocalMetadataRecordUpdatedOrCreated and $client->copyObject are called once.

The issue may seem to happen randomly and not with every file. This is probably because the modified time is measured in seconds, so the error only occurs if steps 2 and 3 do not happen within the same second.

Possible solutions

  1. Avoid using $client->copyObject to update metadata.
  2. Use clearstatcache(true, $key) (clear the PHP stat cache) and MaxServ\FalS3\Driver\Cache->remove($key) (clear the FAL database cache) after $client->copyObject. Additionally, we would have to set sys_file.modification_date to the correct value.
  3. Avoid cache control headers (more of a workaround than a solution). As we do not use the cache control headers, we actually do not need to update the metadata. The following pull request helps to avoid such unnecessary updates: https://github.com/MaxServ/t3ext-fal_s3/pull/51

You mentioned that (1) is probably tricky. I tried to implement (2) but hesitated to update sys_file.modification_date directly because I am unsure about the side effects this might have. I finally ended up doing (3) and hope that this solves our problem.
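For reference, what I had in mind for (2) looks roughly like this (all variable names are hypothetical, and the direct sys_file update at the end is exactly the part I was unsure about):

```php
use TYPO3\CMS\Core\Database\ConnectionPool;
use TYPO3\CMS\Core\Utility\GeneralUtility;

// Hypothetical sketch of solution 2: after copyObject() has touched the
// object, drop both stale caches and write the new modified time back to
// sys_file so the indexer finds no difference on its next run.
$client->copyObject($copyConfiguration);

clearstatcache(true, $streamWrapperPath); // PHP stat cache
$cache->remove($cacheKey);                // fal_s3 database cache entry

$newModificationTime = filemtime($streamWrapperPath); // re-read from S3
GeneralUtility::makeInstance(ConnectionPool::class)
    ->getConnectionForTable('sys_file')
    ->update(
        'sys_file',
        ['modification_date' => $newModificationTime],
        ['uid' => $fileUid]
    );
```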

I am willing to support a better solution for this, be it through sharing my analysis, testing or maybe even further development.

arnoschoon commented 5 years ago

Hi @maechler

thanks for the pull request. I agree it's more a workaround than a solution, but at least it minimizes the number of times things can go wrong. Besides, it should fix CloudFront serving an empty cache-control header, which I noticed on https://d8zqac7yjs3xx.cloudfront.net/_processed_/b/e/csm_Slider-Tafels_74b49316c7.jpg for example.

I just discussed this case with a colleague of mine and we suspect copyObject goes horribly wrong when special characters are used in the key; https://d1p6n69pfnpnhu.cloudfront.net/_processed_/5/0/csm_Couple%20running%20in%20the%20city_GettyImages-970175634%20L__1564750031_185.107.38.194_fd831396f6.jpg was one of the files on a page where we saw the behavior you mentioned. I'm still trying to reproduce the issue, since this special character thing is just a gut feeling.

The reason for updating the modification time of an object is still something I don't get; while copying could be seen as modifying, countermeasures should be in place to reset the mtime.

We'll start further investigating/reproducing the copy that takes place and seemingly removes the "old" image (which causes the site to break). Looking into the cause of this behavior will follow once we have dealt with the symptoms.

maechler commented 5 years ago

Regarding special characters, I remember a fix (https://github.com/MaxServ/t3ext-fal_s3/pull/42) I implemented a year ago, although that is probably not the issue you are having at the moment. From our observations we could not conclude that the missing processed images had anything to do with special characters. We found the modified time that contributes to the checksum to be our problem.

It seems that the modified time in an S3 bucket cannot be controlled through the API: https://stackoverflow.com/questions/42718324/how-to-change-file-upload-date-in-amazon-s3-using-aws-cli

Good to know that you are working on this as well. If I can be of any help, please let me know.

arnoschoon commented 5 years ago

Hi @maechler

it seems I've managed to reproduce the issue, yet we still need to find the actual cause.

Interestingly enough, it didn't lead me to improper cache behavior in this extension, but I suspect something FAL/Core-related (I still need to dig a bit deeper). For the sake of completeness: I used a bare/minimal installation of TYPO3 CMS v9.5.9, v1.9.1 of this extension and v3.112.7 of aws/aws-sdk-php.

From what I found, the (processed) image is deleted and replaced if the modification_date of sys_file is changed (e.g. update sys_file set creation_date=1569426245, modification_date=1569426245 where uid=3;). "Replaced" is not the correct term: an equal image is stored, but with another checksum (so not identical).

As you suggested (but I wanted to verify myself, for reassurance), special characters don't come into play. It all seems to boil down to the modification_date, but sadly not the one used by S3 and being cached. It's possible that an update job synchronizes the timestamp for the storage with sys_file.

If you're using the same image on multiple pages, a cache miss might trigger FAL and remove previous copies of files referenced by other pages that are already cached.

I tested/reproduced the issue using the file list (preview thumbnails), so not images disappearing from a page, but I noticed that a file was removed after the change to the modification_date column.

Ideally, the old images should remain on the storage (I'm still not sure whether this is specific to S3) and be scheduled for deletion (when the complete cache is truncated?), or a change in FAL should bubble up to the frontend cache of TYPO3 (using tags on page cache entries).

I hope you can reproduce the behavior described above and possibly test it in a real-world situation. If it's clear that we're talking about the same issue here, we can start thinking about a solution. My suggestion for now (due to my lack of in-depth knowledge of the tags used by the caching framework) would be to flush the page cache through the caching framework from FAL.

maechler commented 5 years ago

Hi @arnoschoon

Thanks for your analysis; up to now I did not have a recipe for reproducing the issue. I can test this next week.

I am not sure why the old images are removed from the storage even though they are still on cached pages. But finding the root cause of this could definitely be useful for us.

I have not tested it, but I think using cache tags could be as simple as the following, although this will lead to a lot of cache tags per page (one per image).

```php
use TYPO3\CMS\Core\Cache\CacheManager;
use TYPO3\CMS\Core\Utility\GeneralUtility;

// Add a cache tag to the page that uses a file with uid $fileUid
$GLOBALS['TSFE']->addCacheTags(['fal_s3_' . $fileUid]);

// Flush tagged page caches, e.g. when metadata of the file with uid $fileUid changes
$cacheManager = GeneralUtility::makeInstance(CacheManager::class);
$cacheManager->flushCachesInGroupByTag('pages', 'fal_s3_' . $fileUid);
```

Additionally, I thought of a way to break the metadata update cycle I described in steps 0-4 in https://github.com/MaxServ/t3ext-fal_s3/issues/49#issuecomment-534686281. Do we really need to update the metadata every time onLocalMetadataRecordUpdatedOrCreated is triggered? Or is there a way to check whether something relevant changed, and only update the remote metadata if really necessary?

arnoschoon commented 5 years ago

Hi @maechler

As to the tags added to the page cache, I don't think that would be an issue; in my opinion such tags allow for better cache management (you might want to flush some pages when the description of an image changes, but not the whole frontend in the case of a large site). Anyway, I think this should be implemented on the "FAL side" and not be specific to this extension, so I would propose a tag like sys_file:$uid.

When it comes to breaking the cycle, I think we could check whether the metadata (the cache-control directive) we'd like to set already exists on the storage side, and skip the update if it does (see the sketch below). I always assumed, and used, the flag last_index in sys_file to protect me from multiple (unnecessary) invocations of onLocalMetadataRecordUpdatedOrCreated. It seems that due to the update of the mtime on S3 we ended up with a race condition.
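Something along these lines might already break the cycle (a sketch; headObject() is the real SDK call, everything else is assumed):

```php
// Only issue the metadata-updating self-copy when the Cache-Control
// directive currently on S3 differs from the one we want to set.
$head = $client->headObject([
    'Bucket' => $bucket,
    'Key'    => $key,
]);

if (($head['CacheControl'] ?? null) !== $desiredCacheControl) {
    $client->copyObject($copyConfiguration);
}
```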

While deleting files from the storage still doesn't make sense (without taking care of cached pages), reducing the calls to copyObject (when the content doesn't change) should avoid files being removed at random moments and limit the behavior to when you're actually developing (touching cache-control or other related settings).

I'll start working on a patch, let me get back to you ;)

maechler commented 5 years ago

Hi @arnoschoon

I could reproduce the issue with TYPO3 v8.7.25, fal_s3 v1.9.0 and aws_sdk v3.91.1. I created two pages that both include the same file with the same processing instructions and requested the pages so they got cached. Then I changed the modification_date in sys_file to a different value and cleared the cache of page 1. Page 1 shows the image correctly, but with a new hash. On page 2 the image still has a reference to the old processed image version which has been deleted.

I agree that this is probably a TYPO3 core issue and should be addressed on the FAL side. Whilst debugging I came across the exact line where the deletion occurs: https://github.com/TYPO3/TYPO3.CMS/blob/9bb994f492e75efcfe139c3d6a64ef42dbef9bd1/typo3/sysext/core/Classes/Resource/ProcessedFile.php#L513-L515

```php
public function needsReprocessing()
{
    $fileMustBeRecreated = false;
    // [..]
    if (array_key_exists('checksum', $this->properties) && $this->calculateChecksum() !== $this->properties['checksum']) {
        $fileMustBeRecreated = true;
    }
    // [..]
    if ($fileMustBeRecreated && $this->exists()) {
        $this->delete();
    }
    return $fileMustBeRecreated;
}
```

The nasty thing happens within a seemingly innocent method called needsReprocessing(); in my opinion this is bound to have undesired side effects. The code still looks the same in the current TYPO3 v10.1.

To be able to keep old processed images, one would probably have to implement something like a reference count for processed images and only delete a file when its reference count reaches zero. Probably easier, and also more adaptable, would be the solution you suggested: using cache tags and flushing the caches appropriately. This would even allow someone to change an image by uploading it over FTP and propagate the change to all pages that include it.

What do you think? Are you working on a patch for TYPO3 core?

maechler commented 4 years ago

@arnoschoon Any update on this? It would be nice if the time that both of us have already invested in analysing this issue also led to an improvement of the TYPO3 core that other users can benefit from.

helhum commented 4 years ago

On page 2 the image still has a reference to the old processed image version which has been deleted.

@maechler Did you create a ticket on Forge for this?

maechler commented 4 years ago

@helhum No, because I thought @arnoschoon was working on this, but I probably just got that wrong. I will create a ticket and post the link here as soon as I get to it.

maechler commented 4 years ago

Finally https://forge.typo3.org/issues/90600

arnoschoon commented 4 years ago

Thanks @maechler,

I actually forgot about this issue after patching that race condition with the metadata and have been focused on a bunch of other projects. Good to see that you're working to further reduce the symptoms of this issue 🙂