maechler opened this issue 5 years ago
@arnoschoon Could you please have a look at this and https://github.com/aws/aws-sdk-php/issues/1758?
@helhum @rvollebregt @haassie Is there anybody out there?
any update here?
Hi Markus and Esteban,
the reason we implemented `copyObject` in the first place is to update the objects' metadata (`CacheControl` in this case). As far as I'm aware the stream wrapper doesn't support passing metadata along, let alone modifying an object without actually touching the body (e.g. without downloading/uploading the whole thing).
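For context, such a metadata-only update boils down to a self-copy via the SDK, roughly like this (a sketch with a made-up bucket and key):

```php
use Aws\S3\S3Client;

// Copying an object onto itself with MetadataDirective=REPLACE rewrites its
// metadata (here CacheControl) without transferring the body through PHP.
$client = new S3Client(['region' => 'eu-west-1', 'version' => 'latest']);
$client->copyObject([
    'Bucket' => 'example-bucket',
    'Key' => 'fileadmin/example.jpg',
    'CopySource' => 'example-bucket/fileadmin/example.jpg',
    'MetadataDirective' => 'REPLACE',
    'CacheControl' => 'public, max-age=86400',
]);
```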
https://github.com/aws/aws-sdk-js/issues/1092 might be related, or at least explains a bit of the rationale for using this method.
I think your suggestion about clearing the cache manually is actually the only option, because this updating is done asynchronously: TYPO3 fails (or at least failed while we were developing this) to trigger an event when creating a processed file (after scaling it, for example). Processed files are one of the main reasons we want to set Cache-Control headers. Besides that, it is now possible to modify the settings through FAL indexing, so it doesn't hurt the users' experience when working with TYPO3.
We might consider setting patterns for cache-control behaviors in CloudFront, but that would force us to manage them outside of TYPO3 and would possibly limit the fine-grained control (per file type) we have now.
That being said, I think we can state that this problem isn't caused by the underlying library, at least in my opinion.
Are the files on your site that fail to show up processed files? In that case I understand why they are not created, due to a stale cache entry. If you're talking about "actual" files, it's more confusing why TYPO3 pages still have any reference to them (FAL should have taken care of that already in that case).
Hi @maechler,
just thinking this through, you might try calling https://www.php.net/manual/en/function.clearstatcache.php instead of keeping a reference to `MaxServ\FalS3\Driver\Cache`.
It's been a while, but the full stream path seems to be consistent throughout requests/cron jobs (https://github.com/MaxServ/t3ext-fal_s3/blob/master/Classes/Driver/AmazonS3Driver.php#L168) even though a cryptic prefix is used.
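A minimal sketch of that suggestion, with a made-up stream wrapper path:

```php
// Clear PHP's cached stat information (file_exists(), filemtime(), ...) for
// one specific path; passing the path limits the clear to that entry.
$path = 's3://example-bucket/fileadmin/_processed_/example.jpg';
clearstatcache(true, $path);
```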
@arnoschoon Thanks for your feedback on this!
Yes, the files that fail to show up are processed files. I have to admit I have not yet completely understood how to reproduce this error.
I have found out the following:

- Processed files are named following the pattern `[filename]_[checksum].[extension]`, where the checksum is calculated from the `file uid`, `file type`, `file name`, `modified time` and the processing configuration.
- The modified time of the object in S3 and the one in the `fal_s3` cache are not the same. That is how I came across this cache inconsistency issue.
- Running `Update storage index` has mostly solved the issue for us, probably because it led to a lot of `copyObject` calls. However the underlying cache inconsistency problem remains.
- We also use `s3fs-fuse` (https://github.com/s3fs-fuse/s3fs-fuse) in our setup as a fallback for extensions not fully supporting FAL; that might be why the cache inconsistency is a problem for us.
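To make the naming concrete, here is a rough sketch of how such a checksum-based name comes about (illustrative only, not the actual TYPO3 core implementation; all values are made up):

```php
// Illustrative sketch, not TYPO3 core code: the point is that the modified
// time feeds into the checksum, so a changed mtime yields a new file name
// even though the image content is identical.
$properties = [
    'uid' => 42,                                                // file uid
    'type' => 2,                                                // file type
    'name' => 'csm_example.jpg',                                // file name
    'mtime' => 1569426245,                                      // modified time
    'configuration' => ['width' => '200c', 'height' => '150'],  // processing configuration
];
$checksum = substr(md5(serialize($properties)), 0, 10);
$processedFileName = sprintf('csm_example_%s.jpg', $checksum); // [filename]_[checksum].[extension]
```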
That there is a cache inconsistency issue between the stream wrapper API and the client object API is confirmed by Amazon: https://github.com/aws/aws-sdk-php/issues/1758

They however consider it a feature and not a bug. That is why I think it would be best to solve this in the `fal_s3` extension by clearing the stream wrapper cache in some way after the `copyObject` call.
I will have a deeper look at this again in the upcoming weeks and test whether using `clearstatcache` would also be a viable option.
Hi @maechler,
the fact that deactivating the `Update storage index` task reduces your problem (partly) and the number of `copyObject` calls is interesting, because this seems like another issue (possibly causing the cache inconsistency and the files not showing up).
First of all, I'm missing actual hands-on experience with `s3fs-fuse`, so I'm making assumptions here. If the folder `fileadmin/` is synced between S3 and your local file system, the modified time of an object is probably changed. A changed modified time causes FAL to re-index it, thus causing `copyObject` calls in that process.
Could you try with a second FAL storage that just uses S3 (and another bucket, for that matter)? It should probably give us an idea of whether a tool like `s3fs-fuse` is causing things.
As for caching: if you use the default configuration, the cache is stored in a database table (`cf_tx_fal_s3`, if I recall correctly) and the key should be the path to a file. So when you say the missing image can't be found in S3, it shouldn't appear here either; otherwise `file_exists()` should return FALSE and the thumbnail should be created.
Hi @arnoschoon
I am sorry I did not respond earlier; I thought I would be given time to work on this soon, but it was not until today that I had time to check this again. At least I think I got to the core of the problem today. Unfortunately, clearing the cache with `clearstatcache` or by calling `MaxServ\FalS3\Driver\Cache->remove` does not solve the issue.
Our problem
1. `sys_file.modification_date` gets out of sync with the true value on the file system. This may happen, for instance, if files are synced from one environment to another. Because this is a common issue, TYPO3 offers a scheduler task to deal with this.
2. `Update storage index` runs periodically, detects that the modified time is out of sync and updates the value in the database in `TYPO3\CMS\Core\Resource\Index\Indexer->updateIndexEntry`.
3. `MaxServ\FalS3\CacheControl\RemoteObjectUpdater->onLocalMetadataRecordUpdatedOrCreated` is called. It updates the remote metadata using `$client->copyObject`, which in turn changes the modified time of the file. This leaves the file system modified time and `sys_file.modification_date` out of sync.
4. `Update storage index` will again notice the difference in the modified times and start the whole process again, effectively leaving the database and file system out of sync forever.

Step | sys_file.modification_date | Actual modified time
---|---|---
0 | 0 | 0
1 | 0 | 10
2 | 10 | 10
3 | 10 | 11
Further problems
Things get even worse if we additionally take `MaxServ\FalS3\Driver\Cache` into account. The value of this cache will also be out of sync. Depending on the order of reading the modified time and clearing the cache, it will either have the actual modified time or some wrong, old value.
This is described in detail here: https://github.com/aws/aws-sdk-php/issues/1758
Even without the scheduler task `Update storage index`, I think the database will get out of sync with the true value as soon as `MaxServ\FalS3\CacheControl\RemoteObjectUpdater->onLocalMetadataRecordUpdatedOrCreated` and `$client->copyObject` are called once.
The issue may seem to happen randomly and not with every file. This is probably because the modified time is measured in seconds: the error only occurs if steps 2 and 3 do not happen within the same second.
Possible solutions
1. Do not use `$client->copyObject` to update metadata.
2. Call `clearstatcache(true, $key)` (clear the PHP cache) and `MaxServ\FalS3\Driver\Cache->remove($key)` (clear the FAL database cache) after `$client->copyObject`. Additionally we would have to set `sys_file.modification_date` to the correct value.
3. Clear the caches as in (2), but without updating `sys_file.modification_date` directly.

You mentioned that (1) is probably tricky. I tried to implement (2) but hesitated to update `sys_file.modification_date` directly because I am unsure about the side effects this might have. I finally ended up doing (3) and hope that this solves our problem.
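A sketch of what (2)/(3) amount to, assuming `$key` is the full stream wrapper path of the object and `$cache` is the driver's `MaxServ\FalS3\Driver\Cache` instance (variable names are illustrative):

```php
// After updating the metadata, drop both caches for this one object so the
// next stat call picks up the new modified time. $copyConfiguration, $key
// and $cache are assumed to exist in the surrounding driver code.
$client->copyObject($copyConfiguration); // changes the object's mtime in S3
clearstatcache(true, $key);              // clear PHP's stat cache for $key
$cache->remove($key);                    // clear the FAL S3 driver cache entry
```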
I am willing to support a better solution for this, be it through sharing my analysis, testing or maybe even further development.
Hi @maechler
thanks for the pull request. I agree it's more workaround than solution, but at least it minimizes the number of times it can go wrong. Besides, it should fix CloudFront serving an empty `cache-control` header, which I noticed on https://d8zqac7yjs3xx.cloudfront.net/_processed_/b/e/csm_Slider-Tafels_74b49316c7.jpg for example.
I did just discuss this case with a colleague of mine and we suspect `copyObject` goes horribly wrong when special characters are used in the key; https://d1p6n69pfnpnhu.cloudfront.net/_processed_/5/0/csm_Couple%20running%20in%20the%20city_GettyImages-970175634%20L__1564750031_185.107.38.194_fd831396f6.jpg was one of the files on a page on which we saw the behavior you mentioned. I'm still trying to reproduce the issue, since this special character thing is just a gut feeling.
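If special characters do turn out to matter, my first guess (an assumption, nothing verified yet) would be the `CopySource` parameter, which has to be URL-encoded, unlike `Bucket` and `Key`:

```php
// Assumption for illustration: a key containing spaces or special characters
// must be URL-encoded in CopySource, while Bucket and Key take raw values.
$client->copyObject([
    'Bucket' => $bucket,
    'Key' => $key,                                   // raw key
    'CopySource' => urlencode($bucket . '/' . $key), // encoded source
    'MetadataDirective' => 'REPLACE',
    'CacheControl' => 'public, max-age=86400',
]);
```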
The reason for updating the modification time of an object is still something I don't get; while copying could be seen as modifying, countermeasures should be in place to reset the `mtime`.
We'll start further investigating/reproducing the copy that takes place and seemingly removes the "old" image (which causes the site to break). Looking into the cause of this behavior will follow once we've dealt with the symptoms.
Regarding special characters, I remember a fix (https://github.com/MaxServ/t3ext-fal_s3/pull/42) I implemented a year ago, although that is probably not the issue you are having at the moment. From our observations we could not conclude that the missing processed images had anything to do with special characters. We found the modified time that contributes to the checksum to be our problem.

It seems that the modified time in an S3 bucket cannot be controlled through the API: https://stackoverflow.com/questions/42718324/how-to-change-file-upload-date-in-amazon-s3-using-aws-cli
Good to know that you are working on this as well. If I can be of any help, please let me know.
Hi @maechler
it seems I've managed to reproduce the issue, yet we still need to find the actual cause.
Interestingly enough it didn't lead me to improper cache behavior of this extension, but I suspect something FAL/core related (I still need to dig a bit deeper). For the sake of completeness: I used a bare/minimal installation of TYPO3 CMS v9.5.9, v1.9.1 of this extension and v3.112.7 of aws/aws-sdk-php.
From what I found, the (processed) image is deleted and replaced if the `modification_date` of `sys_file` is changed (e.g. `update sys_file set creation_date=1569426245, modification_date=1569426245 where uid=3;`). "Replaced" is not the correct term: an equal image is stored, but with another checksum (so not identical).
As you suggested (but what I wanted to find out myself, for reassurance), special characters don't come into play. It all seems to boil down to the `modification_date`, but sadly not the one used by S3 and being cached. It's possible that an update job synchronizes the timestamp for the storage with `sys_file`.
If you're using the same image on multiple pages, a cache miss might trigger FAL to remove previous copies of files that are referenced by other, already cached pages.
I tested/reproduced the issue using the file list (preview thumbnails), so not images disappearing from a page, but I noticed that a file was removed after the change to the `modification_date` column.
Ideally the old images should remain on the storage (I'm still not sure if this is specific to S3) and be scheduled for deletion (when the complete cache is truncated?), or a change in FAL should bubble up to the frontend cache of TYPO3 (with the use of tags on page cache entries).
I hope you can reproduce the behavior described above and possibly test it in a real-world situation. If it's clear that we're talking about the same issue here, we can start thinking about a solution. My suggestion for now (due to a lack of in-depth knowledge of the tags used by the caching framework) would be to flush the page cache through the caching framework from FAL.
Hi @arnoschoon
Thanks for your analysis; up to now I did not have a recipe for reproducing the issue. I can test this next week.
I am not sure why the old images are removed from the storage even though they are still on cached pages. But finding the root cause of this could definitely be useful for us.
I have not tested it, but I think using cache tags could be something simple like the following, although this will lead to a lot of cache tags per page (one per image).
```php
use TYPO3\CMS\Core\Cache\CacheManager;
use TYPO3\CMS\Core\Utility\GeneralUtility;

// Add a cache tag to the page that uses a file with uid $fileUid
$GLOBALS['TSFE']->addCacheTags(['fal_s3_' . $fileUid]);

// Clear caches, e.g. when metadata of the file with uid $fileUid changes
$cacheManager = GeneralUtility::makeInstance(CacheManager::class);
$cacheManager->flushCachesInGroupByTag('pages', 'fal_s3_' . $fileUid);
```
Additionally, I thought of a way to break the metadata update cycle I described in steps 0-4 in https://github.com/MaxServ/t3ext-fal_s3/issues/49#issuecomment-534686281.

Do we really need to update the metadata every time `onLocalMetadataRecordUpdatedOrCreated` is triggered? Or is there a way to check whether something relevant changed and only update the remote metadata if really necessary?
Hi @maechler
As to the tags added to the page cache, I think it shouldn't be an issue; in my opinion such tags allow for better cache management (you might want to flush some pages if the description of an image changes, but not the whole frontend in the case of a large site). Anyway, I think this should be implemented on the "FAL side" and not be specific to this extension, so I would propose a tag like `sys_file:$uid`.
When it comes to breaking the cycle, I think we could check whether the metadata (cache-control directive) we'd like to set already exists on the storage side. I always assumed, and used, the flag `last_index` in `sys_file` to protect me from multiple (unnecessary) invocations of `onLocalMetadataRecordUpdatedOrCreated`. It seems that, due to the update of `mtime` on S3, we ended up with some race condition.
While deleting files from the storage still doesn't make sense (without taking care of cached pages), the reduction of calls to `copyObject` (if the content doesn't change) should avoid the removal of files at random moments and limit the behavior to when you're actually developing (touching cache-control or other related settings).
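A minimal sketch of that check, assuming `$client`, `$bucket`, `$key` and `$desiredCacheControl` are available (illustrative, not the actual patch):

```php
// Only issue a copyObject when the stored Cache-Control differs from the
// one we want to set; an unchanged object keeps its mtime.
$head = $client->headObject(['Bucket' => $bucket, 'Key' => $key]);

if (($head['CacheControl'] ?? '') !== $desiredCacheControl) {
    $client->copyObject([
        'Bucket' => $bucket,
        'Key' => $key,
        'CopySource' => $bucket . '/' . $key,
        'MetadataDirective' => 'REPLACE',
        'CacheControl' => $desiredCacheControl,
    ]);
}
```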
I'll start working on a patch, let me get back to you ;)
Hi @arnoschoon
I could reproduce the issue with TYPO3 v8.7.25, `fal_s3` v1.9.0 and `aws_sdk` v3.91.1. I created two pages that both include the same file with the same processing instructions and requested the pages so they got cached. Then I changed the `modification_date` in `sys_file` to a different value and cleared the cache of page 1. Page 1 shows the image correctly, but with a new hash. On page 2 the image still references the old processed image version, which has been deleted.
I agree that this is probably a TYPO3 core issue and should be addressed on the FAL side. Whilst debugging I came across the exact lines where the deletion occurs: https://github.com/TYPO3/TYPO3.CMS/blob/9bb994f492e75efcfe139c3d6a64ef42dbef9bd1/typo3/sysext/core/Classes/Resource/ProcessedFile.php#L513-L515
```php
public function needsReprocessing()
{
    $fileMustBeRecreated = false;
    // [..]
    if (array_key_exists('checksum', $this->properties) && $this->calculateChecksum() !== $this->properties['checksum']) {
        $fileMustBeRecreated = true;
    }
    // [..]
    if ($fileMustBeRecreated && $this->exists()) {
        $this->delete();
    }
    return $fileMustBeRecreated;
}
```
The nasty thing happens within a seemingly innocent method called `needsReprocessing()`; in my opinion this is bound to have undesired side effects. The code still looks the same in the current TYPO3 v10.1.
To be able to keep old processed images, one would probably have to implement something like a reference count for processed images and only delete a file once its reference count reaches zero. Probably easier and also more adaptable would be the solution you suggested: using cache tags and flushing the caches appropriately. This would even allow someone to change an image by uploading it over FTP and propagate the change to all pages that include it.
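To illustrate the reference-count idea (purely hypothetical; `getReferenceCount()` does not exist in core), the deletion guard in `needsReprocessing()` could become something like:

```php
// Hypothetical variant of the deletion guard: only delete the outdated
// processed file once nothing references it anymore.
if ($fileMustBeRecreated && $this->exists() && $this->getReferenceCount() === 0) {
    $this->delete();
}
```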
What do you think? Are you working on a patch for TYPO3 core?
@arnoschoon Any update on this? It would be nice if the time that both of us already invested in analysing this issue would also lead to an improvement of the TYPO3 core that other users can benefit from.
> On page 2 the image still has a reference to the old processed image version which has been deleted.
@maechler Did you create a ticket on Forge for this?
@helhum No, because I thought @arnoschoon was working on this, but probably I just got this wrong. I will create a ticket and post the link here as soon as I get to it.
Thanks @maechler,
I actually forgot about this issue after patching that race condition with the metadata and have been focused on a bunch of other projects. Good to see that you're working to further reduce the symptoms of this issue 🙂
We recently observed an issue where some processed files would not show up (`403 Forbidden`) in the frontend because they did not exist anymore in the S3 bucket. Clearing the frontend cache resolved the issue, however files kept disappearing afterwards.

We tracked the issue down to a cache inconsistency caused by `$client->copyObject`. When an object is copied using `$client->copyObject`, its modified time in the S3 bucket gets updated. The registered stream wrapper cache `MaxServ\FalS3\Driver\Cache`, however, is not cleared for that object. Because the modified time is part of the checksum used for processed files, this can lead to the observed behaviour.

The underlying problem lies within the AWS SDK library. I already created an issue describing the details: https://github.com/aws/aws-sdk-php/issues/1758

However I do not know how fast (if at all) this will be solved. Possible workarounds for the FAL S3 extension could be:

1) Replace `$client->copyObject` with `copy()`. I have not tested this, but I would hope that the stream wrapper cache gets cleared when using `copy()`.
2) Store a reference to `MaxServ\FalS3\Driver\Cache` and clear the cache manually for the corresponding object after `$client->copyObject`.

I would be happy to create a pull request to solve this issue. However I do not know what exactly the call to `$client->copyObject` is used for. I would really appreciate it if you could have a look at this issue.

https://github.com/MaxServ/t3ext-fal_s3/blob/d30a59949163422240a44245c960067c0a0d4857/Classes/CacheControl/RemoteObjectUpdater.php#L152-L160
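For illustration, workaround (1) might look roughly like this (untested, as noted above; the path is made up):

```php
// Untested sketch of workaround (1): unlike copyObject, copy() streams the
// body through PHP (download + upload), but it goes through the registered
// stream wrapper and should therefore keep its stat cache consistent.
$path = 's3://example-bucket/fileadmin/_processed_/example.jpg';
copy($path, $path);
```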