VlaamseKunstcollectie / Imagehub

A IIIF Presentation API compliant aggregator and web service
GNU General Public License v3.0

Improve performance for Fill Resourcespace command #7

Closed Hobbesball closed 5 years ago

Hobbesball commented 5 years ago

Detailed description of the issue
Uploading images to ResourceSpace using the app:fill-resourcespace command is currently rather slow and won't scale to thousands of images.

Additional context
If we want this ETL to run periodically, it is important that it is optimised as much as possible. The biggest bottleneck will always be the speed at which the ResourceSpace API handles requests and processes images, so there is a hard limit to how much can be optimised without changing the ResourceSpace code itself.

Possible implementation

Current environment
This issue was encountered when testing the app:fill-resourcespace command to fill ResourceSpace on the VPS with the 185 images used in the test setup.

Kitania commented 5 years ago

Fixed in 3498f51 - sort of, anyway

NOTE: make sure the following line of code is present in includes/config.php in the ResourceSpace installation the next time you run the command, otherwise it will screw up the filenames:

$filename_field = NULL;

The next time you run this command, it will still be really slow as it needs to re-initialize all data. After that, subsequent runs should be really fast.

The bottleneck of the command was (and still is) the following line of code:

new Imagick($fullImagePath)

Whenever this line of code is called, PHP reads the entire image into memory, even if we don't actually do anything with it. That being said, we can't expect a process to read and process 17GB of data in just a matter of seconds. One way or another, this bottleneck will always be there the first time the command is run. I don't think there is any way around that.
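For illustration, a minimal sketch of how that cost shows up; the path is hypothetical, and $fullImagePath stands for whatever the command resolves for each file in the drop folder:

// Hypothetical sketch: timing the Imagick construction for a single file.
$fullImagePath = '/path/to/drop-folder/example.tif'; // hypothetical path
$start = microtime(true);
$image = new Imagick($fullImagePath); // decodes the entire TIFF into memory
printf("Decoding %s took %.2f seconds\n", basename($fullImagePath), microtime(true) - $start);
$image->clear(); // free the image data again

Across 17GB of TIFFs, this decoding step is what dominates the runtime of the first run.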

When removing all references to Imagick, the command is actually really fast and consumes a negligible amount of CPU and RAM. Trying to optimize the rest of the code won't make any kind of noticeable impact.

However, I did manage to pretty much completely remove the bottleneck in subsequent runs by only creating Imagick objects whenever needed. That is, whenever we notice the content of an image file has changed. This presented us with another problem: we can only know an image file has changed by looking at the hash of said image file.

ResourceSpace automatically generates an MD5 hash during upload, but we only upload a scaled down JPEG image as we don't want to upload the high-resolution image. To my knowledge, there is no way to set the 'file_checksum' field inside ResourceSpace through the API. There is also no metadata field in the current model suitable to contain this data. Therefore, we were forced to always create a scaled down JPEG file and take the MD5 hash of that JPEG file before we could know if an image had changed or not.
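To make that concrete, here is a minimal sketch of the change-detection step as described above; the resize dimensions, quality setting, temporary path and the way the previous checksum is retrieved are all hypothetical:

// Hypothetical sketch: detect changes by hashing the scaled-down JPEG derivative.
$image = new Imagick($fullImagePath);   // unavoidable: the source TIFF has to be read
$image->setImageFormat('jpeg');
$image->setImageCompressionQuality(85); // hypothetical quality setting
$image->thumbnailImage(2000, 0);        // scale down, preserving aspect ratio
$jpegPath = sys_get_temp_dir() . '/' . md5($fullImagePath) . '.jpg';
$image->writeImage($jpegPath);
$image->clear();

$newHash = md5_file($jpegPath);
// $previousHash: the checksum recorded for the previously uploaded derivative,
// however it is retrieved (hypothetical variable).
if ($newHash !== $previousHash) {
    // the image has changed: upload $jpegPath to ResourceSpace
}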

There are several possible solutions to this problem:

I chose the last option; although it doesn't seem like a perfect solution either, it's better than any of the other approaches. If you guys have alternative suggestions for this issue, feel free to let me know.

Hobbesball commented 5 years ago

Thanks for the detailed writeup!

The solution we're going with is yet another option besides the ones you already listed: making sure the files in the drop folder are already scaled down with JPEG compression and uploading those files directly to ResourceSpace, which should result in the images in RS having the same hash as the ones in the drop folder. The implications of this are that:

Kitania commented 5 years ago

TL;DR version: the suggested approach does not seem to be viable. Even with the scaled-down files, uploading 1.7GB of images to ResourceSpace through the API takes half an hour (don't ask me why). Unless there are ways to heavily optimize the ResourceSpace API, I don't think we have any choice but to still scale down the images to small (< 1MB) JPEGs and upload these to ResourceSpace.

To verify that the bottleneck in this case really is the ResourceSpace API, I performed a test run where I uploaded all 182 images directly to ResourceSpace, without using exiftool, ImageMagick or any other processing, so the only thing that happens is the actual upload (and the subsequent image processing by the ResourceSpace API itself). It took half an hour to complete.

Secondly, I encountered a bug in ResourceSpace: it can't handle file uploads through the API if the files don't have a filename extension.

ResourceSpace needs to know the extension of a resource and the code has built-in functionality for detecting extensions. It tries to do so through the filename; if no extension is found in the filename then it will use exiftool to auto-detect it. However, there is a bug in their code (line 162 in includes/image_processing.php) that throws an exception, resulting in the file upload failing:

$cmd=$exiftool_fullpath." -filetype -s -s -s ".escapeshellarg($processfile['tmp_name']);

For some reason, $processfile is not set, which throws an exception as a result. I have not yet dug deeper into their code to figure out why $processfile is not being initialized in the first place, or whether it is actually $processfile that should be used in the method call rather than another variable.

Two possible workarounds (other than the ResourceSpace developers fixing the bug):

In my test setup, I have chosen the first approach because it is the least resource intensive. I have made and pushed these changes to a new branch, https://github.com/VlaamseKunstcollectie/Imagehub/tree/uncompressed_upload, where the suggested approach is implemented.

However, considering the command still takes half an hour to complete, the entire implementation seems rather useless to me. Further scaling down the already scaled-down images also makes them very blurry in ResourceSpace, so that approach doesn't seem very practical either.

One more interesting thing I found is that processing 1.7GB of data takes about 4 minutes to complete (on my local setup), while processing 17GB of data takes almost 5 minutes (278 seconds). These 278 seconds are spent on:

Subsequent runs take 30 seconds to complete, 25 seconds of which are spent retrieving data from the Datahub.

Therefore, I would suggest sticking with the uncompressed TIFF files in the drop folder, having the command store hashes of the uncompressed TIFFs in a local MongoDB database, and scaling down the images to JPEG before uploading them to ResourceSpace.
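A minimal sketch of what that could look like, assuming the mongodb/mongo-php-library is installed; the database, collection and field names are hypothetical:

// Hypothetical sketch: keep hashes of the uncompressed TIFFs in a local MongoDB
// and only scale down and re-upload when the TIFF itself has changed.
$collection = (new MongoDB\Client('mongodb://localhost:27017'))
    ->selectDatabase('imagehub')
    ->selectCollection('image_hashes');

$currentHash = md5_file($fullImagePath); // hash of the uncompressed TIFF
$stored = $collection->findOne(['path' => $fullImagePath]);

if ($stored === null || $stored['hash'] !== $currentHash) {
    // scale down to JPEG (e.g. with Imagick, as above) and upload it to ResourceSpace here
    $collection->updateOne(
        ['path' => $fullImagePath],
        ['$set' => ['hash' => $currentHash]],
        ['upsert' => true]
    );
}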

Hobbesball commented 5 years ago

To solve this issue we need:

After extensive discussion, this is how we would like to solve this issue:

Possible implementation

Hobbesball commented 5 years ago

Additional comments:

Hobbesball commented 5 years ago

When testing fill-resourcespace with JPG-compressed TIFFs, we ran into issues #23 and #24, which have since been fixed. Closing this issue for now.