Unified upload pipeline

WoltLab / WCF

WoltLab Suite Core (previously WoltLab Community Framework)

https://www.woltlab.com

GNU Lesser General Public License v2.1

237 stars 144 forks source link

Unified upload pipeline #5668

Closed BurntimeX closed 4 months ago

BurntimeX commented 1 year ago

File uploads have been implemented differently several times in the software so far. The unified upload pipeline is intended to standardize these cases and make upload-relevant features available throughout:

Chunked uploads
Client-side image scaling
Server-side image scaling
User defined image processing (clipping, rotating, focal point for thumbnails, etc.)
Thumbnail generation

ref https://www.woltlab.com/community/thread/278320-automatische-bildskalierung-auch-in-der-galerie/ https://www.woltlab.com/community/thread/266577-chunked-upload/ https://www.woltlab.com/community/thread/266059-titelbild-positionieren/ https://www.woltlab.com/community/thread/291287-%C3%BCberpr%C3%BCfung-der-erlaubten-dateiendung/

dtdesign commented 1 year ago

Uploads should be tracked in a central database table and physically located in a directory provided by the core itself. This allows for automated clean ups because files do not need to be tracked through countless different places.

Initially there was the idea of cleanly separating files by the package they belong to with additional grouping based on the type but if one thinks about it, that approach solves nothing. Grouping them together has several advantages:

A uniform directory structure using an /ab/12/ pattern allows for ~650 million files before reaching a 10k limit per directory.
Having one centralized directory makes it easy to clean up because filenames follow a unique pattern that allows for an incremental scan.
- We can traverse both the file system and the database in predictable steps because for it is a single incrementing ID for everything.
- File cleanups can optionally be asynchronous because we are able to map the existing files against the database.
Advanced configuration opportunities on a web server / filesystem level:
- The directory itself can be mounted onto a separate device if capacity is a concern.
- Moving all uploads into a central directory allows for simple security practices like path-based disabling of script execution.
- Known path prefix can be used to dynamically cache files for example using HAProxy’s cache feature.

In general, housekeeping of files can be greatly simplified because only the core needs to keep track of the files and validate them against the database. Applications or plugins would no longer need to take care of this themselves, just marking the upload entry for removal (or just dropping the entry itself) would be sufficient for everything to be eventually be cleaned up.

dtdesign commented 1 year ago

On top of the general file uploads we can introduce a secondary layer that manages thumbnails. This would be a parallel structure that co-exist among the upload system but depends on it.

Thumbnails are attached to a single upload and store – among other data – their file size, SHA-1 checksum as well as a JSON-serialized list of settings that were used to create those thumbnails. This offers a multitude of benefits:

Damaged thumbnails can easily be detected and recreated based on their file size and SHA-1 checksum.
The serialized list of settings can be used to check if a thumbnail needs to be rebuilt. If the file matches the file size and checksum, and all settings are the same, then the file is still valid as-is.
- The list of settings must be normalized and sorted.

Thumbnails would need an extra text column to describe the type of it so that (uploadID,name) can be used as the primary key of the table.

Keeping thumbnails separated from the regular upload pipeline allows them to be created asynchronously and/or have them processed by a compression library. I’m not sure if we can achieve atomic updates of thumbnails or if that is even worth it.

We may need to denormalize the list of available thumbnails and store them alongside the upload entry to make it easier to make decisions based on this. However, that would make for some awful queries and one would still run into issues when that information is out of sync. Maybe this is a premature optimization and we should simply see how far we can get without tapping into this kind of shenanigans?

dtdesign commented 1 year ago

Historically attachments were saved without file extensions to prevent them being abused for some kind of remote code execution. A few versions back we switched to the .bin extension which solves the same problem but also prevents braindead FTP programs from transmitting images as ASCII.

However, this means that any attachment (and later, media files) have to be piped through PHP which is inefficient in multiple ways. It ties up resources that can be better used elsewhere and especially for images there are already web servers that do a significantly better job than we could ever do.

The directory structure and file names already make it impossible to guess the name of a file so we could design the system in a way that allows hot linking these files. This requires the files to retain their file extension so that the web server is able to pick up the files and serve them with their proper file type to browsers.

For safety reasons this should only be done for a fixed list of file extensions that are known to be safe for direct access:

avif (image)
bz2 (compressed file)
gif (image)
gz (compressed file)
jpg (image)
jpeg (image)
mp3 (audio)
mp4 (video)
ogg (audio)
pdf (pdf, haha surprise!)
png (image)
rar (compressed file)
svg (image)
tar (archive)
tiff (image)
tga (image)
txt (plain text)
webm (video)
webp (image)
woff2 (font)
zip (compressed file)

dtdesign commented 10 months ago

The file upload in the browser must be an asynchronous data pipeline with event access to allow custom behavior, such as scaling or other form of image manipulation to take place. This could even allow some actual UI to be presented to the user as part of the workflow.

Supporting chunked uploads requires additional logic to handle the temporary caching and stitching. In order to avoid multiple code paths, we can simply treat every upload as a chunked upload which comes with a small overhead but offers us a greatly simplified code flow.

dtdesign commented 10 months ago

We need a dedicated setting that limits the maximum size of files uploaded this way. Stitching together the file is a I/O-heavy operation that is limited by the available I/O bandwidth which is effectively unknown to the process.

For example, a SATA SSD will typically have write speeds of about 500 MB/s which may sound a lot but also means that the default max_execution_time of 30 seconds only allows for ~15 GB of data to be written. The actual throughput will be lower because of possible other I/O activity and for this example assumes that we get the full exclusive bandwidth.

In practice, writing 10 GB to a SATA SSD is possibly a reasonable upper limit that can be achieved within 30 seconds. This looks even more grim when dealing with SATA HDDs which is around 140 MB/s, allowing for a theoretical maximum of 4.2 GB being written in 30 seconds. However, unlike SSDs, the same HDD is possibly responsible reading the source data which means that writing a file larger than ~2 GB in 30 seconds is unlikely to succeed.

The actual available bandwidth is most likely unknown to the site operator which makes this even more difficult. We could set the default limit to 1 GB which should work on pretty much every system.

dtdesign commented 10 months ago

https://www.php.net/manual/en/info.configuration.php#ini.max-execution-time

On non Windows systems, the maximum execution time is not affected by system calls, stream operations etc. Please see the set_time_limit() function for more details.

That is interesting and suggest that file I/O actually does not count towards the max_execution_time, at least on non Windows systems. However, the web server and any reverse proxy (if present) have their own timeouts so we still need reasonable limits to prevent the requests from timing out.

dtdesign commented 9 months ago

The files at minimum require an objectTypeID to be able to infer where they belong, but other than that no further information is required.

At first I though that we should also store an object id, but that has the drawback of duplicating data and due to the lack of foreign keys is prone to be unreliable. Actually, we do not need the objectID at all:

For 1:1 mappings, the referring object can just store the file id itself. For example, a style could store its page logo as pageLogoFileID that in return is a foreign key that references wcf1_file.fileID.
For 1:n mappings an intermediate mapping table can be used to store the references. Attachments are an example of such a mapping where multiple files are associated with a single object.