ckan / ideas

[DEPRECATED] Use the main CKAN repo Discussions instead:
https://github.com/ckan/ckan/discussions

Filestore resource precompression #268

Open jqnatividad opened 3 years ago

jqnatividad commented 3 years ago

CKAN version 2.9.1

Describe the bug: It's not strictly a bug, but I'm filing it as one, given the high payoff.

CKAN is widely used to distribute data. OOTB, max_resource_size is set to 10 MB. Nowadays, that setting is minuscule, and even when CKAN is configured to handle large files (>500 MB), several related problems present themselves:

  1. resource_create and resource_update file uploads can only handle files up to 2 GB (otherwise, you get OverflowError: string longer than 2147483647 bytes)
  2. even for files smaller than 2 GB, the current implementation requires loading the whole file into memory, sometimes freezing the client.
  3. connections become unreliable and time out, as HTTP is not optimized for handling such large single requests

These issues can be mitigated by using chunked/streaming uploads. That is a separate issue in its own right, and it would also require existing Filestore API clients to be reworked.

Several users have also migrated to alternate filestores like ckanext-cloudstorage to overcome this limitation.

But large file handling can be largely mitigated by adding native support for resource precompression in the Filestore API, in a way that is transparent to existing Filestore API clients.

With a big benefit for all CKAN users - developers, publishers, and downstream users alike:

To do so, the following needs to be implemented:

The only potential downside is the increased filestore storage footprint - as you now need to store both the compressed and uncompressed variants. IMHO, storage is very cheap and bandwidth/performance far more expensive/valuable.
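To make the "both variants" layout concrete, here is a minimal sketch of precompressing a stored resource into a `.gz` sidecar file next to the original. It is not CKAN code: the `precompress` helper, its parameters, and the sidecar naming are assumptions for illustration. It streams in fixed-size chunks, so it does not load the whole file into memory.

```python
import gzip
import shutil
from pathlib import Path


def precompress(path, chunk_size=1024 * 1024, level=6):
    """Write `<path>.gz` next to `path` and return the new path.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        # copyfileobj moves chunk_size bytes at a time, so memory use
        # stays flat regardless of how large the resource is.
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

Calling `precompress("/path/to/resource.csv")` would leave `resource.csv` untouched and add `resource.csv.gz` beside it, which is the layout that precompressed-static-file settings in web servers typically look for.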

But even here, we can use gzip_static always and force nginx to always serve the gzipped file, eliminating the need to store the uncompressed variant and effectively reducing your filestore storage requirements as well!
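Storing only the gzipped variant means a client that does not accept gzip needs the file decompressed on the way out (nginx has a gunzip module for this case). As a rough sketch of that decision in Python, assuming hypothetical helper names `accepts_gzip` and `open_for_client` and a deliberately simplified Accept-Encoding parser:

```python
import gzip


def accepts_gzip(accept_encoding):
    """Return True if an Accept-Encoding header value allows gzip.

    Simplified parsing: handles "coding" and "coding;q=N" items only.
    """
    qvalues = {}
    for item in accept_encoding.split(","):
        coding, _, params = item.partition(";")
        coding = coding.strip().lower()
        if not coding:
            continue
        q = 1.0
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as a refusal
        qvalues[coding] = q
    # An explicit gzip entry takes precedence over the "*" wildcard.
    return qvalues.get("gzip", qvalues.get("*", 0.0)) > 0


def open_for_client(gz_path, accept_encoding):
    """Return (headers, file object) for a resource stored only as gzip."""
    if accepts_gzip(accept_encoding):
        # Pass the compressed bytes through untouched.
        return {"Content-Encoding": "gzip"}, open(gz_path, "rb")
    # Otherwise decompress lazily while the caller reads from the stream.
    return {}, gzip.open(gz_path, "rb")
```

Both branches return a file-like object that can be read in chunks, so neither path requires holding the whole resource in memory.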

It's also noteworthy that uwsgi supports gzip precompression with the static-gzip-all setting - https://ugu.readthedocs.io/en/latest/compress.html

jqnatividad commented 3 years ago

While considering this, it's also noteworthy that both nginx and uwsgi support the more efficient brotli compression format.

https://medium.com/oyotech/how-brotli-compression-gave-us-37-latency-improvement-14d41e50fee4

https://uwsgi-docs.readthedocs.io/en/latest/Changelog-2.0.16.html?highlight=brotli

https://docs.nginx.com/nginx/admin-guide/dynamic-modules/brotli/

and all modern browsers support it as well. https://caniuse.com/brotli

and brotli support in Python requests is imminent: https://github.com/psf/requests/issues/4525