emscripten-core / emscripten

Emscripten: An LLVM-to-WebAssembly Compiler

Warn in docs that Chrome doesn't cache large .data and .wasm files #12212

Open vadimkantorov opened 4 years ago

vadimkantorov commented 4 years ago

I have a primitive HTTP server that sets an ETag for all files, including html/js/data/wasm. However, Chrome seems not to cache any file larger than a few dozen megabytes. This leads to re-downloading large data files on every load, which is slow-ish even on localhost.

My question on SO about this: https://stackoverflow.com/questions/63891436/chrome-refuses-to-cache-binary-data-files, code:

import os
import hashlib
import http.server

root = '.'

mime = {
    '.manifest': 'text/cache-manifest',
    '.html': 'text/html',
    '.png': 'image/png',
    '.jpg': 'image/jpg',
    '.svg': 'image/svg+xml',
    '.css': 'text/css',
    '.js': 'text/javascript',
    '.wasm': 'application/wasm',
    '.data': 'application/octet-stream',
}
mime_fallback = 'application/octet-stream'

def md5(file_path):
    # Hash in fixed-size blocks so large .data/.wasm files don't have to
    # be held in memory just to compute the ETag.
    hash = hashlib.md5()
    with open(file_path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            hash.update(block)
    return hash.hexdigest()

# Precompute an ETag (content hash) for every served file under root.
cache = {
    os.path.join(root, f): md5(os.path.join(root, f))
    for f in os.listdir(root)
    if any(map(f.endswith, mime)) and os.path.isfile(os.path.join(root, f))
}

class EtagHandler(http.server.BaseHTTPRequestHandler):
    # HTTP/1.1 is needed for keep-alive and correct Content-Length handling.
    protocol_version = 'HTTP/1.1'

    def do_GET(self):
        self.path = os.path.join(root, self.path.lstrip('/') + ('index.html' if self.path == '/' else ''))

        if not os.path.exists(self.path) or not os.path.isfile(self.path):
            self.send_response(404)
            self.end_headers()
        elif self.path not in cache or cache[self.path] != self.headers.get('If-None-Match'):
            content_type = mime.get(os.path.splitext(self.path)[1], mime_fallback)
            with open(self.path, 'rb') as f:
                content = f.read()

            self.send_response(200)
            self.send_header('Content-Length', str(len(content)))
            self.send_header('Content-Type', content_type)
            self.send_header('ETag', cache[self.path])
            self.end_headers()
            self.wfile.write(content)

        else:
            self.send_response(304)
            self.send_header('ETag', cache[self.path])
            self.end_headers()

if __name__ == '__main__':
    PORT = 8080
    print("serving at port", PORT)
    httpd = http.server.HTTPServer(("", PORT), EtagHandler)
    httpd.serve_forever()

One way forward may be to force file_packager to generate chunks of 10 MB and have them load in parallel (same for large wasm files).
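For what it's worth, that chunking step could be sketched as follows. This is a hypothetical helper, not part of file_packager.py; the 10 MB default, the chunk naming, and the manifest layout are all my assumptions:

```python
import json
import os

def split_into_chunks(path, chunk_size=10 * 1024 * 1024, out_dir='.'):
    # Split `path` into fixed-size chunk files so each piece stays under
    # the browser's apparent per-file cache limit, and write a JSON
    # manifest that a loader could fetch first to discover the chunks.
    manifest = {'original': os.path.basename(path), 'chunks': []}
    with open(path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk_name = '{}.{:04d}'.format(os.path.basename(path), index)
            with open(os.path.join(out_dir, chunk_name), 'wb') as out:
                out.write(chunk)
            manifest['chunks'].append({'name': chunk_name, 'size': len(chunk)})
            index += 1
    with open(os.path.join(out_dir, os.path.basename(path) + '.manifest.json'), 'w') as out:
        json.dump(manifest, out)
    return manifest
```

A loader would then fetch the manifest, request all chunks in parallel, and concatenate them in order before handing the bytes to the runtime.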

Maybe related: https://stackoverflow.com/questions/60646737/how-to-cache-large-fie-in-chrome

Related: https://github.com/emscripten-core/emscripten/issues/4711. Maybe worth providing an option for IndexedDB caching for wasm in the meanwhile...

The problem with .wasm is also acute, since it happens on every module reload. Would you have any advice on dumping/reloading the heap or other state, so that I could at least soft-reset the module?

vadimkantorov commented 4 years ago

Btw, some time ago TensorFlow.js did exactly that: it produced and served binary files in chunks. Now I understand this was probably done to force caching.

rth commented 3 years ago

Indeed. Short of chunking the data so that each chunk is less than ~50 MB, there isn't much Emscripten can do about it. It's likely possible to increase that limit in the configuration of a particular browser install (at least it is in Firefox), but that won't apply to other users. At the same time, I understand why browsers have such a limit by default, so asking them to increase it is likely not going to be accepted.

So I think this can likely be closed. There is a specific issue on chunking large files with file_packager.py in https://github.com/emscripten-core/emscripten/issues/12342

vadimkantorov commented 3 years ago

This problem should at least be highlighted in the docs

vadimkantorov commented 3 years ago

For wasm / data files, it may at least be worth creating an issue on the Chromium bug tracker, since it's a legitimate use case for discussion (even if the decision is negative).

kripken commented 3 years ago

A docs PR sounds good (maybe an FAQ entry?)

inliquid commented 3 years ago

I have the same problem with WASM binaries generated by the Go compiler. In particular, my current client binary size is ~14 MB, but it's transferred gzipped as ~2.5 MB, which is not that big after all. However, Chrome keeps reloading it on every page load, no matter what I do. It caches everything else, including big JPEG, PNG, and SVG files. I have set both the max-age and immutable (the latter also not supported by Chrome) header attributes. It's apparently a bug, but from my experience, posting it is just a waste of time. I've been watching this issue for quite some time already.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

inliquid commented 2 years ago

A kind of update: this issue persists when we use a local self-signed certificate for HTTP/2. It seems that wasm files are cached to disk (though not in memory), as expected (?), when we use a certificate obtained from a proper CA.

soyuka commented 1 year ago

@vadimkantorov did you find a solution?

vadimkantorov commented 1 year ago

The only viable solution is to implement chunking in some way for file_packager.py (feature request here: https://github.com/emscripten-core/emscripten/issues/12342, but until it's done automatically you'd need to craft the file lists manually, I guess). Files under 45-50 MB are cached okay by Chrome, but always make sure to test.
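Crafting those file lists could look something like this. The batch_files helper and the 45 MB threshold are assumptions based on the behavior observed in this thread, not anything Emscripten provides:

```python
def batch_files(sizes, limit=45 * 1024 * 1024):
    # sizes: list of (filename, size_in_bytes) pairs.
    # Greedily group files (largest first) into batches whose total size
    # stays under the assumed per-file caching threshold, so each batch
    # can be packaged as its own .data file.
    batches, current, current_size = [], [], 0
    for name, size in sorted(sizes, key=lambda pair: -pair[1]):
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then be packaged separately, e.g. with something like python tools/file_packager.py pkg0.data --preload <files of batch 0> --js-output=pkg0.js, and all the generated loader scripts included on the page.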

For super-large wasm files, I guess one will have to implement chunking and loading the module manually as well. There might be some hiccups if one wants to use streaming compilation, but maybe modern JavaScript makes it possible to implement some kind of streamable ArrayBuffer.

ezyang commented 7 months ago

One thing that I noticed is that Chrome's decision to cache or not can be influenced by the compression algorithm; e.g., a 40 MB text file won't be cached, but if you transmit it gzipped down to 200 KB, it will be cached.
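A minimal sketch of serving compressed responses from a server like the one earlier in the thread. The response_headers helper is hypothetical, and whether gzipping actually changes Chrome's caching decision is an observation from this thread, not documented behavior:

```python
import gzip

def gzip_bytes(payload, level=6):
    # Compress the response body with the stdlib gzip module.
    return gzip.compress(payload, compresslevel=level)

def response_headers(payload, accept_encoding=''):
    # Return (body, headers); compress only when the client's
    # Accept-Encoding header advertises gzip support.
    if 'gzip' in accept_encoding:
        body = gzip_bytes(payload)
        return body, {'Content-Encoding': 'gzip',
                      'Content-Length': str(len(body))}
    return payload, {'Content-Length': str(len(payload))}
```

A handler's do_GET would call response_headers with self.headers.get('Accept-Encoding', '') and emit the returned headers before writing the body.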