btrask / stronglink

A searchable, syncable, content-addressable notetaking system

File caching #38

Open btrask opened 9 years ago

btrask commented 9 years ago

Right now a blog results page spends roughly 1/3 of the time querying and 2/3 sending. That's pretty bad.

Each individual result is stored as a file on disk, which means for 50 results we open, read, and close 50 files. That process is already about as optimized as it can be (3 syscalls, 2 context switches) without actually caching the files in user space.

I think the way to do this is with mmap (cf. the Varnish architecture notes from 2006). The constraints are page size (which makes mapping tiny files wasteful) and address space (which makes mapping huge files wasteful, especially on 32-bit).

Note that individual cached items should be reference counted, and we actually need changes to libuv (uv_write) to do it completely right.

Frankly this should be a reusable library, not StrongLink-specific.

btrask commented 9 years ago

To clarify, we should not require that a local file descriptor stay open while we are blocking on the network. Depending on traffic and cache pressure, we may need to re-open the file each time we need to read or write it! But obviously the cache should generally make sure that doesn't happen.

AFAIK, this is something almost no web server gets right.

btrask commented 8 years ago

Closing and re-opening files, or mmapping partial chunks, potentially breaks atomicity if files are replaced during a transfer. Keeping the fd open or mmapping the whole thing is safer. So handling things as chunks might be a bad idea.