btrask / stronglink

A searchable, syncable, content-addressable notetaking system

More efficient file storage #84

Open btrask opened 9 years ago

btrask commented 9 years ago

Surprisingly, our submission bottleneck is the file system, not the database.

Process for adding a file:

  1. Write it to a temporary location
  2. fsync the file
  3. Atomic rename (actually link(2)) to final location
  4. Open the parent directory
  5. fsync the directory
  6. Close the directory

We don't do any batching, and realistically can't, because the directories accessed are effectively random (they're keyed on the first byte of the hash).

Surprisingly again, there's no reason for our on-disk file representation to actually use content addresses. Once we look up the file info in the database, which we always have to do anyway, we might as well use the file ID or some other sequential ID to access the file.

I'm thinking we could write files to a spread of ~100 directories in batches of 1000. So files 1-1000 go in directory 1, 1001-2000 in directory 2, etc. Then it wraps back around, so with 100 directories, files 100,001-101,000 are back in directory 1 again.
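The ID-to-directory mapping above is just modular arithmetic. A minimal sketch, assuming 1-based file IDs and directory numbers (the constants and function name are illustrative, not anything in StrongLink):

```c
#include <stdint.h>

/* Hypothetical layout parameters: 1000 sequential IDs per directory,
 * wrapping around a pool of 100 directories. */
enum { BATCH = 1000, NDIRS = 100 };

/* IDs 1..1000 -> dir 1, 1001..2000 -> dir 2, ..., and the wrap puts
 * IDs 100,001..101,000 back in dir 1. */
static unsigned storage_dir(uint64_t fileID) {
	return (unsigned)(((fileID - 1) / BATCH) % NDIRS) + 1;
}
```

Because consecutive IDs land in the same directory for a whole batch, a batch submission touches one directory instead of up to 256 hash-prefix directories, which is what makes the single per-batch directory fsync possible.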

For batch submissions (hopefully most submissions, depending on #1), this should cut the per-file syscall overhead from roughly 5 down to 2, plus a constant ~3 for the whole batch. And the number of fsyncs would be cut from 2 per file to 1.

btrask commented 9 years ago

A problem with the above idea of using sequential IDs instead of hashes for internal storage is that we can end up with junk in the file system when a transaction rolls back. Coordinating transactions with the file system so they both get rolled back atomically is hard.

btrask commented 9 years ago

BTW there is also the idea of storing small files directly in the database. I don't know the exact tipping point where it becomes worth it, but most file systems use 4K blocks and LSM trees suffer with large blobs during compaction. A reasonable threshold might be anywhere between 128B and 2KB.
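Wherever the cutoff lands, the routing decision itself is trivial. A sketch, using a hypothetical 1 KB threshold picked from within the 128B-2KB range above:

```c
#include <stddef.h>

/* Hypothetical policy: blobs at or below INLINE_MAX bytes are stored
 * directly in the database; larger files go to the file system.
 * 1 KB is an illustrative cutoff, not a measured tipping point. */
enum { INLINE_MAX = 1024 };

static int should_inline(size_t len) {
	return len <= INLINE_MAX;
}
```

The real threshold would need benchmarking, since the tradeoff depends on both the file system's block size and how badly large values hurt LSM-tree compaction.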