andrewchambers / bupstash

Easy and efficient encrypted backups.
https://bupstash.io
MIT License

Support large flat directories #314

Closed · nh2 closed this 2 years ago

nh2 commented 2 years ago

I have a flat directory with 200 million files in it. Currently Bupstash uses more than 128 GB of memory on it and crashes.

The OOM happens during the phase where it runs statx() on all the contents; the preceding getdents() phase succeeds.

(I have this directory both on CephFS and ZFS, just to highlight that this is not super uncommon.)

It would be nice if Bupstash would not OOM on this.

nh2 commented 2 years ago

In

https://github.com/andrewchambers/bupstash/blob/b0c7b88353ebc0a7d5dff1e202098275ddd9fce9/src/indexer.rs#L549-L551

it looks like it collects all the stat() results into one big vector and then sorts it by file name.

 let mut dir_ents: Vec<(PathBuf, std::io::Result<CombinedMetadata>)> = self.metadata_fetcher.parallel_get_metadata(dir_ent_paths);

That seems to use an unnecessarily large amount of memory: sizeof(struct stat64) == 144 on Linux. Rust's struct Metadata contains this, as well as an Option<StatxExtraFields>, which adds at least another 24 bytes, probably 32 bytes if Rust pads the Option for alignment (not sure). So we're at 176 bytes for struct Metadata.

Now

struct CombinedMetadata {
    stat: std::fs::Metadata,              // 176 Bytes
    xattrs: Option<index::Xattrs>,        // >16 Bytes? when there are no xattrs this is a `BTreeMap<std::ffi::OsString, Vec<u8>>`
    sparse: bool,                         //   8 Bytes
    data_hash: index::ContentCryptoHash,  //  40 Bytes ("Option" around `Blake3([u8; 32])`)
}                                         // ---------
                                          // 240 Bytes total, at least

My files have ~40 char filenames on average, resulting in 280 bytes per dir_ents entry.
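
For what it's worth, these numbers can be sanity-checked with std::mem::size_of using stand-in types; the Xattrs and ContentCryptoHash definitions below are reconstructed from the annotated struct above and may not match bupstash's exactly:

use std::collections::BTreeMap;
use std::ffi::OsString;
use std::mem::size_of;
use std::path::PathBuf;

// Stand-ins reconstructed from the annotated struct above; the real
// index::Xattrs / index::ContentCryptoHash types may differ slightly.
type Xattrs = BTreeMap<OsString, Vec<u8>>;

#[allow(dead_code)]
enum ContentCryptoHash {
    None,
    Blake3([u8; 32]),
}

#[allow(dead_code)]
struct CombinedMetadata {
    stat: std::fs::Metadata,
    xattrs: Option<Xattrs>,
    sparse: bool,
    data_hash: ContentCryptoHash,
}

fn main() {
    // Per-entry in-memory cost, not counting the heap allocations behind
    // PathBuf/OsString or allocator fragmentation.
    println!("std::fs::Metadata: {} bytes", size_of::<std::fs::Metadata>());
    println!("CombinedMetadata:  {} bytes", size_of::<CombinedMetadata>());
    println!(
        "(PathBuf, io::Result<CombinedMetadata>): {} bytes",
        size_of::<(PathBuf, std::io::Result<CombinedMetadata>)>()
    );
}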

Couldn't we get just the file names, sort those, and then get the metadata in a streaming-parallel fashion afterwards?

This would reduce memory usage by 7x in my case, from 56 GB to 8 GB.
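
A minimal sketch of that flow (emit_index_entry is just a placeholder for whatever the indexer does with each entry, and the parallel metadata fetching is omitted for brevity):

use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Phase 1 keeps only the names (roughly len(filename) bytes per entry);
// phase 2 stats in sorted order and hands each result straight downstream
// instead of materializing a Vec<(PathBuf, io::Result<CombinedMetadata>)>.
fn index_large_dir(
    dir: &Path,
    mut emit_index_entry: impl FnMut(&Path, io::Result<fs::Metadata>),
) -> io::Result<()> {
    let mut names: Vec<PathBuf> = fs::read_dir(dir)?
        .map(|ent| ent.map(|e| e.path()))
        .collect::<io::Result<_>>()?;
    names.sort();

    for path in &names {
        emit_index_entry(path, fs::symlink_metadata(path));
    }
    Ok(())
}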

If yes, there's a caveat:

On some file systems like CephFS, the stat() information is stored next to the getdents() information (at least I think so from recent learnings, see this issue). That means that while statting in directory order is very fast, statting in sorted order is very slow, because it leads to massive read amplification (each stat() has to fetch a whole CephFS directory fragment and then discard all but one entry). (Ceph recently had a patch merged that avoids the worst of that read amplification, but it would still cost the network round trip.) So the current approach is actually better for such file systems; CephFS is the only one I know of that does this so far, so bupstash may still want to change its approach for normal local file systems.

However, the current approach just doesn't work for single large dirs, and eventually even the file names won't fit into memory. So I think a more thorough solution for large dirs would make the

let mut dir_ents = self.metadata_fetcher.parallel_get_metadata(dir_ent_paths);

not return Vec<(PathBuf, std::io::Result<CombinedMetadata>)>, but instead use something that can page the results to disk and sort them there (e.g. SQLite, RocksDB, or whatever embedded KV store).

A simpler approach than that would be to just write the Vec<(PathBuf, std::io::Result<CombinedMetadata>)> to disk in getdents() order (a simple mmap() might suffice), and then, instead of sorting that Vec, sort only an array of indices into it. That would require only 8 bytes per entry, instead of 240 + len(filename) bytes plus memory-allocator fragmentation overhead, reducing resident memory usage by 35x.

In my case that would be from 56 GB to 1.6 GB.
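
Rough sketch of the index-sorting part, assuming entries are appended to a temp file as [u32 name_len][name bytes][metadata...] in getdents() order, and using the memmap2 crate (any mmap wrapper would do):

use std::fs::File;
use std::io;
use std::path::Path;

// Assumption: the memmap2 crate.
use memmap2::Mmap;

// Read the length-prefixed file name stored at `off` in the mapped entry log.
// Assumed on-disk layout per entry: [u32 name_len][name bytes][metadata ...].
fn name_at(log: &[u8], off: u64) -> &[u8] {
    let off = off as usize;
    let len = u32::from_le_bytes(log[off..off + 4].try_into().unwrap()) as usize;
    &log[off + 4..off + 4 + len]
}

// Sort the 8-byte offsets into the on-disk entry log by file name, so only
// 8 bytes per entry stay resident; the comparator goes through the page
// cache instead of heap-allocated PathBufs.
fn sort_offsets_by_name(log_path: &Path, mut offsets: Vec<u64>) -> io::Result<Vec<u64>> {
    let file = File::open(log_path)?;
    // Safety: the log file is private to this process and not modified while mapped.
    let log = unsafe { Mmap::map(&file)? };
    offsets.sort_by(|a, b| name_at(&log, *a).cmp(name_at(&log, *b)));
    Ok(offsets)
}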

andrewchambers commented 2 years ago

@nh2 It is definitely worth thinking about and fixing.

One thing I did in another part of bupstash is to lz4-compress such a huge array (in that case the index), which can be effective since this sort of data is highly compressible and is always iterated in order. This would also get rid of fragmentation, because we can estimate a good buffer size for compression up front and keep the data in a single huge allocation.
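
Roughly something like this (lz4_flex here is just a stand-in for the lz4 binding actually used; entry serialization is left abstract):

// Holds serialized entries as lz4-compressed chunks that are only
// decompressed while being iterated in order.
struct CompressedEntryLog {
    chunks: Vec<Vec<u8>>, // each chunk: one compressed batch of serialized entries
    pending: Vec<u8>,     // uncompressed tail, compressed once it reaches CHUNK_SIZE
}

const CHUNK_SIZE: usize = 1 << 20; // 1 MiB of serialized entries per chunk

impl CompressedEntryLog {
    fn new() -> Self {
        CompressedEntryLog { chunks: Vec::new(), pending: Vec::new() }
    }

    fn push(&mut self, serialized_entry: &[u8]) {
        self.pending.extend_from_slice(serialized_entry);
        if self.pending.len() >= CHUNK_SIZE {
            self.flush();
        }
    }

    // Call once after the last push so the tail gets compressed too.
    fn flush(&mut self) {
        if !self.pending.is_empty() {
            self.chunks.push(lz4_flex::compress_prepend_size(&self.pending));
            self.pending.clear();
        }
    }

    // In-order iteration: at most one decompressed chunk is alive at a time.
    fn for_each_chunk(&self, mut f: impl FnMut(&[u8])) {
        for chunk in &self.chunks {
            let buf = lz4_flex::decompress_size_prepended(chunk).expect("corrupt chunk");
            f(&buf);
        }
    }
}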

There is another place such a large flat directory would fail: the stat cache. I think for such directories we can't really do stat caching the same way we do it for smaller directories without overloading the send log database. You can manually disable it with --no-stat-caching.

With regard to paging, I suppose that is the operating system's job, but perhaps we need to help it out somehow.

andrewchambers commented 2 years ago

Another idea is to split the directory at some count/file-size heuristic so we can keep streaming; there is no reason all the entries for a single directory have to be in a single 'CombinedMetadata' batch, but it will take some tweaking of other code to make it work.

We can also combine all the ideas if needed.
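
Roughly the shape of the batching idea (fetch stands in for the existing parallel_get_metadata call and emit for the downstream index writing; assumes the names were already sorted):

// Hypothetical sketch: process the entries of one huge directory in
// fixed-size batches so only one batch of metadata results is resident.
fn index_in_batches<P: Clone, M>(
    sorted_paths: &[P],
    fetch: impl Fn(Vec<P>) -> Vec<M>,
    mut emit: impl FnMut(Vec<M>),
) {
    const BATCH: usize = 100_000; // tune by entry count or cumulative size
    for batch in sorted_paths.chunks(BATCH) {
        // Each batch of results is dropped before the next fetch, so peak
        // memory stays bounded by BATCH rather than the directory size.
        emit(fetch(batch.to_vec()));
    }
}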