borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

reimplement files cache using hashindex code #5756

Open ThomasWaldmann opened 3 years ago

ThomasWaldmann commented 3 years ago

The files cache currently uses Python dicts because cache entries also contain variable-length lists (the chunks list). This makes them easy to deal with, but it also comes with some memory overhead.
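To make that overhead concrete, here is a rough stdlib-only illustration (exact numbers vary by CPython version and platform; the entry values are made up, and the namedtuple mirrors the one quoted later in this thread):

```python
import sys
from collections import namedtuple

# Mirrors the files cache entry layout discussed in this issue.
FileCacheEntry = namedtuple('FileCacheEntry', 'age inode size cmtime chunk_ids')

# A typical single-chunk entry for a small file (illustrative values).
entry = FileCacheEntry(0, 123456, 4096, 1_600_000_000_000_000_000, [b'\x00' * 32])

# Each container (tuple, list, bytes) carries a CPython object header on
# top of its payload, so the live objects use considerably more memory
# than the ~76 bytes of raw field data they represent.
overhead = sum(sys.getsizeof(x) for x in (entry, entry.chunk_ids, *entry.chunk_ids))
assert overhead > 76
```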

@RonnyPfannschmidt had this idea for a more efficient replacement:

ThomasWaldmann commented 3 years ago

@RonnyPfannschmidt do you want to implement that? We could offer a bounty for this.

RonnyPfannschmidt commented 3 years ago

Potentially, but not soon; currently I am stretched rather thin.

ThomasWaldmann commented 3 years ago

Current implementation:

```python
FileCacheEntry = namedtuple('FileCacheEntry', 'age inode size cmtime chunk_ids')
```

- age: 0 .. max_ttl, int
- inode: see st_ino, int 64 bit?
- size: see st_size, int 64 bit
- cmtime: see st_(c|m)time_ns, currently int 64 bit or bigint bytestr, 64 bit should be ok!?
- chunk_ids: list of chunk hashes, 256 bit each

```python
path_hash = key.id_hash(filepath)
files[path_hash] = msgpack.packb(FileCacheEntry())
```

Notable:

ThomasWaldmann commented 3 years ago

New implementation idea:

FileCache hashindex entry:

```
age:             32 bit  (shared for in-band hashtable management)
inode:           64 bit
size:            64 bit
cmtime:          64 bit
chunk_ids_start: 64 bit
chunk_ids_count: 64 bit
path_hash:      256 bit  (key)
```

Plus a separate chunk_ids array of chunk_id elements, 256 bit each.
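A minimal sketch of such a fixed-width entry using Python's `struct` module (field order and packing here are illustrative assumptions; borg's real hashindex is C code with its own layout):

```python
import struct

# Fixed-size files cache entry as proposed above: a 256-bit path_hash key,
# a 32-bit age, then five 64-bit fields (inode, size, cmtime,
# chunk_ids_start, chunk_ids_count). '<' means packed little-endian,
# so there is no padding between fields.
ENTRY = struct.Struct('<32sIQQQQQ')
assert ENTRY.size == 76  # 32 + 4 + 5 * 8 bytes per table slot

CHUNK_ID_SIZE = 32  # each chunk id in the side array is 256 bit

packed = ENTRY.pack(b'\xab' * 32, 0, 123456, 4096,
                    1_600_000_000_000_000_000, 0, 1)
assert len(packed) == ENTRY.size
```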

Notable:

ThomasWaldmann commented 3 years ago

For people running into memory issues, I guess we can assume they have a lot of relatively small files, so there is only one chunk in the chunk_ids list.

So, do I estimate correctly that we'd usually save 65 - 16 == 49 bytes per file?

The new implementation would need 108 bytes per file.

So, coming from the current implementation, savings would be ~30%.

If I didn't miscalculate/estimate something, that makes it look like a lot of work for not much savings.

Could still make sense in the context of #3868, where a fixed-length record size might be useful.
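Taking the figures above at face value (one chunk per file), the estimate can be checked with a little arithmetic:

```python
# Back-of-envelope check of the savings estimate (single-chunk files).
fixed_entry = 32 + 4 + 5 * 8          # path_hash + age + five 64-bit fields = 76
new_per_file = fixed_entry + 32       # plus one 256-bit chunk id
assert new_per_file == 108            # matches the 108 bytes quoted above

claimed_saving = 65 - 16              # 49 bytes, per the estimate above
old_per_file = new_per_file + claimed_saving  # implied current cost, ~157 bytes
assert round(100 * claimed_saving / old_per_file) == 31  # roughly 30% savings
```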

ThomasWaldmann commented 3 years ago

@RonnyPfannschmidt can you check? ^^^

RonnyPfannschmidt commented 3 years ago

I'll give it a spin later in the evening; I'm currently working on the details of a memory layout idea that may be useful for sorting it differently.

RonnyPfannschmidt commented 3 years ago

I have a basic idea for a structure that allows using shareable memory-mapped files and potentially 32 bits for most values, going by the assumption that in practice it is very unlikely to observe more than 2**30 chunk items there.

The idea needs to be tested; I will try once the toddler sleeps.

RonnyPfannschmidt commented 3 years ago

@ThomasWaldmann my initial experiment failed and needs some better planning.

I believe I'm able to provide a mmap-backed variant of the hashindex. In order to get that working correctly, I need to experiment to understand the involved data structures and get a handle on the bucketing/reading/mapping.
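For illustration, a toy mmap-backed table of fixed-size slots (a sketch of the general idea only; borg's hashindex has its own bucket layout, collision handling, and header format, none of which is modeled here):

```python
import mmap
import os
import struct
import tempfile

# One slot: a 256-bit key plus a 64-bit value; real entries would be wider.
SLOT = struct.Struct('<32sQ')
NSLOTS = 1024

path = os.path.join(tempfile.mkdtemp(), 'index.bin')
with open(path, 'wb') as f:
    f.truncate(SLOT.size * NSLOTS)        # pre-size the backing file

with open(path, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)         # shared mapping, shareable between processes
    key = b'\xab' * 32
    slot = int.from_bytes(key[:8], 'little') % NSLOTS  # naive bucketing, no probing
    SLOT.pack_into(mm, slot * SLOT.size, key, 42)
    got_key, got_value = SLOT.unpack_from(mm, slot * SLOT.size)
    assert (got_key, got_value) == (key, 42)
    mm.close()
```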

Gelma commented 3 years ago

Thanks a lot for your effort, guys.

Just for the record: in my case, with ~45,000,000 files, Borg uses ~11 GB of RAM. Also, the startup takes minutes (I guess it is the Python unpickling of objects).

I also see some swap-in/swap-out of the process (I guess the data gets updated, with all the work of the interpreter under the hood).

Anyway, I just kindly ask: what about using sqlite to store them? Then, no matter the number of files, the RAM usage would stay the same.

Thanks a lot again, Gelma

ThomasWaldmann commented 3 years ago

@Gelma borg accesses these hashtables a lot and my gut feeling about speed is: hashindex (RAM) > hashindex (mmap) >> sqlite (disk).

Especially for non-first backups of mostly unchanged files, the high speed of the in-memory hashtables makes borg very fast.

So the minutes of startup time and some RAM usage may be a good investment when the alternative is a runtime that is hours or days longer due to disk accesses.

ThomasWaldmann commented 3 years ago

@RonnyPfannschmidt maybe we can split the "save RAM" (by using less RAM), "mmap compatible" (layout) and "actually use mmap" (only have in RAM what we access) aspects. The first two are likely generally useful, the last only maybe.

The Attic author tried to use a mmapped hashindex long ago and it was unsuccessful (too slow). But back then, there was no fadvise DONTNEED or similar stuff in the code.
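Since Python 3.8, the `mmap` module exposes `madvise`, so the "only keep in RAM what we access" behavior could at least be hinted at from Python today (availability of `MADV_DONTNEED` is platform dependent; this is a sketch, not borg code):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'index.bin')
with open(path, 'wb') as f:
    f.truncate(mmap.PAGESIZE * 16)        # 16-page backing file

with open(path, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[:8] = b'touched!'                  # fault the first page in
    if hasattr(mmap, 'MADV_DONTNEED'):    # guard: not exposed on all platforms
        # Hint the kernel that this range is no longer needed; for a shared
        # file mapping, the pages can be dropped and re-read on next access.
        mm.madvise(mmap.MADV_DONTNEED)
    assert mm[:8] == b'touched!'          # data survives via page cache / file
    mm.close()
```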

RonnyPfannschmidt commented 3 years ago

The data structure I'm proposing can potentially achieve both (everything should be able to work without mmap).

I'll have to demo it