diskfs / go-diskfs


squashfs: caching of metadata and fragment blocks #205

Closed ncw closed 8 months ago

ncw commented 8 months ago

I've been using this 3GB squashfs as a performance test.

With the Open optimization in #204, the timings to unpack the archive are as follows

That is quite a difference!

However I found that with a small LRU cache of metadata blocks and fragment blocks I can reduce that to

Which is a huge difference. The problem is that it is quite CPU expensive to decompress the pages and that comes to dominate the runtime without caching.

I've done a proof of concept in the fix-caching branch.

This works extremely well, so I think it is worth considering for inclusion. However, there are some open questions I'd like to discuss before submitting a PR.

How should the cache be configured?

I've arbitrarily set the caches to hold 64 metadata blocks and 64 fragment blocks, cached for 10s. This can use a reasonable amount of memory (2 * 64 * 128k = 16M) but that seems affordable.

Should this be configured when creating the FileSystem? Or maybe the user should supply their own caching mechanism. An interface like

type Cache interface {
    Get(pos int64) []byte
    Put(pos int64, buf []byte)
}

would be enough. One cache for metadata and one for fragments seems sensible. A cache for the data blocks isn't needed after caching the block in the file handle, but if you were re-opening files then it might be useful too.
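For illustration, here is a rough sketch of how a reader might consult such a cache around the expensive decompression step. The metaReader type and its load function are hypothetical, not part of go-diskfs; the point is only that the cache sits in front of the read-and-decompress path.

package squashfs

// Cache is the interface suggested above.
type Cache interface {
    Get(pos int64) []byte
    Put(pos int64, buf []byte)
}

// metaReader is a hypothetical wrapper around the read-and-decompress path.
type metaReader struct {
    cache Cache
    load  func(pos int64) ([]byte, error) // reads and decompresses one block from disk
}

// block returns the decompressed block starting at pos, consulting the cache
// first so repeated lookups skip the CPU-expensive decompression.
func (r *metaReader) block(pos int64) ([]byte, error) {
    if r.cache != nil {
        if buf := r.cache.Get(pos); buf != nil {
            return buf, nil
        }
    }
    buf, err := r.load(pos)
    if err != nil {
        return nil, err
    }
    if r.cache != nil {
        r.cache.Put(pos, buf)
    }
    return buf, nil
}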

Memory use

In practice this seems to use a lot more memory than expected - I suspect it is the go garbage collector being slow to recycle the 128k blocks!

We should probably be recycling the memory blocks - a sync.Pool of fs.blocksize blocks seems sensible and would probably help with the memory use. I'm not a huge fan of sync.Pool, having had lots of problems with it in rclone, but if it doesn't work there are lots of alternatives.
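A minimal sketch of what such a pool might look like, assuming a fixed 128k block size (the real code would use fs.blocksize); the names here are illustrative only.

package squashfs

import "sync"

const blockSize = 128 * 1024 // stand-in for fs.blocksize

// blockPool recycles block-sized buffers so the garbage collector does not
// have to reclaim a fresh 128k allocation for every block read or decompressed.
var blockPool = sync.Pool{
    New: func() any { return make([]byte, blockSize) },
}

// getBlock fetches a reusable buffer from the pool.
func getBlock() []byte {
    return blockPool.Get().([]byte)[:blockSize]
}

// putBlock returns a buffer to the pool once nothing references it any more.
func putBlock(buf []byte) {
    if cap(buf) >= blockSize {
        blockPool.Put(buf[:blockSize])
    }
}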

Doing this would involve a bit of re-plumbing but I think it would generate quite a small patch.

After the caching, memory copying is the top thing left in the profile!

Caching library

My proof of concept uses the github.com/hashicorp/golang-lru/v2/expirable LRU cache which is widely used but not necessarily the right one.

I chose this just because it is already a dependency of rclone! However I'd be happy to use a different one or write a specific one.
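A minimal usage sketch of that package, with the sizes mentioned above (64 entries, 10s expiry); the key and value types are my assumption of the shape (block position to decompressed block), not necessarily what the proof of concept does:

package main

import (
    "fmt"
    "time"

    "github.com/hashicorp/golang-lru/v2/expirable"
)

func main() {
    // LRU of up to 64 entries, each expiring 10s after it was added.
    // Keys are block positions in the squashfs file; values are the
    // decompressed blocks.
    cache := expirable.NewLRU[int64, []byte](64, nil, 10*time.Second)

    cache.Add(4096, []byte("decompressed block goes here"))
    if buf, ok := cache.Get(4096); ok {
        fmt.Printf("cache hit: %d bytes\n", len(buf))
    }
}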

Anti-dogpile

If you look at the code you'll see some small anti-dogpile code (see the dogpile effect). This is absolutely essential when using the library from multiple goroutines (as rclone does). There might be a library to do this, but the amount of code is small.
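I have not reproduced the branch code here, but the pattern is per-key deduplication of in-flight loads, similar in spirit to golang.org/x/sync/singleflight. A rough sketch, with hypothetical names:

package squashfs

import "sync"

// call tracks one in-flight load of a block.
type call struct {
    done chan struct{}
    buf  []byte
    err  error
}

// dedup ensures only one goroutine performs the expensive read+decompress
// for a given block position; concurrent callers for the same position wait
// for that result instead of piling on (the dogpile effect).
type dedup struct {
    mu       sync.Mutex
    inflight map[int64]*call
}

func (d *dedup) do(pos int64, load func() ([]byte, error)) ([]byte, error) {
    d.mu.Lock()
    if d.inflight == nil {
        d.inflight = make(map[int64]*call)
    }
    if c, ok := d.inflight[pos]; ok {
        d.mu.Unlock()
        <-c.done // another goroutine is loading this block; wait for it
        return c.buf, c.err
    }
    c := &call{done: make(chan struct{})}
    d.inflight[pos] = c
    d.mu.Unlock()

    c.buf, c.err = load()
    close(c.done)

    d.mu.Lock()
    delete(d.inflight, pos)
    d.mu.Unlock()
    return c.buf, c.err
}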


Anyway, your thoughts are much appreciated.

deitch commented 8 months ago

rclone with this library and a bit of caching - 17s

That still is 2x over unsquashfs, although it is still an 8x improvement over current, so a very big step. I would like to get this as efficient as unsquashfs, but I do not have the time to start digging into their code. As I recall, squashfs and unsquashfs are CPU intensive.

Should this be configured when creating the FileSystem? Or maybe the user should supply their own caching mechanism. An interface like

My instinct is to provide it with configurable knobs. It isn't hard to implement (you already did it), it is highly unlikely to be something different people want different implementations for. I would give it to them with a knob (cache memory size) and a reasonable default (as you listed). When someone comes along and says, "but I really want to do my own implementation", we can worry about extracting it then. I don't think that is going to happen.

In practice this seems to use a lot more memory than expected

How much more are we talking? 30%? 75%? 10x?

I suspect it is the go garbage collector being slow to recycle the 128k blocks! After the caching, memory copying is the top thing left in the profile!

Oh, how I dislike trying to get to the bottom of things like that. Exciting when you succeed, frustrating for a while until you do.

My proof of concept uses the github.com/hashicorp/golang-lru/v2/expirable LRU cache which is widely used but not necessarily the right one

Any licensing concerns? Hashi has mostly been keeping libraries under real OSS licenses while moving products to BUSL, but there is no guarantee they will stay that way. Looks like that library is MPL, which I have not used alongside MIT (this library). I think they are compatible, but I haven't checked.

If you look at the code you'll see a small anti-dogpile code

It is just a lock on reading something that might also have to read from the filesystem. When I have done things like that, most of the time I made my cache a read-through cache, so it can handle the lock internally.
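Not the code from the branch, but roughly what a read-through cache looks like: the caller only asks for a position, and the cache fills misses itself while holding its own lock. (Holding the lock across the load keeps the sketch simple, but it serializes all misses; a real implementation would want per-key locking as in the earlier sketch.)

package squashfs

import "sync"

// readThroughCache loads missing entries itself, so callers never have to
// coordinate cache fills or locking.
type readThroughCache struct {
    mu      sync.Mutex
    entries map[int64][]byte
    load    func(pos int64) ([]byte, error) // reads and decompresses a block
}

// Get returns the block at pos, loading and caching it on a miss.
func (c *readThroughCache) Get(pos int64) ([]byte, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if buf, ok := c.entries[pos]; ok {
        return buf, nil
    }
    buf, err := c.load(pos)
    if err != nil {
        return nil, err
    }
    if c.entries == nil {
        c.entries = make(map[int64][]byte)
    }
    c.entries[pos] = buf
    return buf, nil
}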

ncw commented 8 months ago

rclone with this library and a bit of caching - 17s

That still is 2x over unsquashfs, although it is still an 8x improvement over current, so a very big step. I would like to get this as efficient as unsquashfs, but I do not have the time to start digging into their code. As I recall, squashfs and unsquashfs are CPU intensive.

unsquashfs was written by master kernel devs in C so I think to get within 2x is pretty good :-)

Taking a quick squint at the unsquashfs code, I see it has a cache keyed on the file position, which is the same approach I took. The cache has a maximum size, which I couldn't figure out in a quick skim. It also has a free-blocks list, which is what I think we are missing here.

Should this be configured when creating the FileSystem? Or maybe the user should supply their own caching mechanism. An interface like

My instinct is to provide it with configurable knobs. It isn't hard to implement (you already did it), it is highly unlikely to be something different people want different implementations for. I would give it to them with a knob (cache memory size) and a reasonable default (as you listed). When someone comes along and says, "but I really want to do my own implementation", we can worry about extracting it then. I don't think that is going to happen.

:-)

In practice this seems to use a lot more memory than expected

How much more are we talking? 30%? 75%? 10x?

It seems to take something like 800M rather than the 16M I was expecting! I haven't spent time tracking this down yet...

I suspect it is the go garbage collector being slow to recycle the 128k blocks! After the caching, memory copying is the top thing left in the profile!

Oh, how I dislike trying to get to the bottom of things like that. Exciting when you succeed, frustrating for a while until you do.

Yes!

I had lots of problems like this in rclone, but they were mostly fixed by recycling memory blocks rather than allocating them and letting the garbage collector free them.

We can do this too - there aren't many places where the blocks of memory are allocated: when they are read from disk and when they are decompressed.

My proof of concept uses the github.com/hashicorp/golang-lru/v2/expirable LRU cache which is widely used but not necessarily the right one

Any licensing concerns? Hashi has mostly been keeping libraries under real OSS licenses while moving products to BUSL, but there is no guarantee they will stay that way. Looks like that library is MPL, which I have not used alongside MIT (this library). I think they are compatible, but I haven't checked.

If we are managing the memory more effectively, it probably makes sense to combine the cache with the memory recycling, so that when we want a new block we take the least recently used block out of the cache and reuse it.
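A sketch of that idea, assuming a simple LRU keyed on file position (names are illustrative, not from any PR): Put hands back the evicted buffer so the caller can reuse it for the next read, meaning at steady state no new blocks are allocated at all.

package squashfs

import "container/list"

type cacheEntry struct {
    pos int64
    buf []byte
}

// recyclingCache is a fixed-size LRU keyed on file position that returns
// evicted buffers to the caller for reuse instead of leaving them to the GC.
type recyclingCache struct {
    max     int
    order   *list.List // front = most recently used
    entries map[int64]*list.Element
}

func newRecyclingCache(max int) *recyclingCache {
    return &recyclingCache{
        max:     max,
        order:   list.New(),
        entries: make(map[int64]*list.Element),
    }
}

// Get returns the cached block at pos, marking it most recently used.
func (c *recyclingCache) Get(pos int64) ([]byte, bool) {
    if el, ok := c.entries[pos]; ok {
        c.order.MoveToFront(el)
        return el.Value.(*cacheEntry).buf, true
    }
    return nil, false
}

// Put stores buf under pos (assumed not already present). If the cache is
// full it evicts the least recently used entry and returns its buffer so the
// caller can reuse that memory for the next read.
func (c *recyclingCache) Put(pos int64, buf []byte) (recycled []byte) {
    if c.order.Len() >= c.max {
        el := c.order.Back()
        old := el.Value.(*cacheEntry)
        c.order.Remove(el)
        delete(c.entries, old.pos)
        recycled = old.buf
    }
    c.entries[pos] = c.order.PushFront(&cacheEntry{pos: pos, buf: buf})
    return recycled
}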

If you look at the code you'll see a small anti-dogpile code

It is just a lock on reading something that might also have to read from the filesystem. When I have done things like that, most of the time I made my cache a read-through cache, so it can handle the lock internally.

:+1:

deitch commented 8 months ago

unsquashfs was written by master kernel devs in C so I think to get within 2x is pretty good :-)

When you put it like that, sounds like it is worth a šŸŗ to celebrate.

It seems to take something like 800M rather than the 16M I was expecting! I haven't spent time tracking this down yet...

Um, wow. OK, that matters.

I had lots of problems like this in rclone, but they were mostly fixed by recycling memory blocks rather than allocing them and getting the garbage collector to free them.

In other words, doing your own memory management. Pretty soon you will implement your own threads and channels, and you will have your own Zig or Rust. :-)

If we are managing the memory more effectively, it probably makes sense to combine the cache with the memory recycling, so that when we want a new block we take the least recently used block out of the cache and reuse it.

šŸ‘

deitch commented 8 months ago

I did drop you an email; let me know if you did not get it.

ncw commented 8 months ago

I have dropped a PR https://github.com/diskfs/go-diskfs/pull/206 with a configurable cache.

I drilled into the memory problems and yes, it is just Go being slow to reclaim blocks. Setting GOGC=10 makes it use less memory which is quite conclusive.

There might be more memory optimization to be done to check that we aren't pinning the big blocks in memory by holding a reference to part of them (e.g. a byte slice in a FileInfo), but I haven't done that yet.

I did experiment with pools for buffers and re-using memory blocks for the decompressors but it didn't make a significant difference so I left it out of the PR.

I ended up implementing my own LRU cache, which means it can do the read-through exactly how we want it.

deitch commented 8 months ago

I drilled into the memory problems and yes, it is just Go being slow to reclaim blocks. Setting GOGC=10 makes it use less memory which is quite conclusive.

Yeah, but then we are telling other people how to compile things, when this should just be a library.

Well, if nothing else, I can see you getting a call from one of the financial or big tech firms to be the expert optimizer for their golang services. I know people who have made a mint for years doing that for Java. šŸ˜†

ncw commented 8 months ago

I drilled into the memory problems and yes, it is just Go being slow to reclaim blocks. Setting GOGC=10 makes it use less memory which is quite conclusive.

Yeah, but then we are telling other people how to compile things, when this should just be a library.

Setting GOGC is quite common to trade off CPU for memory in go programs. It came up enough times that it made it into the rclone FAQ!
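For reference, GOGC is the runtime's garbage collection target percentage; the same knob is also available programmatically, which an application (rather than this library) could set for itself. A minimal sketch:

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Equivalent to running with GOGC=10: collect once the heap grows 10%
    // past the live set, trading extra GC CPU time for lower memory use.
    // The default is 100.
    previous := debug.SetGCPercent(10)
    fmt.Println("previous GOGC setting:", previous)
}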

Well, if nothing else, I can see you getting a call from one of the financial or big tech firms to be the expert optimizer for their golang services. I know people who have made a mint for years doing that for Java. šŸ˜†

:-) Go is much better behaved than Java in my experience. I believe this is because Go is good at allocating things on the stack with its escape analysis. The Go GC is optimised for low latency which doesn't always mean low memory though.

deitch commented 8 months ago

Setting GOGC is quite common to trade off CPU for memory in go programs. It came up enough times that it made it into the rclone FAQ!

Now that surprises me. I would expect that users would never think of things like that, or if they do, that it would be a CLI flag.

Go is much better behaved than Java in my experience. I believe this is because Go is good at allocating things on the stack with its escape analysis. The Go GC is optimised for low latency which doesn't always mean low memory though.

My experience at financials (I spent a lot of years in them) is usually a happy willingness to trade off resources (CPU+memory) to improve latency.