buchgr / bazel-remote

A remote cache for Bazel
https://bazel.build
Apache License 2.0
607 stars 156 forks source link

Return zst Decoder to sync.pool when finished #770

Open tobbe76 opened 3 months ago

tobbe76 commented 3 months ago

zstd Decoder was not returned to sync.pool so a new instance was allocated each time. This reduced performance. We can now handle higher loads on the bazel-remote server.

mostynb commented 3 months ago

Hi, thanks for the fix- out of curiosity, how did you find this? I wonder if there's a better linter I could be using here.

I propose that we fix this in a slightly different way (see #771) which covers more cases than this PR- in particular, if the err = dec.Reset(pr) line returns a non-nill error in this PR then the decoder won't be returned to the pool.

tobbe76 commented 3 months ago

Thanks @mostynb I'm a golang beginner took a while to understand the sync.pool.

I used http://remote/debug/pprof/heap on the live server and found a huge total count for this object.
I can also see miljons of allocations here https://github.com/buchgr/bazel-remote/blob/e83828a19e6860c498a6a8737bcb78cf452dbe65/cache/disk/casblob/casblob.go#L580 I have added a sync pool here also and tested it locally with good results but not yet tested on the production server. What do you think, is it a good idea?
Why do a fsync here? It's not good for ssd performance. https://github.com/buchgr/bazel-remote/blob/e83828a19e6860c498a6a8737bcb78cf452dbe65/cache/disk/casblob/casblob.go#L637

mostynb commented 3 months ago

Thanks @mostynb I'm a golang beginner took a while to understand the sync.pool.

I used http://remote/debug/pprof/heap on the live server and found a huge total count for this object. I can also see miljons of allocations here

https://github.com/buchgr/bazel-remote/blob/e83828a19e6860c498a6a8737bcb78cf452dbe65/cache/disk/casblob/casblob.go#L580

I have added a sync pool here also and tested it locally with good results but not yet tested on the production server. What do you think, is it a good idea?

Sounds good- do you want to open a new PR for that?

Why do a fsync here? It's not good for ssd performance. https://github.com/buchgr/bazel-remote/blob/e83828a19e6860c498a6a8737bcb78cf452dbe65/cache/disk/casblob/casblob.go#L637

That is done to reduce the chance of cache corruption. I think this is most important for action cache entries which can't be verified by their hash and large/compressed CAS blobs which would be costly to verify.