Swirrl / drafter

A clojure service and a client to it for exposing data management operations to PMD

Stasher cleanup is inefficient in production workloads #320

Open danmidwood opened 5 years ago

danmidwood commented 5 years ago

Bug: When Stasher's cleaning routine removes files from the cache it leaves the now-empty directories in place. These clog the disk (each one takes 4KB), and they also slow Stasher's cleaning process (high memory usage, lots of garbage collection).


Interesting Story

On Thursday 27th June 2019 we received alerts from the Scotland Production environment because Stasher's cleanup routine was using lots of memory, repeatedly hitting the heap limit and triggering garbage collection.

The cleanup process reads the metadata for every file in the cache in order to count the space used and decide what to remove. It was this file metadata that was using all of the memory.
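
Roughly speaking the scan amounts to something like the sketch below (a simplified illustration only, not the actual cache_clearer code, but it shows why the metadata for every entry ends up in memory at once):

(require '[clojure.java.io :as io])

;; Simplified sketch of the kind of scan the cleaner performs: walk the whole
;; cache tree, stat every entry and keep the metadata around to sum sizes and
;; pick removal candidates. With millions of entries (and lots of empty
;; directories) that is a lot of objects held at once.
(defn cache-usage [cache-dir]
  (let [entries (->> (file-seq (io/file cache-dir))
                     (filter (fn [^java.io.File f] (.isFile f)))
                     (mapv (fn [^java.io.File f]
                             {:path (.getPath f)
                              :size (.length f)
                              :last-modified (.lastModified f)})))]
    {:total-bytes (reduce + 0 (map :size entries))
     :entries entries}))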

Investigating the cache on the disk showed a lot of empty folders that had been left behind by previous cleanups. These were being included in the read, which meant that far more file metadata than necessary was being loaded into memory. Hence the garbage collection.

The cache folder on disk measured 47GB

dan@sg-production-2019-feb:~$ sudo du -sh /var/lib/stardog-home/stasher/cache
47G    /var/lib/stardog-home/stasher/cache

I removed the empty folders and the cache measured 28GB

dan@sg-production-2019-feb:~$ sudo find /var/lib/stardog-home/stasher-old/cache/ -type d -empty -delete

dan@sg-production-2019-feb:~$ sudo du -sh /var/lib/stardog-home/stasher-old/cache/
28G /var/lib/stardog-home/stasher-old/cache/

I transferred the cache over to my local machine and ran the stasher file/disk counter; the actual cached data came to ~6.74GB

2019-06-28 10:50:35 INFO  cache-clearer:286 - Archiving at 8589934592 (8GB)
2019-06-28 11:02:19 INFO  cache-clearer:286 - Cache size is 7246341214 (6.74GB)

So we had ~40GB of directory tree structure before, and ~21GB after. All supporting just 6.74GB of actual cache.

RickMoynihan commented 5 years ago

Good debugging @danmidwood, what do you propose we do for a fix? Presumably we just need to remove the parent directories when there is no longer a cached file in them?
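
Something like the sketch below is what I have in mind (untested, and the cache-root handling and function name are just assumptions for illustration):

(require '[clojure.java.io :as io])

;; Sketch: after archiving/deleting a cached file, walk back up towards the
;; cache root and remove any directories that are now empty. Stops at the
;; cache root, or at the first directory that still has contents.
(defn prune-empty-parents! [^java.io.File cache-root ^java.io.File deleted-file]
  (loop [dir (.getParentFile deleted-file)]
    (when (and dir
               (not= (.getCanonicalFile dir) (.getCanonicalFile cache-root))
               (empty? (.list dir)))       ; nothing left in this directory
      (when (.delete dir)                  ; only keep walking up if the delete succeeded
        (recur (.getParentFile dir))))))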

danmidwood commented 5 years ago

That's a good start.

I think we should also re-evaluate whether we want to retain this directory structure. We use the paths to encode data (timestamps, query hashes), which is quite necessary. But doing this means the filesystem allocates 4KB per directory, while the cached data is often a lot smaller.

As mentioned above, even after cleaning out all of the empty directories the cache remaining on disk was 21GB, made up of less than 7GB of data and ~14GB of directories (at 4KB per directory that's roughly 3.5 million directory entries).

RickMoynihan commented 5 years ago

I think we should also re-evaluate whether we want to retain this directory structure. We use the paths to encode data (timestamps, query hashes).

What are our options though? I think ultimately there are three: we either 1) use some kind of blob database, 2) use the file system, or 3) use a database of metadata plus the filesystem.

If we stick with 2 then I think we need the first two directories /4c/3a, which will likely be shared in a cache of sufficient size anyway, as without them we'll start hitting basic limitations with ls-ing lots of files etc. So unless there's another easy workaround for those issues, the only option seems to be to merge the final sha and timestamp together in the filename, saving 4KB per key, i.e. represent 57/4a/574a2bddab62b6916ce9e37961d0f7ca/1554894110922.brt as 57/4a/574a2bddab62b6916ce9e37961d0f7ca-1554894110922.brt and split on -. Or we could look at merging the first two directories, e.g. 574a instead of 57/4a/, though I suspect that would only save about 200MB of space on a large cache (i.e. any cache with 65k+ cache keys).
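
For illustration, the flattening amounts to something like this (the function names are made up, just to show the round-trip):

(require '[clojure.string :as str])

;; Sketch of the proposed layout: keep the two sharding directories, but put
;; the hash and timestamp into a single filename and split on "-" when reading.
(defn entry->path [sha timestamp ext]
  (str (subs sha 0 2) "/" (subs sha 2 4) "/" sha "-" timestamp "." ext))

(defn path->entry [path]
  (let [filename        (last (str/split path #"/"))
        [basename ext]  (str/split filename #"\." 2)
        [sha timestamp] (str/split basename #"-" 2)]
    {:sha sha :timestamp (Long/parseLong timestamp) :ext ext}))

;; (entry->path "574a2bddab62b6916ce9e37961d0f7ca" 1554894110922 "brt")
;; => "57/4a/574a2bddab62b6916ce9e37961d0f7ca-1554894110922.brt"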

It would be interesting to know how many files were in the active cache, so we can see the savings we'd get by redistributing the keys across directories, e.g. storing as 57/4/574a2bddab62b6916ce9e37961d0f7ca/1554894110922.brt might also save us a lot by reducing the number of second-level directories by a factor of 16 (at the cost of 16x more entries per directory).

Ultimately though I'm beginning to think this is an inefficiency we can live with, so long as we clean up the dead directories properly. Storage is cheap after all.

RickMoynihan commented 4 years ago

Another interesting story

https://swirrl.slack.com/archives/C9DCVKU4F/p1582139566044600

TL;DR: Our stasher cache on Scotland hit the filesystem's limit on the number of files, i.e. we ran out of inodes. The ext4 volume we mounted this on had an inode limit of 16M, and stasher filled them all up. In other words we did not run out of space (though the error messages misleadingly say we did); we ran out of inodes, as reported by df -i.

This is a similar issue to the one above, but the immediate cause is quite different. I mention it because any resolution to the above issue should also aim to resolve this one. From slack #ops-alerts:

[rickmoynihan] Ok looks like the cache clearer hasn’t archived any files over the past few days… Yet it looks like it should try and delete files with a 30 minute delay between clean attempts, when the cache is at 80% of 10GB, and try and find candidates to remove until it is at 60% of 10GB. Incidentally I did see a handful of files in the archive (where files go before deletion), which I manually deleted too… Though there were maybe only 20 files there.

I took a quick look at the code and summarise the approach roughly as this (it's slightly inaccurate, so whoever properly looks at this should take it as a starting point for investigation/questioning):

When finding candidates it tries to first find “expired files”, which looks like the least recently accessed file. I might be reading the code wrong here; but it looks like this is a single file (not many)!!! Flagging just one file to delete for this criterion every 30 minutes seems suboptimal, as we may create tens of thousands of entries in that time.

Then it attempts to find a set of “old files” that try to fill its % quota for deletion. Then it moves these files to the archive, for later deletion.

Looking at the code here I see there are likely a number of problems.

  1. We didn’t expect quite so many files to be in the cache. (It also seems most files are smaller than 4kb).

  2. We had nearly 16m files in the cache and I suspect our code to list all files, and our code to sort them by access time etc… is really inefficient… it might be taking a really long time to finish.

  3. I don’t think we expected so many small files, which seem to be >99% of them. I suspect this means our code to calculate disk usage isn’t working as expected (or is taking too long to work). We appear to bias towards deleting big files, rather than trying to balance the number of cache entries. I think it would help if we could also control the number of cache entries, i.e. keep it to maybe a few million tops and arbitrarily delete entries beyond that.

We could perhaps first try to keep the number of cache keys down by randomly selecting a subtree from the first few characters of the hash space, and then try to expunge infrequently accessed entries from within that. i.e. I think we need an approach based on sampling the hash space from the top, rather than assembling a list of all the files.
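
As a sketch of the kind of thing I mean (names, layout and the delete-fn hook are assumptions, untested):

(require '[clojure.java.io :as io])

;; Sketch of a sampling-based clean: instead of listing all ~16m files, pick
;; one top-level shard at random, list only the files under it, and expunge
;; the least recently accessed entries within that shard. Repeating this keeps
;; the metadata held in memory down to one shard's worth at a time.
(defn clean-random-shard! [cache-root delete-fn keep-fraction]
  (let [shards (->> (.listFiles (io/file cache-root))
                    (filterv (fn [^java.io.File f] (.isDirectory f))))]
    (when (seq shards)
      (let [shard   (rand-nth shards)
            files   (->> (file-seq shard)
                         (filter (fn [^java.io.File f] (.isFile f))))
            n-keep  (long (* keep-fraction (count files)))
            victims (->> files
                         (sort-by (fn [^java.io.File f] (.lastModified f))) ; oldest first
                         (drop-last n-keep))]
        (doseq [f victims]
          (delete-fn f))))))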

Incidentally even unix tools like find . | wc -l take a long, long time (maybe 20+ minutes) to iterate over all the files in the cache on scotland.

RickMoynihan commented 4 years ago

I suspect these lines are problematic with large numbers of files:

https://github.com/Swirrl/drafter/blob/30a41e46af3925c967fbcc570230d326c34d6934/drafter/src/drafter/stasher/cache_clearer.clj#L35-L42

I think if this listed files drawn from a randomly selected sample keyed on the first 2 or 4 hex characters of the hash (i.e. the first one or two directory levels) it would be much better, as with a cache of 16m files each batch would contain approximately 65k or 256 files.

And also this (function name is misleading as it takes a predicate):

https://github.com/Swirrl/drafter/blob/30a41e46af3925c967fbcc570230d326c34d6934/drafter/src/drafter/stasher/cache_clearer.clj#L44-L45

because it lists all files and sorts 16m things, then walks those 16m things to find the last one. It would be better to use a comparator that reverses the ordering in the sort.
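
i.e. something along these lines (purely illustrative data, not the real entries):

;; Illustrative only: finding the most recently accessed entry.
(def entries [{:path "a" :access-time 3}
              {:path "b" :access-time 9}
              {:path "c" :access-time 1}])

;; Current shape (as I read it): sort everything ascending, then walk to the end.
(last (sort-by :access-time entries))

;; Better: sort descending and take the head…
(first (sort-by :access-time #(compare %2 %1) entries))

;; …or skip the sort entirely if only the single newest entry is needed.
(apply max-key :access-time entries)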

RickMoynihan commented 4 years ago

Same interesting inode story (again)

https://swirrl.slack.com/archives/C1TDYMK9Q/p1585819466005400

RickMoynihan commented 2 years ago

Seen this again on ONS's beta.gss-data.org.uk site.

This time it was inodes being 100% consumed by the cache again, rather than space.

$ df -i
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
...
/dev/sdc        6553600 6553600        0  100% /var/lib/pmd-data
...

The solution was to run rm -rf cache/* on the stasher cache, and the site came back up again after a few seconds... It takes a long time to delete the cache; you can maybe free 1% of the inodes every 4 seconds with rm (this isn't a problem, just an observation).