Closed maleadt closed 2 weeks ago
An easy way to do this would be to have a "cache limit", e.g. 1 GB or whatever you want. At the end of a job, if the total size of the cache is greater than the pre-set limit, you nuke the cache; otherwise you leave the cache alone.
So then most jobs will still benefit from the cache, and we only nuke the cache infrequently.
Of course, we have to make sure we do this at the end of every job, whether or not the job passed.
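A minimal sketch of that end-of-job check, in Python. The function names and the 1 GB default are assumptions made for illustration; nothing here is taken from the plugin itself:

```python
import os
import shutil

def cache_size_bytes(cache_dir):
    """Sum the sizes of all regular files under cache_dir (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def enforce_cache_limit(cache_dir, limit_bytes=1 * 1024**3):
    """Wipe the contents of cache_dir if it has grown past limit_bytes.

    Returns True if the cache was wiped, False if it was left alone.
    Meant to be called at the end of every job, pass or fail.
    """
    if cache_size_bytes(cache_dir) <= limit_bytes:
        return False
    for entry in os.listdir(cache_dir):
        path = os.path.join(cache_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    return True
```

Calling something like `enforce_cache_limit(cache_dir)` from a hook that runs regardless of the job's exit status would cover the "whether or not the job passed" requirement.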
With large jobs, we'd then risk redownloading those files for every job. Maybe a cron job that checks the cache size and, if it exceeds a threshold, cleans up the oldest entries would be a better solution? Not as easy to integrate though.
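As a rough sketch of what such a periodic cleanup could look like, the Python below removes top-level cache entries, oldest modification time first, until the total drops to the threshold. The function names are hypothetical, and using the entry's own mtime as its "age" is an assumption (a directory's mtime only changes when its direct children change, so it is a crude measure):

```python
import os
import shutil

def entry_size_bytes(path):
    """Size of a file or symlink, or the total size of files under a directory."""
    if not os.path.isdir(path) or os.path.islink(path):
        return os.lstat(path).st_size
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total

def evict_oldest(cache_dir, threshold_bytes):
    """Delete the oldest top-level entries until the cache fits the threshold.

    Returns the names of the removed entries, in removal order.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        entries.append((os.lstat(path).st_mtime, name, entry_size_bytes(path)))
    total = sum(size for _mtime, _name, size in entries)

    removed = []
    for _mtime, name, size in sorted(entries):  # oldest first
        if total <= threshold_bytes:
            break
        path = os.path.join(cache_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
        total -= size
        removed.append(name)
    return removed
```

Unlike wiping the whole cache, this keeps the most recently touched entries, at the cost of needing something outside the job (such as cron) to run it.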
> With large jobs, we'd then risk redownloading those files for every job.
I imagine you're thinking about large artifacts here?
Hmmm. What about having two limits, a "soft limit" and a "hard limit" (where the soft limit is strictly less than the hard limit)? You'd set the soft limit relatively low, to something like maybe 1 GB. The hard limit could be much higher, e.g. 20 GB or even higher.
Then, based on the size of the cache at the end of the job:
$DEPOT/registries and $DEPOT/packages directories. But notably, we don't delete any artifacts.
> clean up the oldest entries
I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.
Can you give more of a breakdown for that 189G? How much is packages, artifacts, etc?
> clean up the oldest entries
> I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.
Actually, I guess for artifacts at least, "oldest" and "least recently used" are probably correlated somewhat.
> I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.
Every job does some Pkg interactions, so we can look at the modification date of manifest_usage.toml as a proxy:
tbesard@gpuci4:/home/buildkite/rtx4000/.cache/julia-buildkite-plugin$ stat --printf="%Y,%n\n" depots/*/logs/manifest_usage.toml | sort
1617853685,depots/f32ff181-b4dc-44bc-a814-c84d5a57b537/logs/manifest_usage.toml
1618263085,depots/13269922-905b-43eb-b320-497ed14a4630/logs/manifest_usage.toml
1619698937,depots/8b806e16-5332-4385-9cb4-b4e7611f4407/logs/manifest_usage.toml
1621472319,depots/5197e118-9f00-4a08-90a6-92b164f53cbc/logs/manifest_usage.toml
1622785946,depots/2efef35e-4230-4eac-bebf-be6944f1dafd/logs/manifest_usage.toml
1622881578,depots/6b2494c7-f883-4e90-afd6-c27730937a3f/logs/manifest_usage.toml
1624554218,depots/6b36cfbd-2087-4708-afa6-5b44842f108d/logs/manifest_usage.toml
1624777137,depots/cf2e0b35-7914-4126-9ca2-f67c49269522/logs/manifest_usage.toml
1625000529,depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/logs/manifest_usage.toml
1625735323,depots/35be44f1-0cc0-43a0-8017-dbc23b648d1d/logs/manifest_usage.toml
1625793163,depots/3105e5d3-28f0-4cf0-b90b-02786f04b8f6/logs/manifest_usage.toml
1626461832,depots/dc18a9a2-eed5-4c7e-b514-fdcbd06a5a91/logs/manifest_usage.toml
1626671946,depots/d7371c7e-7c2c-45ee-b838-bbfcb0d5f242/logs/manifest_usage.toml
1626707422,depots/8fb8add9-7eaa-4c04-8daa-9bfbe283579d/logs/manifest_usage.toml
1626722132,depots/7d03d9bf-a71f-49c4-a3ad-3b148c7d678f/logs/manifest_usage.toml
1626778981,depots/121c0c35-6530-4d8c-a6f7-4b1e70e523fa/logs/manifest_usage.toml
1626820595,depots/e859b8c3-5568-49aa-8ceb-b23a1bb4fc53/logs/manifest_usage.toml
1626869035,depots/d4264945-9bae-4dd2-a715-3cee20da2dbf/logs/manifest_usage.toml
1627133829,depots/392153f5-bf0f-4db9-8d44-7f7ff44a36b8/logs/manifest_usage.toml
1627215710,depots/36c2797c-f1dd-4249-8f83-24e632087b32/logs/manifest_usage.toml
1627311589,depots/0f89da95-bc84-4add-8ec2-8b5645d50d93/logs/manifest_usage.toml
1627325737,depots/9c4ff4c4-1e2d-49a4-b1ab-2e8221967d27/logs/manifest_usage.toml
1627370272,depots/99893f3b-a062-4009-96c7-7c68e1eff34a/logs/manifest_usage.toml
1627378993,depots/434649b0-c238-47db-be11-cc2d12bef086/logs/manifest_usage.toml
1627384609,depots/c9f52312-b528-44e4-9501-6d408762012b/logs/manifest_usage.toml
1627390295,depots/f8da2e12-18ea-4414-879b-afc071467714/logs/manifest_usage.toml
1627400150,depots/64dbdc29-d6e3-4071-807c-a2eda6e09bd8/logs/manifest_usage.toml
1627411933,depots/5923eca4-80f3-4fa8-9b76-df98dab39335/logs/manifest_usage.toml
1627422264,depots/3a53e4c4-2499-448a-895e-72e547de0dd0/logs/manifest_usage.toml
1627475734,depots/ea52448d-f230-4619-b27a-2d98107bd215/logs/manifest_usage.toml
1627551569,depots/3cc01fab-3357-4a7a-9294-cde2d3115a97/logs/manifest_usage.toml
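The same ranking can be sketched in Python, which makes it easier to act on than the `stat | sort` listing. The `depots/<id>/logs/manifest_usage.toml` layout is taken from the paths above; the function names and the age cutoff are assumptions for illustration. Note that this ranks whole depots only:

```python
import glob
import os

def depots_by_last_use(cache_root):
    """Return (mtime, depot_dir) pairs, least recently used first.

    The modification time of logs/manifest_usage.toml is used as a proxy
    for the last time a job touched the depot. Depots without that file
    are not listed.
    """
    pattern = os.path.join(cache_root, "depots", "*", "logs", "manifest_usage.toml")
    ranked = []
    for log in glob.glob(pattern):
        depot_dir = os.path.dirname(os.path.dirname(log))
        ranked.append((int(os.path.getmtime(log)), depot_dir))
    return sorted(ranked)

def stale_depots(cache_root, now, max_age_days):
    """Depots whose proxy timestamp is more than max_age_days before now."""
    cutoff = now - max_age_days * 86400
    return [depot for mtime, depot in depots_by_last_use(cache_root) if mtime < cutoff]
```

For example, `stale_depots(cache_root, time.time(), 30)` would list depots that no job has used for roughly a month, which could then be removed or garbage-collected.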
> Can you give more of a breakdown for that 189G? How much is packages, artifacts, etc?
Most are artifacts. A Pkg.gc() presumably would do a lot already.
> I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.
> Every job does some Pkg interactions, so we can look at the modification date of manifest_usage.toml as a proxy: [...]
Each different depot is a different pipeline, right?
So this tells you which pipelines were run most recently, but does it tell you which specific artifacts were used most recently?
> Can you give more of a breakdown for that 189G? How much is packages, artifacts, etc?
> Most are artifacts. A Pkg.gc() presumably would do a lot already.
Yeah, perhaps we can have a post_command that just does Pkg.gc(; collect_delay=Day(2)) or something like that?
I do think we should still have some "worst case" safety net, e.g. if the cache hits 30 GB or whatever, nuke the whole cache. Just as a fallback to make sure we don't get a "run out of space" error again. In an ideal world we never hit this fallback.
Looking at Pkg.jl, I think we don't clean up most artifacts because they are orphaned (since we always clean most of the depot at the start of each job), and the code currently waits for artifacts to have been orphaned for a while before deleting them: https://github.com/JuliaLang/Pkg.jl/blob/33fa5d7fad6bc276326b2ef711f4dcee084438c1/src/API.jl#L784-L786 and https://github.com/JuliaLang/Pkg.jl/blob/33fa5d7fad6bc276326b2ef711f4dcee084438c1/src/API.jl#L724-L729. I guess running GC twice with a very small collect delay would trigger that, but it may also delete too much other stuff. But maybe that's better than deleting the entire depot every couple of weeks?
FWIW, this is still an issue, and gpuci regularly runs out of space. Over the course of a couple of months, the 16 or so Buildkite workers created a 3 TB cache. Most of that is in the several depots, which contain lots of CUDA artifacts that apparently don't get collected.
We really ought to fix this, as Base runners are now also starting to suffer from it (notably the ones that are also part of the juliaecosystem queue). For example, on amdci4 one of the tester agents has a 175 GB julia-buildkite-plugin cache, almost all of which is taken up by depots. The largest depot is around 16 GiB, of which 15.5 GiB is precompilation caches (with Enzyme being the main culprit here).
Over time, lots of files (mostly registries) are downloaded for each unique buildkite pipeline, which takes up quite some space. I just had gpuci3 run out of space due to that, so here's a look at gpuci4:
This is compounded by the fact that some of these machines have several buildkite agents, one for each GPU.
Interestingly, a nontrivial amount of disk space (about 10%) is due to coverage files:
These are written both to the julia_installs dir (which shouldn't happen, see https://github.com/JuliaLang/julia/issues/26573) and to the depots, with most coming from within the packages depot dir.
Bottom line, maybe we should:
.cache