JuliaCI / julia-buildkite-plugin

Buildkite plugin to install Julia for use in a pipeline.

Disk usage #17

Closed: maleadt closed this issue 2 weeks ago

maleadt commented 3 years ago

Over time, lots of files (mostly registries) are downloaded for each unique buildkite pipeline, which takes up quite some space. I just had gpuci3 run out of space due to that, so here's a look at gpuci4:

tbesard@gpuci4:/home/buildkite/rtx4000/.cache$ df -h julia-buildkite-plugin
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       916G  227G  643G  27% /
tbesard@gpuci4:/home/buildkite/rtx4000/.cache$ du -hs julia-buildkite-plugin
189G    julia-buildkite-plugin

This is compounded by the fact that some of these machines have several buildkite agents, one per GPU.

Interestingly, a nontrivial amount of disk space (about 10%) is due to coverage files:

tbesard@gpuci4:/home/buildkite/rtx4000/.cache$ find julia-buildkite-plugin -type f -name "*.cov" -printf "%s\n" | gawk -M '{t+=$1}END{print t}'
12774282171
// or 12GB

These are written in the julia_installs dir (which shouldn't happen, https://github.com/JuliaLang/julia/issues/26573), but most come from the packages directory within the depots.

Bottom line, maybe we should:

DilumAluthge commented 3 years ago

An easy way to do this would be to have a "cache limit", e.g. 1 GB or whatever you want. At the end of a job, if the total size of the cache is greater than the pre-set limit, you nuke the cache; otherwise you leave the cache alone.

So then most jobs will still benefit from the cache, and we only nuke the cache infrequently.

Of course, we have to make sure we do this at the end of every job, whether or not the job passed.
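
A minimal sketch of that single-limit idea as a shell snippet that could run at the end of every job; the cache path and the 1 GB limit are placeholders, not actual plugin options:

# Hypothetical end-of-job cleanup: nuke the whole cache once it exceeds a fixed limit.
CACHE_DIR="${HOME}/.cache/julia-buildkite-plugin"   # assumed cache location
LIMIT_KB=$((1 * 1024 * 1024))                       # 1 GB, in KiB

USED_KB=$(du -sk "${CACHE_DIR}" | cut -f1)
if [ "${USED_KB}" -gt "${LIMIT_KB}" ]; then
    rm -rf "${CACHE_DIR}"
fi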

maleadt commented 3 years ago

With large jobs, we'd then risk redownloading those files for every job. Maybe a better solution is a cron job that checks the cache size and, if it exceeds a threshold, cleans up the oldest entries? Not as easy to integrate, though.

DilumAluthge commented 3 years ago

With large jobs, we'd then risk redownloading those files for every job.

I imagine you're thinking about large artifacts here?

Hmmm. What about having two limits, a "soft limit" and a "hard limit" (where the soft limit is strictly less than the hard limit)? You'd set the soft limit relatively low, to something like maybe 1 GB. The hard limit could be much higher, e.g. 20 GB or even higher.

Then, based on the size of the cache at the end of the job:

  1. If the cache is below the soft limit, do nothing.
  2. If the cache is above the soft limit but below the hard limit, only delete certain files and directories. For example, maybe we delete only the $DEPOT/registries and $DEPOT/packages directories. But notably, we don't delete any artifacts.
  3. If the cache is above the hard limit, nuke the entire cache.
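
A rough sketch of that two-tier policy in shell; the limits, the cache path, and the per-pipeline depot variable are all illustrative assumptions:

# Illustrative soft/hard limit cleanup at the end of a job; paths and limits are assumptions.
CACHE_DIR="${HOME}/.cache/julia-buildkite-plugin"
DEPOT="${CACHE_DIR}/depots/${BUILDKITE_PIPELINE_ID}"    # hypothetical per-pipeline depot path
SOFT_KB=$((1 * 1024 * 1024))    # soft limit: 1 GB
HARD_KB=$((20 * 1024 * 1024))   # hard limit: 20 GB

USED_KB=$(du -sk "${CACHE_DIR}" | cut -f1)
if [ "${USED_KB}" -gt "${HARD_KB}" ]; then
    rm -rf "${CACHE_DIR}"                               # 3. over the hard limit: nuke everything
elif [ "${USED_KB}" -gt "${SOFT_KB}" ]; then
    rm -rf "${DEPOT}/registries" "${DEPOT}/packages"    # 2. over the soft limit: keep artifacts
fi                                                      # 1. below the soft limit: do nothing
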
DilumAluthge commented 3 years ago

clean up the oldest entries

I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.

DilumAluthge commented 3 years ago

Can you give more of a breakdown for that 189G? How much is packages, artifacts, etc?

DilumAluthge commented 3 years ago

clean up the oldest entries

I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.

Actually, I guess for artifacts at least, "oldest" and "least recently used" are probably correlated somewhat.

maleadt commented 3 years ago

I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.

Every job does some Pkg interactions, so we can look at the modification date of manifest_usage.toml as a proxy:

tbesard@gpuci4:/home/buildkite/rtx4000/.cache/julia-buildkite-plugin$ stat --printf="%Y,%n\n" depots/*/logs/manifest_usage.toml | sort
1617853685,depots/f32ff181-b4dc-44bc-a814-c84d5a57b537/logs/manifest_usage.toml
1618263085,depots/13269922-905b-43eb-b320-497ed14a4630/logs/manifest_usage.toml
1619698937,depots/8b806e16-5332-4385-9cb4-b4e7611f4407/logs/manifest_usage.toml
1621472319,depots/5197e118-9f00-4a08-90a6-92b164f53cbc/logs/manifest_usage.toml
1622785946,depots/2efef35e-4230-4eac-bebf-be6944f1dafd/logs/manifest_usage.toml
1622881578,depots/6b2494c7-f883-4e90-afd6-c27730937a3f/logs/manifest_usage.toml
1624554218,depots/6b36cfbd-2087-4708-afa6-5b44842f108d/logs/manifest_usage.toml
1624777137,depots/cf2e0b35-7914-4126-9ca2-f67c49269522/logs/manifest_usage.toml
1625000529,depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/logs/manifest_usage.toml
1625735323,depots/35be44f1-0cc0-43a0-8017-dbc23b648d1d/logs/manifest_usage.toml
1625793163,depots/3105e5d3-28f0-4cf0-b90b-02786f04b8f6/logs/manifest_usage.toml
1626461832,depots/dc18a9a2-eed5-4c7e-b514-fdcbd06a5a91/logs/manifest_usage.toml
1626671946,depots/d7371c7e-7c2c-45ee-b838-bbfcb0d5f242/logs/manifest_usage.toml
1626707422,depots/8fb8add9-7eaa-4c04-8daa-9bfbe283579d/logs/manifest_usage.toml
1626722132,depots/7d03d9bf-a71f-49c4-a3ad-3b148c7d678f/logs/manifest_usage.toml
1626778981,depots/121c0c35-6530-4d8c-a6f7-4b1e70e523fa/logs/manifest_usage.toml
1626820595,depots/e859b8c3-5568-49aa-8ceb-b23a1bb4fc53/logs/manifest_usage.toml
1626869035,depots/d4264945-9bae-4dd2-a715-3cee20da2dbf/logs/manifest_usage.toml
1627133829,depots/392153f5-bf0f-4db9-8d44-7f7ff44a36b8/logs/manifest_usage.toml
1627215710,depots/36c2797c-f1dd-4249-8f83-24e632087b32/logs/manifest_usage.toml
1627311589,depots/0f89da95-bc84-4add-8ec2-8b5645d50d93/logs/manifest_usage.toml
1627325737,depots/9c4ff4c4-1e2d-49a4-b1ab-2e8221967d27/logs/manifest_usage.toml
1627370272,depots/99893f3b-a062-4009-96c7-7c68e1eff34a/logs/manifest_usage.toml
1627378993,depots/434649b0-c238-47db-be11-cc2d12bef086/logs/manifest_usage.toml
1627384609,depots/c9f52312-b528-44e4-9501-6d408762012b/logs/manifest_usage.toml
1627390295,depots/f8da2e12-18ea-4414-879b-afc071467714/logs/manifest_usage.toml
1627400150,depots/64dbdc29-d6e3-4071-807c-a2eda6e09bd8/logs/manifest_usage.toml
1627411933,depots/5923eca4-80f3-4fa8-9b76-df98dab39335/logs/manifest_usage.toml
1627422264,depots/3a53e4c4-2499-448a-895e-72e547de0dd0/logs/manifest_usage.toml
1627475734,depots/ea52448d-f230-4619-b27a-2d98107bd215/logs/manifest_usage.toml
1627551569,depots/3cc01fab-3357-4a7a-9294-cde2d3115a97/logs/manifest_usage.toml
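
Building on that proxy, a cron-style cleanup could sort depots by this timestamp and drop the least recently used ones until the cache fits under a threshold. A rough sketch, where the cache path and the limit are assumptions:

# Sketch: delete the least-recently-used depots, judged by manifest_usage.toml mtime,
# until the cache drops below MAX_KB. Path and limit are assumptions.
CACHE_DIR="/home/buildkite/rtx4000/.cache/julia-buildkite-plugin"
MAX_KB=$((100 * 1024 * 1024))   # 100 GB

stat --printf="%Y %n\n" "${CACHE_DIR}"/depots/*/logs/manifest_usage.toml | sort -n |
while read -r mtime usage; do
    [ "$(du -sk "${CACHE_DIR}" | cut -f1)" -le "${MAX_KB}" ] && break
    rm -rf "$(dirname "$(dirname "${usage}")")"   # remove the whole depot
done
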
maleadt commented 3 years ago

Can you give more of a breakdown for that 189G? How much is packages, artifacts, etc?

Most are artifacts. A Pkg.gc() presumably would do a lot already.

DilumAluthge commented 3 years ago

I don't know if "oldest" is the best criterion here... presumably we want something like "least recently used", but I'm not sure we have a way of determining that.

Every job does some Pkg interactions, so we can look at the modification date of manifest_usage.toml as a proxy (stat output quoted above).

Each different depot is a different pipeline, right?

So this tells you which pipelines were run most recently, but does it tell you which specific artifacts were used most recently?

DilumAluthge commented 3 years ago

Can you give more of a breakdown for that 189G? How much is packages, artifacts, etc?

Most are artifacts. A Pkg.gc() presumably would do a lot already.

Yeah perhaps we can have a post_command that just does Pkg.gc(Day(2)) or something like that?

I do think we should still have some "worst case" safety net, e.g. if the cache hits 30 GB or whatever, nuke the whole cache. Just as a fallback to make sure we don't get a "run out of space" error again. In an ideal world we never hit this fallback.
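
A sketch of what such a post_command hook might look like, combining the Pkg.gc() call with the hard-limit fallback; the 30 GB figure and the cache path are placeholders, and it assumes JULIA_DEPOT_PATH still points at the job's depot when the hook runs:

# Hypothetical post-command hook: garbage-collect the job's depot, then nuke the
# whole cache if it is still above a hard limit. Paths and limits are assumptions.
julia -e 'using Pkg, Dates; Pkg.gc(; collect_delay=Dates.Day(2))'

CACHE_DIR="${HOME}/.cache/julia-buildkite-plugin"
HARD_KB=$((30 * 1024 * 1024))   # 30 GB fallback
if [ "$(du -sk "${CACHE_DIR}" | cut -f1)" -gt "${HARD_KB}" ]; then
    rm -rf "${CACHE_DIR}"
fi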

maleadt commented 3 years ago

Looking at Pkg.jl, I think most artifacts don't get cleaned up because they are orphaned (since we always clean most of the depot at the start of each job), and the code currently waits until artifacts have been orphaned for a while before deleting them: https://github.com/JuliaLang/Pkg.jl/blob/33fa5d7fad6bc276326b2ef711f4dcee084438c1/src/API.jl#L784-L786 https://github.com/JuliaLang/Pkg.jl/blob/33fa5d7fad6bc276326b2ef711f4dcee084438c1/src/API.jl#L724-L729. I guess running GC twice with a very small collect delay would trigger that, but it may also delete too much other stuff. Then again, maybe that's better than deleting the entire depot every couple of weeks?
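
Concretely, that double GC pass could look like the following; forcing two passes with a near-zero collect_delay is one reading of the linked Pkg.jl code, not a documented workflow:

# First pass marks now-unreferenced artifacts as orphaned; second pass, with an
# (almost) zero grace period, actually collects them. Sketch only.
julia -e 'using Pkg, Dates
          Pkg.gc()                                  # record freshly orphaned artifacts
          Pkg.gc(; collect_delay=Dates.Second(0))   # collect them immediately'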

maleadt commented 8 months ago

FWIW, this is still an issue, and gpuci regularly runs out of space. Over the course of a couple of months, the 16 or so buildkite workers created a 3TB cache. Most of that is in the various depots, which contain lots of CUDA artifacts that apparently don't get collected.

maleadt commented 2 weeks ago

We really ought to fix this, as Base runners are now also starting to suffer from it (notably the ones that are also part of the juliaecosystem queue). For example, on amdci4 one of the tester agents has a 175GB julia-buildkite-plugin cache, almost all of which is taken up by depots; the largest depot is around 16GiB, of which 15.5GiB is precompilation caches (with Enzyme being the main culprit).