GoogleCloudPlatform / bank-of-anthos

Retail banking sample application showcasing Kubernetes and Google Cloud
https://cymbal-bank.fsi.cymbal.dev
Apache License 2.0

actions runner: no space left on device #344

Closed murog closed 3 years ago

murog commented 4 years ago

Description

CI intermittently fails on the first actions runner with a No space left on device error.

Current Behavior

https://github.com/GoogleCloudPlatform/bank-of-anthos/actions/runs/238157194

System.IO.IOException: No space left on device
   at System.IO.FileStream.WriteNative(ReadOnlySpan`1 source)
Unhandled exception. System.IO.IOException: No space left on device
   at System.IO.FileStream.FlushWriteBuffer()
   at System.IO.FileStream.Flush(Boolean flushToDisk)
   at System.IO.FileStream.WriteNative(ReadOnlySpan`1 source)
   at System.IO.FileStream.FlushWriteBuffer()
   at System.IO.FileStream.Flush(Boolean flushToDisk)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)

Expected Behavior

CI should have enough available storage to run

murog commented 4 years ago

Looks like disk space is being eaten up by:

I was hoping that docker system prune -a would also clear what is in the overlay dir, but it only freed up ~9.5GB. That temporarily frees up enough space for CI to run, but disk usage is still high (80%).

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   46G  1.3G  98% /
overlay          50G   46G  1.3G  98% /var/lib/docker/overlay2/a21c942...6a33/merged
$ docker system df
TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
Images              53                  3                   9.325GB             3.235GB (34%)
Containers          38                  1                   128.7MB             125.7MB (97%)
Local Volumes       37                  1                   6.474GB             0B (0%)
Build Cache         0                   0                   0B                  0B

The second runner isn't maxed out yet, but it is trending that way:

$ docker system df
TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
Images              130                 2                   7.648GB             4.018GB (52%)
Containers          9                   1                   31.14MB             28.07MB (90%)
Local Volumes       14                  1                   5.927GB             0B (0%)
Build Cache         0                   0                   0B                  0B
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   40G  7.9G  84% /
overlay          50G   40G  7.9G  84% /var/lib/docker/overlay2/304fef...dc83b/merged
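
The df -h readings above suggest a simple guard: check disk usage before a job starts and stop early with a clear message instead of crashing mid-run. Below is a minimal sketch of that idea in POSIX shell. The function names (parse_use_pct, check_disk) and the 80% default threshold are assumptions for illustration and are not part of this repo's CI scripts.

```shell
#!/bin/sh
# Sketch: fail fast when a filesystem on the runner is nearly full.
# Function names and the 80% default are hypothetical, not from the repo.

# Read `df -P` output on stdin and print the usage of the first listed
# filesystem as a bare integer (for example 98 for "98%").
parse_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage of the given path is below the threshold (percent),
# otherwise print a message to stderr and return 1.
check_disk() {
  cd_path="${1:-/}"
  cd_limit="${2:-80}"
  cd_used="$(df -P "$cd_path" | parse_use_pct)"
  if [ "$cd_used" -ge "$cd_limit" ]; then
    echo "disk usage on $cd_path is ${cd_used}% (limit ${cd_limit}%)" >&2
    return 1
  fi
  echo "disk usage on $cd_path is ${cd_used}%"
}
```

A CI step could call something like check_disk / 80 before building images, so a full disk shows up as one readable line rather than a runner stack trace.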

askmeegs commented 4 years ago

Steps I took to clean up on actions runner 1:

  1. Removed the Kind container. I ran docker ps and saw that the Kind control plane was still running (it is no longer needed, since we're not using Kind in deploy tests anymore). I removed the image and removed the volumes with docker volume rm $(docker volume ls -q), then re-ran df -h. This reduced usage by ~15%:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   39G  8.4G  83% /

Then re-ran docker system prune -a but added the -f (force) flag:

Total reclaimed space: 5.887GB

/dev/sda1        50G   33G   15G  70% /

Did the same steps on actions runner 2.

  1. removed kind container + unmounted volumes
  2. docker system prune -a -f -->
Total reclaimed space: 10.15GB

/dev/sda1        50G   32G   16G  68% /
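
The manual Docker cleanup done on both runners could be wrapped in one repeatable function. This is only a sketch under assumptions: the names run, cleanup_docker, and the DRY_RUN variable are made up here, and it swaps in docker volume prune for the docker volume rm $(docker volume ls -q) call used above, since prune only targets volumes that no container is using. Depending on the Docker version, volume prune may skip named volumes unless extra flags are given, so check its behavior on the runner before relying on it.

```shell
#!/bin/sh
# Sketch: the prune steps from this thread as one function.
# run, cleanup_docker, and DRY_RUN are hypothetical names.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

cleanup_docker() {
  # Remove stopped containers, unused networks, and all unused images,
  # without the confirmation prompt (-f).
  run docker system prune -a -f
  # Remove volumes that no container references (assumption: this replaces
  # the blanket `docker volume rm` used in the manual cleanup).
  run docker volume prune -f
}
```

Setting DRY_RUN=1 before calling cleanup_docker prints the two commands without touching Docker, which is a cheap way to review the plan on a shared runner.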

  3. Cleaned up the /tmp directory on both runners. From root, I ran sudo du -h --max-depth=1 and saw that the /tmp directory was taking up 15G, due to a set of 1GB image-tar files created from March to present. I'm not sure where these are coming from; maybe a tool install or upgrade? I opened a new issue to figure out where these .tars come from, so we can auto-remove them in our scripts in the future.
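
Until the source of those tar files is known, an age-based sweep of /tmp is one possible stopgap. The sketch below assumes the files match *.tar and that anything untouched for more than a week is safe to delete; neither assumption is confirmed in this thread, and clean_old_tars is a hypothetical name.

```shell
#!/bin/sh
# Sketch: delete stale tar files from a directory.
# Assumptions (not confirmed by the thread): files match *.tar and are safe
# to remove once they have not been modified for more than DAYS whole days.

clean_old_tars() {
  ct_dir="${1:-/tmp}"
  ct_days="${2:-7}"
  # -maxdepth 1 keeps the sweep to the top level of the directory;
  # -mtime +N matches files whose age in whole days is greater than N;
  # -print lists each file as it is deleted.
  find "$ct_dir" -maxdepth 1 -type f -name '*.tar' -mtime +"$ct_days" -print -delete
}
```

For example, clean_old_tars /tmp 7 would remove week-old top-level .tar files and print their paths. Dropping -delete from the find command turns it into a listing-only check.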

Now, actions runner 1:

/dev/sda1        50G   19G   29G  40% /

Runner 2:

/dev/sda1        50G   19G   29G  39% /

askmeegs commented 3 years ago

Closing this in favor of #351 (automate cleaning up the runner disks) - to address at a later date.