envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Set up proper caching for mobile CI #26970

Open phlax opened 1 year ago

phlax commented 1 year ago

Recently, we have added a lot of caching to Envoy's CI, which has both sped things up and fixed (or seriously mitigated) a number of persistent flakes that have been endemic (optimistically, I would say until ~now)

Essentially there are a few things worth caching:

Regarding the docker image, it is huge - to make loading it faster than pulling from dockerhub in Envoy's CI I have had to load it using tmpfs - any other way would be significantly slower

I held off doing this for a long time because dockerhub can be faster - but now that we have done this it has been a huge improvement to both performance and reliability. Anecdotally, I would say that the azp (Azure Pipelines) cache gets faster the more you use it, whereas the reverse seems to be true with the dockerhub cache (with some caveats)

Likewise with the bazel directories, a huge number of flakes occur just because of transient network issues, and caching these mostly eliminates this problem

While working on this for Envoy, one thing I noticed was that incorrect use of bazel was creating multiple environments; fixing that was a small but sure optimization. Envoy now dumps pretty good fs info at the end of the build, and this is invaluable for understanding what bazel is doing and why it goes wrong - I strongly suggest we do the same for the mobile ci
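As a rough illustration of the kind of fs info dump meant here, the following is a minimal sketch and not what Envoy's CI actually runs. The function names and the `~/.cache/bazel` path are assumptions made for the example; the real bazel directories depend on how bazel is invoked.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # Files can disappear while we walk; skip them.
                pass
    return total


def fs_report(paths):
    """Return one human-readable line per path, marking missing ones."""
    lines = []
    for path in paths:
        if not os.path.isdir(path):
            lines.append(f"{path}: missing")
            continue
        usage = shutil.disk_usage(path)
        size_mib = dir_size_bytes(path) / 2**20
        free_gib = usage.free / 2**30
        lines.append(
            f"{path}: {size_mib:.1f} MiB in tree, {free_gib:.1f} GiB free on filesystem"
        )
    return lines


if __name__ == "__main__":
    # Hypothetical location; substitute the directories the build really uses.
    for line in fs_report([os.path.expanduser("~/.cache/bazel")]):
        print(line)
```

Running something like this as the last step of a job makes it easier to spot when more than one bazel environment has been created, since each one shows up as its own large tree.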

phlax commented 1 year ago

one consideration if we set up the cache is where - with azp you get a free unlimited cache - but for our self-hosted machines it's not nearly as fast as it is for the ms-hosted ones

with github actions - setting up an s3 bucket cache is an option

also not sure what spec the machines have - iiuc they are all ms-hosted ones which means that getting a build image as large as the mobile one into a running docker quickly may be hard - tmpfs may not be an option

if any have spare RAM capacity we could also just mount some of the bazel dirs on tmpfs as we have done for envoy
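one way to check whether a runner has the spare RAM for this is to read `MemAvailable` from `/proc/meminfo` before deciding to mount tmpfs. this is only a sketch: the function names, the 8 GiB tmpfs size and the 4 GiB headroom are made-up values for illustration, not measurements from any of the CI machines.

```python
def mem_available_kib(meminfo_text):
    """Extract MemAvailable (in KiB) from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def tmpfs_fits(meminfo_text, needed_gib, headroom_gib=4.0):
    """True if available RAM covers the tmpfs size plus headroom for the build."""
    available = mem_available_kib(meminfo_text)
    if available is None:
        return False
    # /proc/meminfo reports KiB, so dividing by 2**20 gives GiB.
    return available / 2**20 >= needed_gib + headroom_gib


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        # Not a Linux machine, or /proc is unavailable.
        text = ""
    # 8 GiB is a hypothetical tmpfs size, not the real size of the bazel dirs.
    print("tmpfs looks feasible:", tmpfs_fits(text, needed_gib=8.0))
```

if the check fails the job can fall back to the normal on-disk directories, so the same workflow works on both large and small machines.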

phlax commented 1 year ago

also related to this - i've been working on thinning out the build images used by ci

if mobile ci uses RBE (remote build execution) there is a good chance that it doesn't need llvm, which means we could reduce the mobile build image size significantly

https://github.com/envoyproxy/envoy-build-tools/pull/204

phlax commented 1 year ago

as if to underline this point, the build image has been unavailable for the last few hours

Starting job container
  /usr/bin/docker pull envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757
  Error response from daemon: manifest for envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757 not found: manifest unknown: manifest unknown
  Warning: Docker pull failed with exit code 1, back off 9.731 seconds before retry.
  /usr/bin/docker pull envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757
  Error response from daemon: manifest for envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757 not found: manifest unknown: manifest unknown
  Warning: Docker pull failed with exit code 1, back off 6.28 seconds before retry.
  /usr/bin/docker pull envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757
  Error response from daemon: manifest for envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757 not found: manifest unknown: manifest unknown

https://github.com/envoyproxy/envoy/actions/runs/4805922388/jobs/8554285389?pr=26969#step:2:26

https://hub.docker.com/r/envoyproxy/envoy-build-ubuntu/tags?page=1&name=mobile-

phlax commented 1 year ago

also, ironically, envoy ci is currently failing on one of the only bits that's not yet cached

phlax commented 1 year ago

~i'm pretty confused~

apologies for noise - this was my own stupidity

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.