Open phlax opened 1 year ago
one consideration if we set up the cache is where to host it - with azp you get a free unlimited cache - but for our self-hosted machines it's not nearly as fast as it is for the ms-hosted ones
with github actions - setting up an s3 bucket cache is an option
also not sure what spec the machines have - iiuc they are all ms-hosted ones which means that getting a build image as large as the mobile one into a running docker quickly may be hard - tmpfs may not be an option
if any have spare RAM capacity we could also just mount some of the bazel dirs as tmpfs, as we have done for envoy
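a rough sketch of how a job could check for spare RAM before deciding to mount anything as tmpfs. this is not what envoy ci does: the function names, the `headroom_gib` default and the sizes are all hypothetical, and it assumes a linux runner exposing `MemAvailable` in `/proc/meminfo`

```python
def mem_available_kib(meminfo_text: str) -> int:
    """Return the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Lines look like: "MemAvailable:   13000000 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def tmpfs_fits(meminfo_text: str, tmpfs_gib: float, headroom_gib: float = 4.0) -> bool:
    """True if a tmpfs of tmpfs_gib would fit while leaving headroom_gib for the build.

    headroom_gib is a made-up safety margin (assumption), not a measured number.
    """
    needed_kib = (tmpfs_gib + headroom_gib) * 1024 * 1024
    return mem_available_kib(meminfo_text) >= needed_kib
```

on a runner this would be fed the contents of `/proc/meminfo`, eg `tmpfs_fits(open("/proc/meminfo").read(), tmpfs_gib=8)`, and the job would fall back to normal disk if it returns `False`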
also related to this - i've been working on thinning out the build images used by ci
if mobile ci uses RBE there is a good chance that it doesn't need llvm, which means we could reduce the mobile build image size significantly
as if to underline this point, the build image has been unavailable for the last few hours
Starting job container
/usr/bin/docker pull envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757
Error response from daemon: manifest for envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757 not found: manifest unknown: manifest unknown
Warning: Docker pull failed with exit code 1, back off 9.731 seconds before retry.
/usr/bin/docker pull envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757
Error response from daemon: manifest for envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757 not found: manifest unknown: manifest unknown
Warning: Docker pull failed with exit code 1, back off 6.28 seconds before retry.
/usr/bin/docker pull envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757
Error response from daemon: manifest for envoyproxy/envoy-build-ubuntu:mobile-321658b6b120abda6869f89fac275f59bf3b1e757 not found: manifest unknown: manifest unknown
https://github.com/envoyproxy/envoy/actions/runs/4805922388/jobs/8554285389?pr=26969#step:2:26
https://hub.docker.com/r/envoyproxy/envoy-build-ubuntu/tags?page=1&name=mobile-
also, ironically, envoy ci is currently failing on one of the only bits that's not yet cached
~im pretty confused~
apologies for noise - this was my own stupidity
Recently, we have added a bunch of caching to Envoy's CI which has both sped things up and fixed (or seriously mitigated) a load of persistent flakes that have been endemic (optimistically i would say until ~now)
Essentially there are a few things worth caching:
Regarding the docker image: it is huge. To make loading it from cache faster than pulling from dockerhub in Envoy's CI i have had to load it using tmpfs - any other way would be significantly slower
I held off doing this for a long time because dockerhub can be faster - but now that we have done this it has been a huge improvement to both performance and reliability. Anecdotally i would say that the azp cache gets faster the more you use it whereas the reverse seems to be true with the dockerhub cache (with some caveats)
Likewise with the bazel directories, a huge number of flakes occur just because of transient network issues, and caching these mostly eliminates this problem
While working on this for Envoy one thing i noticed was that incorrect use of bazel was creating multiple environments; fixing that was a small but sure optimization. Envoy now dumps pretty good fs info at the end of the build, and this is invaluable for understanding what bazel is doing and why it goes wrong - i strongly suggest we do the same for the mobile ci
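To give an idea of the kind of fs dump meant here, a minimal sketch follows. It is not Envoy's actual script: `dir_size_bytes`, `fs_report` and the report format are hypothetical, and the paths to inspect (eg the bazel output dirs) would need to be filled in for the mobile ci.

```python
import os
import shutil


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file vanished or is unreadable; ignore it in the report.
                pass
    return total


def fs_report(paths):
    """Return one summary line per path: its size plus usage of its filesystem."""
    lines = []
    for path in paths:
        if not os.path.isdir(path):
            lines.append(f"{path}: missing")
            continue
        usage = shutil.disk_usage(path)
        lines.append(
            f"{path}: dir={dir_size_bytes(path)}B "
            f"fs_used={usage.used}B fs_free={usage.free}B"
        )
    return lines
```

Run as a final "always" step of the job and printed with `print("\n".join(fs_report([...])))`, something like this would show which directories exist and how large they grew, which is usually enough to spot a second, unintended bazel environment.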