Gudahtt opened 4 months ago
We should contact Bitrise about this. It appears to be an infrastructure error on their end, though that's just a guess.
Edit: I have created a support ticket. Will update here when I hear back.
All of the "Restore cache" steps are apparently considered "skippable" by Bitrise, but in this case our build requires the cache restoration to succeed because of how we're using it.
We should consider marking this step as required somehow, and/or adding a retry or fallback in case it does fail.
This step does have a `retries` option, but it's already set to 3 (the default). I'm not seeing evidence of whether these retries occurred or not. Maybe we should enable verbose logs to get more insight into this.
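If the built-in retries turn out not to help, the retry/fallback idea could be sketched as an explicit wrapper in a script step. This is a hedged sketch: `restore_cache` in the usage comment is a hypothetical placeholder for whatever command performs the restore, and the attempt count and delays are illustrative.

```shell
# Retry a command with exponential backoff; fail loudly if all attempts fail.
retry() {
  attempts=$1; shift
  delay=1
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "Attempt $i/$attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    i=$((i + 1))
  done
  return 1
}

# Usage (hypothetical restore command):
#   retry 3 restore_cache || { echo "cache restore failed" >&2; exit 1; }
```

The advantage over the step's own `retries` input is that we control logging and can attach a fallback (e.g. a full `yarn install`) when all attempts fail.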
We encountered this bug on #9153, and the error happened consistently 9 times in a row (3 times per run, on 3 separate runs). After deleting the cache key that was failing at the restore step and re-running the entire workflow, it succeeded.
This suggests that the cache entry is in a bad state, and causing later failures.
To work around this problem, either delete the problematic cache key and re-run the entire workflow, or push a new commit (both caches that we have found include the commit hash in the cache key).
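To illustrate why pushing a new commit works, assuming (as noted above) that the key embeds the commit hash; the key prefix below is hypothetical, not the real key format:

```shell
# Hedged illustration: a cache key that embeds the commit hash.
# The "node_modules-ios" prefix is hypothetical; our real keys may differ.
commit_key() {
  printf '%s-%s\n' "$1" "$2"
}

# A new commit yields a new key, so the corrupted entry is never read:
commit_key node_modules-ios abc1234   # -> node_modules-ios-abc1234
commit_key node_modules-ios def5678   # -> node_modules-ios-def5678
```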
I enabled verbose logging in a PR, and have an example successful and failing CI run.
Failing: https://app.bitrise.io/app/be69d4368ee7e86d/pipelines/a420d154-4c41-4bde-babf-ba2a6d7912c7
Successful: https://app.bitrise.io/app/be69d4368ee7e86d/pipelines/85de5407-32d2-4f49-b72b-865ac4804b0b
This at least verifies that the `retries: 3` setting is effective. It is retrying 3 times. Whatever error this is results in consistent failure.
Response from Bitrise support available here: https://consensys.slack.com/archives/C02U025CVU4/p1713966058666349?thread_ts=1713562872.462029&cid=C02U025CVU4
We are still caching `node_modules` by commit hash, i.e. using Bitrise caching as a means of sharing data between steps of our workflow. So this is still broken.
The cache step should be used to cache data between runs, not to share data within a workflow. As the linked PR states, that's what `deploy` and `pull` should be used for.
And as for caching between runs, we should be caching the Yarn cache directly rather than `node_modules`, which is how it used to work. This was changed at some point to cache `node_modules` instead so that we could skip the postinstall step, which was quite long. But this is unreliable as well, because it assumes that postinstall scripts operate only on `node_modules`, which is not the case. Any that operate on files outside of that directory would be skipped, potentially leading to failure. And we can't cache `node_modules` and re-run postinstall scripts either, because some scripts are meant to be run only once upon install. The only thing we can cache reliably is the Yarn cache.
This will save CI time, it will never become corrupted, and it will be dramatically smaller in size (the large size was linked to the cache corruption that requires manual clearing). It won't be quite as fast as caching `node_modules`, but the time saved there is lost to flaky runs, unreliable results, cache corruption, general confusion, etc. It's by far not worth it.
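A minimal sketch of what keying the Yarn cache between runs could look like, assuming we key off the lockfile rather than the commit hash so the cache is reused across commits that don't change dependencies. The function name and key format are illustrative, not the actual Bitrise step configuration.

```shell
# Hedged sketch: derive a between-run cache key from yarn.lock.
cache_key() {
  lockfile=$1
  # sha256sum on Linux; shasum -a 256 as a macOS fallback
  hash=$( (sha256sum "$lockfile" 2>/dev/null || shasum -a 256 "$lockfile") | awk '{print $1}')
  printf 'yarn-cache-%s\n' "$hash"
}

# The directory to store under that key is the Yarn cache itself:
#   Yarn 1.x:    yarn cache dir
#   Yarn Berry:  yarn config get cacheFolder
```

With this scheme, two commits with identical dependencies share one cache entry, and a dependency change naturally produces a fresh key.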
Describe the bug
The "Restore cache build" step of our Bitrise CI runs has been intermittently failing with the error: "download failed: failed to download archive: Range request returned invalid Content-Length: 149 however the range was: [...]".
For example: https://app.bitrise.io/build/218004f7-92b4-4dc1-92d0-3ea214b18214
In such cases, we can see that the cache does appear to exist on the Bitrise dashboard and has the correct size.
This is the step where the `node_modules` directory for iOS builds is restored. This directory is cached per-commit during the build job, then restored in later steps. (Edit: We have now also seen the "cache build" step failing in the same way.)
We only see this fail for the iOS cache, which is around 580 MB. The Android cache is ~4x larger, but we haven't seen an example of that failing with this message yet.
Expected behavior
This step should not fail frequently. Rare failures due to Bitrise infrastructure errors are expected, but the failures in this case are very common.
Screenshots/Recordings
No response
Steps to reproduce
No clear steps to reproduce yet.
Error messages or log output
No response
Version
main
Build type
None
Device
N/A
Operating system
Other (please elaborate in the "Additional Context" section)
Additional context
No response
Severity
No response