MetaMask / metamask-mobile

Mobile web browser providing access to websites that use the Ethereum blockchain
https://metamask.io

[Bug]: Intermittent "Restore cache build" failure #9345

Open Gudahtt opened 4 months ago

Gudahtt commented 4 months ago

Describe the bug

The "Restore cache build" step of our Bitrise CI runs has been intermittently failing with the error: "download failed: failed to download archive: Range request returned invalid Content-Length: 149 however the range was: [...]".

For example: https://app.bitrise.io/build/218004f7-92b4-4dc1-92d0-3ea214b18214

In such cases, we can see that the cache does appear to exist on the Bitrise dashboard and has the correct size.

This is the step where the node_modules directory for iOS builds is restored. This directory is cached per-commit during the build job, then restored in later steps.

(Edit: We have now also seen the "cache build" step failing in the same way)

We only see this fail for the iOS cache, which is around 580 MB. The Android cache is ~4x larger, but we haven't seen an example of that failing with this message yet.
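For illustration, the setup looks roughly like this (a sketch only; the workflow names, step IDs, and key template are assumptions based on the Bitrise docs rather than our actual bitrise.yml):

```yaml
workflows:
  build_ios:
    steps:
    - save-cache@1:
        title: Cache build
        inputs:
        - key: node_modules-ios-{{ .CommitHash }}  # per-commit key (assumed template)
        - paths: node_modules
  smoke_ios:
    steps:
    - restore-cache@1:
        title: Restore cache build
        inputs:
        - key: node_modules-ios-{{ .CommitHash }}  # must match the key saved above
```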

Expected behavior

This step should not fail frequently. Occasional failures due to Bitrise infrastructure errors are expected, but the failures in this case are very common.

Screenshots/Recordings

No response

Steps to reproduce

No clear steps to reproduce yet.

Error messages or log output

No response

Version

main

Build type

None

Device

N/A

Operating system

Other (please elaborate in the "Additional Context" section)

Additional context

No response

Severity

No response

Gudahtt commented 4 months ago

We should contact Bitrise about this. It appears to be an infrastructure error on their end, though that's just a guess.

Edit: I have created a support ticket. Will update here when I hear back.

Gudahtt commented 4 months ago

All of the "Restore cache" steps are apparently considered "skippable" by Bitrise, but in this case our build requires the cache restoration to succeed because of how we're using it.

We should consider marking this step as required somehow, and/or adding a retry or fallback in case it does fail.
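For example, one possible fallback is a script step after the restore that reinstalls dependencies if node_modules is missing (a hypothetical sketch; the step IDs and the install command are assumptions, not our current config):

```yaml
- restore-cache@1:
    inputs:
    - key: node_modules-ios-{{ .CommitHash }}
- script@1:
    title: Fall back to a fresh install if the cache was not restored
    inputs:
    - content: |
        #!/usr/bin/env bash
        set -euo pipefail
        if [ ! -d node_modules ]; then
          echo "Cache restore produced no node_modules; reinstalling"
          yarn install
        fi
```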

Gudahtt commented 4 months ago

This step does have a retries option, but it's already set to 3 (the default). I'm not seeing evidence of whether these retries occurred or not. Maybe we should enable verbose logging to get more insight into this.
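For reference, both options are inputs on the restore step, roughly as follows (the input names are assumed from the Bitrise docs):

```yaml
- restore-cache@1:
    inputs:
    - key: node_modules-ios-{{ .CommitHash }}
    - retries: 3      # already the default
    - verbose: true   # extra logging, to see whether the retries actually happen
```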

Gudahtt commented 4 months ago

We encountered this bug on #9153, and the error happened consistently 9 times in a row (3 times per run, on 3 separate runs). After deleting the cache key that was failing at the restore step and re-running the entire workflow, it succeeded.

This suggests that the cache entry is in a bad state, and causing later failures.

To work around this problem, either delete the problematic cache key and re-run the entire workflow, or push a new commit (both of the affected caches we have found include the commit hash in the cache key).

Gudahtt commented 4 months ago

I enabled verbose logging in a PR, and have an example successful and failing CI run.

Failing: https://app.bitrise.io/app/be69d4368ee7e86d/pipelines/a420d154-4c41-4bde-babf-ba2a6d7912c7
Successful: https://app.bitrise.io/app/be69d4368ee7e86d/pipelines/85de5407-32d2-4f49-b72b-865ac4804b0b

This at least verifies that the retries: 3 setting is effective; the step is retried 3 times. Whatever this error is, it results in consistent failure.

gauthierpetetin commented 4 months ago

Response from Bitrise support available here: https://consensys.slack.com/archives/C02U025CVU4/p1713966058666349?thread_ts=1713562872.462029&cid=C02U025CVU4

Gudahtt commented 1 month ago

We are still caching node_modules by commit hash, i.e. using Bitrise caching as a means of sharing data between steps of our workflow. So this is still broken.

Gudahtt commented 1 month ago

The cache step should be used to cache data between runs, not to share data within a workflow. As the linked PR states, that's what the deploy and pull steps should be used for.
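For illustration, sharing files between stages of a pipeline with deploy/pull looks roughly like this (a sketch; the step IDs and input names are taken from the Bitrise docs and may not match exactly):

```yaml
workflows:
  build:
    steps:
    - deploy-to-bitrise-io@2:
        inputs:
        # "<path>:<env key>" pairs exposed to later stages (assumed format)
        - pipeline_intermediate_files: node_modules:NODE_MODULES_DIR
  test:
    steps:
    - pull-intermediate-files@1:
        inputs:
        - artifact_sources: build   # pull what the "build" stage deployed (assumed format)
```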

And as for caching between runs, we should be caching the Yarn cache directly rather than node_modules, which is how it used to work. This was changed at some point to cache node_modules instead so that we could skip the postinstall step, which was quite long. But this is unreliable as well, because it assumes that postinstall scripts operate only on node_modules, which is not the case. Any that operate on files outside of that directory would be skipped, potentially leading to failures. And we can't cache node_modules and re-run postinstall scripts either, because some scripts are meant to be run only once upon install. The only thing we can cache reliably is the Yarn cache.
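A rough sketch of what that could look like, keying the cache on the lockfile (the step IDs, key template, and Yarn cache path are assumptions, not our actual config):

```yaml
- restore-cache@1:
    inputs:
    - key: yarn-cache-{{ checksum "yarn.lock" }}
- script@1:
    title: Install dependencies (postinstall scripts run on every build)
    inputs:
    - content: yarn install
- save-cache@1:
    inputs:
    - key: yarn-cache-{{ checksum "yarn.lock" }}
    - paths: .yarn/cache   # or wherever the Yarn cache directory lives for our Yarn version
```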

This will still save CI time, it will never become corrupted, and it will be dramatically smaller in size (the large size was linked to the cache corruption that requires manual clearing). It won't be quite as fast as caching node_modules, but the time we're saving there is lost again in flaky runs, unreliable results, cache corruption, general confusion, etc. It's far from worth it.