ECONNRESET results in failed cache save/restore

grzleadams commented 1 year ago

We regularly see the following error when either saving or restoring a cache using on-prem Runners (deployed in Kubernetes via ARC):

Warning: Failed to restore: getCacheEntry failed: read ECONNRESET
Cache not found for input keys: <cache_name>

Checking the UI, it's clear the cache exists with the key we're using. Sometimes the save succeeds while the restore fails, or vice versa, with no obvious pattern as to what will happen on any given run. We've ruled out network/firewall issues on our end, and the failures don't correspond with any reported GitHub service outages, so I'm at a loss as to how to resolve this.

It seems like other actions (such as actions/upload-artifact) have dealt with this by implementing retries with exponential backoff. Is there any kind of retry mechanism built into the save/restore functions to deal with transient failures like this in this action?
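
For illustration, the kind of retry with exponential backoff being asked about could look roughly like the sketch below. This is not the code used by actions/cache or actions/upload-artifact. The names `withRetry`, `backoffDelayMs`, and `TRANSIENT_CODES`, the default attempt count, and the delay values are all hypothetical and chosen only for the example.

```javascript
// Hypothetical helper, not part of actions/cache or @actions/cache.
// Error codes treated as transient here are an assumption for this sketch.
const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "EPIPE"]);

// Delay before the next try: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
// `attempt` is 1-based (the attempt that just failed).
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Runs `operation` until it succeeds, a non-transient error occurs,
// or `maxAttempts` tries have been used.
async function withRetry(
  operation,
  { maxAttempts = 5, baseMs = 1000, maxMs = 30000, sleep = defaultSleep } = {}
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      const transient = Boolean(err) && TRANSIENT_CODES.has(err.code);
      // Give up immediately on non-transient errors or on the last attempt.
      if (!transient || attempt === maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt, baseMs, maxMs));
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

A caller would wrap the failing network call, for example `await withRetry(() => someCacheCall(), { maxAttempts: 5 })`, where `someCacheCall` stands in for whatever function performs the request. Making `maxAttempts` a parameter is a choice made for this sketch so that the retry count is not hard-coded.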

grzleadams commented 1 year ago

Some additional info that might be useful (since there are other issues that are sort-of like this, but with different circumstances):

The jobs all run on Linux containers.
The cache is very small (~200KB).
The save/restore failure due to ECONNRESET happens fairly quickly (after about 45s).

grzleadams commented 1 year ago

So, it looks like retries are built in with SystemErrorRetryPolicy. Is the maximum number of retries configurable? In restore/index.js it looks like it's hard-coded to 3.

grzleadams commented 1 year ago

Closing this because it's not caused by the Action.

For posterity: we had an MTU mismatch, and applying the changes described here seems to have resolved the problem.
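
For anyone debugging something similar, a rough way to look for an MTU mismatch from inside a Linux runner container is to read each interface's MTU from `/sys/class/net/<iface>/mtu` and compare it with the MTU the underlying network path can carry. The sketch below is not the fix referenced above. The function names are hypothetical, and the path MTU has to be supplied by you, since it depends on your cluster's network setup.

```javascript
const fs = require("fs");
const path = require("path");

// Reads the MTU of every interface listed under /sys/class/net (Linux only).
// Entries without a readable "mtu" file are skipped.
function readInterfaceMtus(sysNetDir = "/sys/class/net") {
  const mtus = {};
  for (const iface of fs.readdirSync(sysNetDir)) {
    try {
      const raw = fs.readFileSync(path.join(sysNetDir, iface, "mtu"), "utf8");
      mtus[iface] = Number.parseInt(raw.trim(), 10);
    } catch {
      // Not an interface directory, or the file is unreadable: ignore it.
    }
  }
  return mtus;
}

// Returns the names of non-loopback interfaces whose MTU is larger than
// pathMtu, which is the value the caller believes the network path supports.
function interfacesExceedingPathMtu(mtus, pathMtu) {
  return Object.entries(mtus)
    .filter(([name, mtu]) => name !== "lo" && mtu > pathMtu)
    .map(([name]) => name)
    .sort();
}

// Example usage (the value 1450 is an illustrative assumption, not a
// recommendation for any particular cluster):
//   const suspects = interfacesExceedingPathMtu(readInterfaceMtus(), 1450);
//   console.log(suspects);
```

An interface reported by `interfacesExceedingPathMtu` is only a hint that packets larger than the path MTU may be dropped or fragmented. It does not confirm the cause of an ECONNRESET.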

actions / cache

ECONNRESET results in failed cache save/restore #1186