falcondev-oss / github-actions-cache-server

Self-hosted GitHub Actions cache server implementation. Compatible with official 'actions/cache' action
https://gha-cache-server.falcondev.io
MIT License

Malformed cache-entries when running multiple replicas #65

Open Kleinkind opened 1 day ago

Kleinkind commented 1 day ago

Hello 👋

We get the following error during cache-load in our jobs:

/*stdin*\ : Read error (39) : premature end 
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
Warning: Failed to restore: "/usr/bin/tar" failed with error: The process '/usr/bin/tar' failed with exit code 2

The error is similar to those reported in https://github.com/falcondev-oss/github-actions-cache-server/issues/54, but it differs in detail and, at least from my testing, seems to have a different cause.

Some information on our setup:

As we only face these problems on some of our projects I tested around to narrow down the cause. These are my findings:

It seems like the stored cache-archive is actually incomplete

It seems to be related to the replicated setup

It seems to be related to cache size / multipart uploads

My suspicion is that this is caused by the in-memory uploadFileBuffers (https://github.com/falcondev-oss/github-actions-cache-server/blob/dev/lib/storage/index.ts#L38), which might lead to problems when a cache upload is chunked and the chunks are sent via different replicas of the cache server.
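To illustrate the suspected failure mode, here is a minimal sketch. It is a hypothetical model, not the project's actual code: the `Replica` class, `uploadChunk`, `commit` and `simulateUpload` are names made up for this example, and it assumes a load balancer that distributes chunk requests round-robin and routes the final commit to the first replica.

```typescript
// Hypothetical model (not the real implementation): every replica keeps
// the chunks of in-progress uploads only in its own process memory.
type Chunk = { offset: number; data: string }

class Replica {
  // uploadId -> chunks received by THIS replica only
  private buffers: Record<number, Chunk[]> = {}

  // Store one chunk of a multipart upload together with its byte offset.
  uploadChunk(uploadId: number, offset: number, data: string): void {
    const chunks = this.buffers[uploadId] || []
    chunks.push({ offset, data })
    this.buffers[uploadId] = chunks
  }

  // Assemble whatever this replica has seen, ordered by offset.
  commit(uploadId: number): string {
    const chunks = this.buffers[uploadId] || []
    return chunks
      .slice()
      .sort((a, b) => a.offset - b.offset)
      .map((c) => c.data)
      .join('')
  }
}

// Assumption: chunks are spread round-robin across replicas and the
// commit request happens to land on the first replica.
function simulateUpload(replicaCount: number, parts: string[]): string {
  const replicas: Replica[] = []
  for (let i = 0; i < replicaCount; i++) replicas.push(new Replica())

  let offset = 0
  parts.forEach((part, i) => {
    replicas[i % replicaCount].uploadChunk(1, offset, part)
    offset += part.length
  })
  return replicas[0].commit(1)
}

const parts = ['AAAA', 'BBBB', 'CCCC', 'DDDD']
console.log(simulateUpload(1, parts)) // complete archive
console.log(simulateUpload(2, parts)) // only the chunks replica 0 received
```

In this model a single replica reassembles all 16 bytes, while with two replicas the committing replica only holds half of the chunks. A truncated archive like that would be consistent with the "Unexpected EOF in archive" error from tar shown above.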

I understand that this might be somewhat of a niche problem, because there is no documented way or example of running the cache server in a replicated setup. We have also scaled down to a single instance as a workaround for now. But I expect that with the addition of a Helm chart to this repo (https://github.com/falcondev-oss/github-actions-cache-server/pull/58), which enables autoscaling and running multiple replicas, more people will face the same problem.

asutosh23 commented 1 day ago

Hey @Kleinkind, I am trying to set up the cache server and failing at it. I tried to contact you but couldn't find any channel for it, so I'm commenting here; I know it's not the usual way to communicate with another engineer. I'm kind of in a hurry, so if you have some time, could you please help me with the setup?

matteovivona commented 1 day ago

@Kleinkind I found several issues with the proposed helm chart on that PR https://github.com/falcondev-oss/github-actions-cache-server/pull/58, but I think it could be a good starting point.

Anyway, I'm currently running the cache server with 2 replicas and persistentVolume disabled. This is because I noticed that the app only uses the ephemeral volume mounted under /tmp and not the one mounted at /app/.data. Even with just 1 replica, I still encounter the same error as in issue https://github.com/falcondev-oss/github-actions-cache-server/issues/54. I believe the different hashes are partially to blame, but I don't think it's solely a multi-replica issue.

asutosh23 commented 1 day ago

Hey @matteovivona, I need a little help setting up the cache server. My runners are not picking up the server URL. I followed all the steps from the docs and tried to debug a lot, but couldn't find what's wrong.

Could you please help me with it? How can I contact you?

Kleinkind commented 1 day ago

> Hey @Kleinkind, I am trying to set up the cache server and failing at it. I tried to contact you but couldn't find any channel for it, so I'm commenting here; I know it's not the usual way to communicate with another engineer. I'm kind of in a hurry, so if you have some time, could you please help me with the setup?

I am not a maintainer of this project and your comment has nothing to do with this issue. If you are facing problems I think the best way is to create an issue on your own and ask for help there.

asutosh23 commented 1 day ago

okay, creating an issue then

LouisHaftmann commented 1 day ago

Thanks for reporting and debugging this! I'm a bit short on time right now but I'll take a look this weekend. @Kleinkind

LouisHaftmann commented 1 day ago

> okay, creating an issue then

pls open a new issue 🙏