carvel-dev / vendir

Easy way to vendor portions of git repos, github releases, helm charts, docker image contents, etc. declaratively
https://carvel.dev/vendir
Apache License 2.0

Vendir fails with a mysterious error #193

Closed Zebradil closed 1 year ago

Zebradil commented 1 year ago

I have a CI job which runs multiple (23 currently) parallel processes running vendir.

Almost every time I can see the following error:

vendir: Error: Syncing directory 'vendor':
  Syncing directory 'chart' with helm chart contents:
    Add helm chart repository: signal: killed (stderr: )

It appears sporadically and is not related to a particular configuration vendir is running against.

I haven't been able to reproduce it locally yet, and debugging inside a CI job is a bit problematic. So I hope someone has an idea of where to look for the root cause, or which tools could help with debugging (vendir doesn't seem to have a --debug flag or anything similar).

joaopapereira commented 1 year ago

Hey @Zebradil, when you say 23 parallel processes currently, do they all execute vendir?

That particular error happens when vendir is executing helm repo add vendir-unused your.chart.url and apparently helm is not providing any stderr output. Is it possible that helm is failing when trying to add repositories in parallel?

If you want, you can reach me (@João Pereira) in the Kubernetes Slack, and I can send you a version of vendir that also prints the stdout of the command, to see if it is helpful.

Zebradil commented 1 year ago

Hi @joaopapereira,

Thank you for the quick response.

When you say 23 parallel processes currently, do they all execute vendir?

It's a bash script which loops through a set of environments and runs a function for each of them. Each invocation of this function is sent to the background via the `&` operator. The function runs vendir at some point.

That means it can run up to 23 parallel vendir processes, in theory. In practice, since running vendir takes up most of the function's run time, there are definitely many vendir processes running at the same time, even though they aren't all started simultaneously.
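The pattern described above can be sketched roughly like this (the function and environment names are hypothetical placeholders, not the real script):

```shell
#!/bin/sh
# Hypothetical sketch of the CI script's structure: one background
# job per environment, all running concurrently via `&`.
sync_env() {
  # placeholder for the real work, which eventually runs `vendir sync`
  echo "synced $1"
}

for env in dev staging prod; do
  sync_env "$env" &   # fire off each environment in the background
done
wait                   # block until every background job has finished
```

With 23 environments, a loop like this launches up to 23 concurrent jobs, which is where the memory pressure would come from.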

Is it possible that helm is failing when trying to add repositories in parallel?

This looks like a good point for investigation. I'll need to look into the source code to see how exactly vendir works with helm. (As a side note, this could also help us improve the duration of this step, since downloading our 30 MB helm index for every vendir process takes significant time.)

joaopapereira commented 1 year ago

https://github.com/vmware-tanzu/carvel-vendir/blob/develop/pkg/vendir/fetch/helmchart/http_source.go#L115 is the line where the error happens, and if you look 15 lines up, you can see where we prepare the command I mentioned in my previous message.

Basically, vendir interacts with helm the same way you would as a user. So it is possible that at any given moment a second vendir-unused repository is added, and that can cause any sort of issue in helm itself.

I do not know much about what you all are doing, but what is preventing you from calling vendir before you loop over the environments? Are the vendir configurations different depending on the environment? Could you eventually run a script that merges all the vendir configs into one, and execute vendir before doing the loop?

Just throwing some ideas around.

cppforlife commented 1 year ago

My guess would be that the OS is killing the child process due to out-of-memory (I don't know if you can confirm it via a dmesg tail in parallel), and so vendir just sees that the process was killed. Have you tried throttling the number of processes you spin up in parallel, to see if you can reach a non-failure state consistently?
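If the runner's kernel log is accessible, OOM kills usually leave telltale lines. A small sketch of what to grep for — the sample log lines below are made up for illustration, not taken from this CI job:

```shell
#!/bin/sh
# Count lines that look like OOM-killer activity in kernel log output.
# In a real job you would pipe `dmesg` into this instead of the sample.
oom_hits() {
  grep -Eci 'oom-killer|out of memory|killed process'
}

# Hypothetical dmesg excerpt showing a kill of a helm child process:
sample='helm invoked oom-killer: gfp_mask=0x140cca, order=0
Out of memory: Killed process 1234 (helm) total-vm:768000kB'

printf '%s\n' "$sample" | oom_hits   # prints the number of matching lines
```

A `signal: killed` with empty stderr is consistent with a SIGKILL from the OOM killer, since the child never gets a chance to write anything.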

Zebradil commented 1 year ago

Thank you for your help and sorry for my absence.

The problem seems to be resolved with more RAM.

Basically, vendir interacts with helm the same way you would as a user. So it is possible that at any given moment a second vendir-unused repository is added, and that can cause any sort of issue in helm itself.

To me it looks like vendir creates a brand new helm home directory, so parallel vendir processes shouldn't interfere at this point.
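For reference, helm 3 lets each process point its state at separate directories via environment variables, which is roughly how that kind of isolation can be achieved. This sketch only sets up the throwaway directories; whether it matches vendir's exact mechanism is an assumption:

```shell
#!/bin/sh
# Sketch: give a single process its own isolated helm state by
# pointing helm 3's XDG-style env vars at a throwaway directory.
tmp="$(mktemp -d)"
export HELM_CONFIG_HOME="$tmp/config"
export HELM_CACHE_HOME="$tmp/cache"
export HELM_DATA_HOME="$tmp/data"
mkdir -p "$HELM_CONFIG_HOME" "$HELM_CACHE_HOME" "$HELM_DATA_HOME"
# Any `helm repo add` run with these vars set would write only here.
echo "helm state isolated under $tmp"
```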

what is preventing you from calling vendir before you loop over the environments? Are the vendir configurations different depending on the environment? Could you eventually run a script that merges all the vendir configs into one, and execute vendir before doing the loop?

In my project, I maintain the states of many Kubernetes clusters (== environments). Every environment has a set of applications, and every application has its own vendir configuration for fetching sources. The project is structured so that each application in each environment is a separate entity and should be processed separately. Merging multiple vendir configs into one is possible in theory, but it would impose some UX restrictions. Currently, I can specify particular environments and applications to process. Fetching sources is also the slowest part of the flow, and it's nice to skip it when it's not needed.

OS is killing child process due to out-of-memory

Local tests showed that the whole process can consume at least ~750 MB. The default memory limit for CI jobs was 1 GB, so I tend to agree that it was the OOM killer. I can't say for sure, because debugging in a CI environment without access to the runner is a bit problematic.

have you tried throttling number of processes

Not yet, but this is something worth implementing anyway, as the current approach doesn't seem to scale nicely.
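One simple way to throttle, as a sketch: replace the bare `&` loop with `xargs -P`, which caps how many jobs run at once. The limit of 4 and the environment names below are arbitrary placeholders:

```shell
#!/bin/sh
# Run at most 4 jobs concurrently instead of one per environment.
# `-I {}` substitutes each environment name into the command string.
printf '%s\n' dev staging prod sandbox qa | \
  xargs -P 4 -I {} sh -c 'echo "syncing {}"'
```

The same `-P` knob makes it easy to probe for the largest concurrency the runner's memory limit can sustain.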

cppforlife commented 1 year ago

The problem seems to be resolved with more RAM.

@Zebradil anything else we should work through in this thread before we close this?

Zebradil commented 1 year ago

Apologies, I just forgot to close the issue.