firecow / gitlab-ci-local

Tired of pushing to test your .gitlab-ci.yml?

Leaks networks on every run #1407

Open elafontaine opened 2 weeks ago

elafontaine commented 2 weeks ago

Minimal .gitlab-ci.yml illustrating the issue

---
docker_build:
  stage: package
  image: docker:latest
  services:
    - docker:dind
  script:
    - echo "blablabla"

Expected behavior: After the run, the network that was needed for the service and the job container to talk to each other should be cleaned up.

Host information: macOS, gitlab-ci-local 4.55.0

Containerd binary: docker

Additional context: https://github.com/firecow/gitlab-ci-local/blob/master/src/job.ts#L543 (the network created here is not tracked for cleanup)

firecow commented 1 week ago

I cannot reproduce

[two screenshots attached]

Plus the code you are referencing does in fact illustrate that a serviceNetworkId is stored and used in the cleanup function.

elafontaine commented 1 week ago

Hi @firecow,

This is the current output of my "docker network ls" (minus some entries redacted for my company):

➜   docker network ls
NETWORK ID     NAME                     DRIVER    SCOPE
51fbb22039db   bridge                   bridge    local
9fe84a65a347   docker_gwbridge          bridge    local
6df0aef1c8a5   gitlab-ci-local-130397   bridge    local
6e78377fe61e   gitlab-ci-local-200711   bridge    local
29b6022248e3   gitlab-ci-local-201744   bridge    local
176f9fc46cc9   gitlab-ci-local-235698   bridge    local
dd8619c29826   gitlab-ci-local-284263   bridge    local
d9cc612fdb5a   gitlab-ci-local-351190   bridge    local
884c9c02eee9   gitlab-ci-local-371592   bridge    local
d147c413d3f5   gitlab-ci-local-375682   bridge    local
1f7e90481cfc   gitlab-ci-local-501394   bridge    local
cdbf32f7f9e6   gitlab-ci-local-535650   bridge    local
1b4057b7b5f9   gitlab-ci-local-558862   bridge    local
b6b57e9795c8   gitlab-ci-local-574073   bridge    local
7e34c53c5bff   gitlab-ci-local-579972   bridge    local
ccd262ce6df9   gitlab-ci-local-654062   bridge    local
02cb192c820a   gitlab-ci-local-668695   bridge    local
e866a4a3540a   gitlab-ci-local-714030   bridge    local
23964309e2f7   gitlab-ci-local-738116   bridge    local
54d988391d24   gitlab-ci-local-768931   bridge    local
f8da1545297a   host                     host      local
y18vrgxhbg68   ingress                  overlay   swarm

In theory I do see the network being cleaned up afterwards, but in practice something is going wrong. My guess is that I should be seeing some message somewhere based on https://github.com/firecow/gitlab-ci-local/blob/master/src/job.ts#L593, or maybe the asserts on the containers are what caused the network cleanup to be skipped? I believe the latter may be the case, considering those asserts are inside the catch. I'm not familiar with JavaScript's "assert" and its best practices; however, I fail to understand how it wouldn't be an "Error" instance...
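
To illustrate the pattern I suspect, here is a minimal sketch (not the actual job.ts code; the helper names are made up):

```typescript
// Minimal sketch of the suspected pattern, NOT the actual gitlab-ci-local code.
// If an assert inside the cleanup path throws, everything after it is skipped,
// so the service network is never removed.
import assert from "assert";

// Hypothetical helpers standing in for the real docker calls.
async function removeContainer(id: string): Promise<void> { /* docker rm -f <id> */ }
async function removeNetwork(id: string): Promise<void> { /* docker network rm <id> */ }

async function cleanup(containerIds: string[], serviceNetworkId: string | null) {
    for (const id of containerIds) {
        const removed = await removeContainer(id).then(() => true, () => false);
        // assert throws an AssertionError (which IS an instance of Error) when the
        // condition is falsy, aborting the rest of the cleanup below.
        assert(removed, `failed to remove container ${id}`);
    }
    if (serviceNetworkId) {
        await removeNetwork(serviceNetworkId); // never reached if an assert above threw
    }
}
```

If that is what happens, a single failed container removal would be enough to leave a gitlab-ci-local-* network behind.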

ANGkeith commented 1 week ago

These are likely leaked from the test suite; you can replicate it by running npm run test and checking again after a few seconds.

elafontaine commented 1 week ago

I never ran the test suite of gitlab-ci-local. I run gitlab-ci-local from the brew installation.

These are probably leaked from my job that runs docker-in-docker and failed to complete. The failure probably triggered some other failures (containers not being removed, which skipped the rest of the cleanup?)

ANGkeith commented 6 days ago

Hmm, I see, not sure then... I don't really run docker-in-docker pipelines.

Hopefully it's something that is reproducible.

elafontaine commented 6 days ago

I just had the problem again with a job that simply runs from a container... no issue with it that I know of...

For those having the same issue, here is what I ran:

 for network in $(docker network ls); do if [[ "$network" == *"gitlab"* ]]; then echo "$network"; docker network rm $network; fi ; done 
gitlab-ci-local-9409
gitlab-ci-local-9409
gitlab-ci-local-95666
gitlab-ci-local-95666
gitlab-ci-local-130397
gitlab-ci-local-130397
gitlab-ci-local-200711
gitlab-ci-local-200711
gitlab-ci-local-201744
gitlab-ci-local-201744
gitlab-ci-local-235698
gitlab-ci-local-235698
gitlab-ci-local-284263
gitlab-ci-local-284263
gitlab-ci-local-351190
gitlab-ci-local-351190
gitlab-ci-local-371592
gitlab-ci-local-371592
gitlab-ci-local-375682
gitlab-ci-local-375682
gitlab-ci-local-451685
gitlab-ci-local-451685
gitlab-ci-local-501394
gitlab-ci-local-501394
gitlab-ci-local-509319
gitlab-ci-local-509319
gitlab-ci-local-535650
gitlab-ci-local-535650
gitlab-ci-local-536928
gitlab-ci-local-536928
gitlab-ci-local-558862
gitlab-ci-local-558862
gitlab-ci-local-562280
gitlab-ci-local-562280
gitlab-ci-local-574073
gitlab-ci-local-574073
gitlab-ci-local-579972
gitlab-ci-local-579972
gitlab-ci-local-654062
gitlab-ci-local-654062
gitlab-ci-local-668695
gitlab-ci-local-668695
gitlab-ci-local-700167
gitlab-ci-local-700167
gitlab-ci-local-714030
gitlab-ci-local-714030
gitlab-ci-local-738116
gitlab-ci-local-738116
gitlab-ci-local-768931
gitlab-ci-local-768931
gitlab-ci-local-788165
gitlab-ci-local-788165
gitlab-ci-local-859507
gitlab-ci-local-859507
elafontaine commented 5 days ago

I've got it again this morning :) I'm pretty sure it accumulates on job failures...

I'm currently trying to debug a job we have defined that starts a mockserver service container, starts our web component, and then runs tests against the web component. I got a failure in my tests, which fails the job, but there is nothing special about it...

I ran that job at least 50 times yesterday...

elafontaine commented 9 hours ago

Got it again yesterday, and I had it today as well. I will try to notice which "job" leaves networks behind... the problem I have is that my workflow depends on gitlab-ci-local to run anything 😅 (we're bought into the concept that everything needs to be runnable both in CI and locally).

However, the jobs I've been running were just one that starts a mockserver service and another that is a shell job (no relation to docker)...

At this point, I'm pretty sure the network isn't cleaned up when a job that has a service fails... I don't know how I could dig up more information for this ticket. If you have an idea, please let me know.
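
If that turns out to be the cause, one direction could be a cleanup that attempts every step and always tries to remove the network. This is a sketch only, with the same made-up helper names as the earlier snippet, not the project's actual code:

```typescript
// Sketch of a failure-tolerant cleanup (an assumption, not gitlab-ci-local's code):
// attempt every step, collect the errors, and always try to remove the network.
async function removeContainer(id: string): Promise<void> { /* docker rm -f <id> */ }
async function removeNetwork(id: string): Promise<void> { /* docker network rm <id> */ }

async function cleanupTolerant(containerIds: string[], serviceNetworkId: string | null) {
    const errors: unknown[] = [];
    for (const id of containerIds) {
        try {
            await removeContainer(id);
        } catch (e) {
            errors.push(e); // keep going so the network still gets removed
        }
    }
    if (serviceNetworkId) {
        try {
            await removeNetwork(serviceNetworkId);
        } catch (e) {
            errors.push(e);
        }
    }
    if (errors.length > 0) {
        throw new AggregateError(errors, "cleanup finished with errors");
    }
}
```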

elafontaine commented 7 hours ago

OK, I can now say for sure that the leak is happening on successful runs as well...

The job that leaks is using a service with an alias... I have yet to be able to determine what causes the leak... is it the service container not closing fast enough?