Roblox / nomad-driver-containerd

Nomad task driver for launching containers using containerd.

[question] Will Nomad Consul Connect Envoy proxy work with containerd driver? #59

Open Oloremo opened 3 years ago

Oloremo commented 3 years ago

Hello,

By default, Nomad launches the Envoy proxy as a docker container: https://www.nomadproject.io/docs/job-specification/sidecar_task#default-envoy-configuration

I wonder if it could successfully run with a containerd driver.

shishir-a412ed commented 3 years ago

@Oloremo Have you tried running it?

You should be able to change the driver from docker to containerd-driver and set the image and args.

Oloremo commented 3 years ago

Not yet. We're currently in the PoC stage with Nomad, and I wonder whether you folks have already tried it and are willing to share your experience.

As you can see I wonder if we could use Nomad without the docker engine installed at all.

shishir-a412ed commented 3 years ago

@Oloremo

> Not yet. We're currently in the PoC stage with Nomad, and I wonder whether you folks have already tried it and are willing to share your experience.

No, we haven't tried it. Let me know how it goes when you try it, and whether you run into any issues.

> As you can see I wonder if we could use Nomad without the docker engine installed at all.

This is exactly the use case nomad-driver-containerd was designed for: launching jobs in Nomad directly through containerd, without docker-engine installed at all. If you want to try it in a local environment, you can clone this repo and run vagrant up from the project root directory. That will spin up an Ubuntu VM with just nomad and containerd installed (no docker engine on the VM), where you can try launching some example jobs.

Oloremo commented 3 years ago

Ok, we'll do some Consul Connect related experiments in Q1 2021 and I'll report back if it works.

Oloremo commented 3 years ago

@shishir-a412ed

So I tried to run a countdash Nomad example for Service Mesh using a containerd-driver 0.7.0 and containerd runtime 1.4.3. It has a weird issue.

The job starts, but the Envoy proxy sidecars start and then fail with:

su-exec: setgroups: Operation not permitted

Job spec:

job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
        sidecar_task {
          driver = "containerd-driver"
          config {
            #image = "${meta.connect.sidecar_image}"
            image = "docker.io/envoyproxy/envoy:v1.16.0"

            command = "/docker-entrypoint.sh"
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "${meta.connect.log_level}",
              "--concurrency",
              "${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }

          logs {
            max_files     = 2
            max_file_size = 2 # MB
          }

          resources {
            cpu    = 250 # MHz
            memory = 128 # MB
          }

          shutdown_delay = "5s"
        }
      }
    }

    task "web" {
      driver = "containerd-driver"
      config {
        image = "docker.io/hashicorpnomad/counter-api:v3"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
        sidecar_task {
          driver = "containerd-driver"
          config {
            #image = "${meta.connect.sidecar_image}"
            image = "docker.io/envoyproxy/envoy:v1.16.0"

            command = "/docker-entrypoint.sh"
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "${meta.connect.log_level}",
              "--concurrency",
              "${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }

          logs {
            max_files     = 2
            max_file_size = 2 # MB
          }

          resources {
            cpu    = 250 # MHz
            memory = 128 # MB
          }

          shutdown_delay = "5s"
        }
      }
    }

    task "dashboard" {
      driver = "containerd-driver"
      config {
        image = "docker.io/hashicorpnomad/counter-dashboard:v3"
      }

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }
    }
  }
}

Oloremo commented 3 years ago

I tried adding

privileged = true

to the config {} stanza, with the same result.

Oloremo commented 3 years ago

OK, so the error comes from the Envoy container's entrypoint script: https://github.com/envoyproxy/envoy/blob/v1.16.0/ci/docker-entrypoint.sh

I'm not sure why it fails to run under the containerd driver.

Oloremo commented 3 years ago

@shishir-a412ed Sorry for the ping, I wasn't sure if you saw the test above

shishir-a412ed commented 3 years ago

@Oloremo Sorry, I should have responded earlier 🙂 . We are still working on the initial rollout, and currently integration with consul connect is not super high on the priority list. I will try to find some time this week and see if I can reproduce/debug this.

Oloremo commented 3 years ago

Thanks for the reply!

We'll wait. :) It's not super critical for us right now but we do want to use Service Mesh in the near future.

Oloremo commented 3 years ago

@shishir-a412ed Sorry for the ping, just wanted to check if you had time to check on this.

shishir-a412ed commented 3 years ago

@Oloremo Sorry, I have not been able to get to this. Did you try to debug it further? You can add some print statements to the entrypoint script docker-entrypoint.sh.

e.g.

  1. Take the original script docker-entrypoint.sh.
  2. Create a custom version of it with your added debug statements.
  3. Build a new envoy image with your custom script as the entrypoint, and use that image to debug further.
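
The steps above could be sketched as a small wrapper script. This is only a minimal sketch: the file name debug-entrypoint.sh and the function print_debug_info are hypothetical, and it assumes the original entrypoint lives at /docker-entrypoint.sh, as in the job spec above. Since setgroups needs the CAP_SETGID capability, printing the user ID and the capability masks the container starts with may help narrow things down.

```shell
#!/bin/sh
# debug-entrypoint.sh: hypothetical wrapper around the image's original entrypoint.

# Print the identity and Linux capability masks the container process starts with.
print_debug_info() {
  echo "debug: id: $(id)"
  # Capability bitmasks as inherited by a child of this script (Linux only).
  if [ -r /proc/self/status ]; then
    grep '^Cap' /proc/self/status | sed 's/^/debug: /'
  fi
}

# Write the debug output to stderr so it ends up in the task's stderr log.
print_debug_info >&2

# Hand over to the original entrypoint with the original arguments, if it exists.
if [ -x /docker-entrypoint.sh ]; then
  exec /docker-entrypoint.sh "$@"
fi
```

To use it, copy the script into a custom envoy image and point command in the sidecar_task config at it instead of /docker-entrypoint.sh, keeping the same args.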

I'll try to find some time to debug this further, but I have been juggling a few other things. Internally, containerd-driver has been bumped down in priority, because we have a few other high-priority items to take care of right now.

shishir-a412ed commented 3 years ago

@Oloremo We are doing some internal work around Consul service mesh (not using containerd-driver), which is helpful for me to pick up the background and context. I'll see if I can find some time over the weekend to look into this. No promises :) but I will try to get to this sooner than I was planning.

Oloremo commented 3 years ago

Appreciate that! I wasn't able to get back to this issue either. Please ping me if you need any additional testing.

shishir-a412ed commented 3 years ago

@Oloremo Also, looking at your job spec it looks like you are using both sidecar_service and sidecar_task which might be incorrect.

Looking at the official docs: https://www.nomadproject.io/docs/job-specification/sidecar_task#default-envoy-configuration

> Nomad automatically launches and manages an Envoy task for use as a proxy sidecar or connect gateway, when sidecar_service or gateway are configured.
>
> The default Envoy task is equivalent to:

When you specify:

connect {
    sidecar_service {}
}

Nomad will automatically launch the Envoy sidecar proxy for you. The sidecar_task definition just shows what it looks like under the hood. In your example, the api service (upstream) will only need:

connect {
    sidecar_service{}
}

and the dashboard service (downstream) will need to define count-api as the upstream:

connect {
    sidecar_service {
        proxy {
            upstreams {
                destination_name = "count-api"
                local_bind_port  = 8080
            }
        }
    }
}

I'll take a deeper look over the weekend.

Oloremo commented 3 years ago

> Nomad will automatically launch the Envoy sidecar proxy for you.

With the docker driver, yes. I wanted to launch it with the containerd one. Since changing the driver is not yet possible without fully redefining the sidecar task, I did the full redefinition just to test it.

shishir-a412ed commented 3 years ago

@Oloremo aah I see! Makes sense 🙂

Oloremo commented 2 years ago

@shishir-a412ed Hey just wanted to say that we're still interested in that. :-)

The only thing that stops us from moving to the containerd plugin is Consul Connect.

shishir-a412ed commented 2 years ago

@Oloremo Let me see if I can find some time to make progress on this.

mister2d commented 2 years ago

Hi there. I happened to stumble upon this ticket. Setting the driver for sidecar_task is already supported in Nomad. Here is an example:

job "example" {
    datacenters = ["dc1"]
    group "api" {
        network {
            mode = "bridge"
        }
        service {
            name = "example-api"
            port = "9001"
            connect {
                sidecar_service {}
                sidecar_task {
                    driver = "containerd-driver"
                }
            }
        }
        task "web" {
            driver = "containerd-driver"
            config {
                image = "<image>"
            }
        }
    }
}

https://www.nomadproject.io/docs/job-specification/sidecar_task#driver

Oloremo commented 2 years ago

@mister2d interesting!

I'm trying to test it with:

Jobspec

```
job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
        sidecar_task {
          driver = "containerd-driver"
        }
      }
    }

    task "web" {
      driver = "containerd-driver"
      config {
        image = "hashicorpnomad/counter-api:v3"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
        sidecar_task {
          driver = "containerd-driver"
        }
      }
    }

    task "dashboard" {
      driver = "containerd-driver"
      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }
      config {
        image = "hashicorpnomad/counter-dashboard:v3"
      }
    }
  }
}
```

But I'm getting:

Recent Events:
Time                  Type               Description
2022-04-13T12:25:10Z  Killing            Sent interrupt. Waiting 5s before force killing
2022-04-13T12:25:10Z  Not Restarting     Error was unrecoverable
2022-04-13T12:25:10Z  Failed Validation  2 errors occurred:
    * failed to parse config:
    * Root value must be object: The root value in a JSON-based configuration must be either a JSON object or a JSON array of objects.

MagicRB commented 2 years ago

I can confirm it works for me. I had a docker job running Consul Connect before, and the only thing I changed was the driver, to containerd. Connect still works. I don't really know why, but I'm happy. If you want to dig around my config, it's at https://gitea.redalder.org/RedAlder/systems (I'm using a patched version of this repo that adds support for Nix flakes; the changes I made shouldn't affect Consul Connect though).

Oloremo commented 2 years ago

@MagicRB It's a big repo; I clicked through a few things and saw Docker used as a driver.

Anyway, it's better to test and confirm using the default countdash example, as I did above. That removes all other configuration and makes it reproducible for anyone.