concourse / concourse

Concourse is a container-based continuous thing-doer written in Go.
https://concourse-ci.org
Apache License 2.0
7.38k stars 847 forks

Certificate Propagation is broken #5190

Open siennathesane opened 4 years ago

siennathesane commented 4 years ago

I'm filing this here instead of the bosh release repo because it was surfaced via bosh but can be verified manually.

Bug Report

Inheriting CONCOURSE_CERTS_DIR or certs_path breaks the certificate pool configuration. On Linux, external certificates are normally installed into /etc/ssl/certs by the update-ca-certificates command. For that command to pick them up, the new certificates have to exist in /usr/local/share/ca-certificates; they are then appended to /etc/ssl/certs/ca-certificates.crt.

I found this bug when I created https://github.com/concourse/concourse-bosh-release/pull/92 as a way to keep the BOSH release current. When I applied that patch to my system, I noticed the tasks were immediately failing:

resource script '/opt/resource/check []' failed: exit status 1

stderr:
failed to ping registry: 2 error(s) occurred:

* ping https: Get https://registry-1.docker.io/v2/: x509: certificate signed by unknown authority
* ping http: Get https://registry-1.docker.io/v2/: x509: certificate signed by unknown authority

The only way the Docker registry's certificate would no longer be trusted is if the CA certificates that vouch for it no longer existed in the container. From what I can tell, here is the bug:

return garden.BindMount{
    SrcPath: volume.Path(),
    DstPath: "/etc/ssl/certs",
    Mode:    garden.BindMountModeRO,
}, true, nil

Because the tasks are failing to run the resource checks (can't pull the container), builds also fail to be intercepted:

fly i -u https://airport.r3t.io/teams/infrastructure/pipelines/deploy-infrastructure/jobs/prepare-kubernetes/builds/8
no containers matched your search parameters!

they may have expired if your build hasn't recently finished.

Pipeline: [image]

Resource: [image]

Steps to Reproduce

Ops file reference:

---
- type: replace
  path: /instance_groups/name=worker/jobs/name=worker/properties/certs_path?
  value: /usr/local/share/ca-certificates

  1. Create a self-signed certificate.
  2. Deploy a BOSH director with the certificate in the director.trusted_certs property.
  3. Deploy Concourse via the BOSH release and add the referenced ops file so it's compatible with the current version of BOSH as referenced in https://github.com/concourse/concourse-bosh-release/pull/92.
  4. Try to run a task; it should fail with the same Docker error.
  5. Try to intercept any container; it should also fail.

Expected Results

The self-signed certificates should be mounted in the container at the same path in which they are referenced via CONCOURSE_CERTS_DIR or certs_path.

Actual Results

The certificates were configured properly but overwrote /etc/ssl/certs, which (from what I can tell) removed the existing bundled certificates.

Additional Context

I believe DstPath should also be volume.Path() because the goal is to mount the extra certificates at the path in which they exist on the host, not overwrite certificates from a different directory. The certificate mount should be additive, not replacing.

Version Info

Edit: added pictures for humans.

siennathesane commented 4 years ago

@vito would you be able to have someone take a look and/or verify this is or is not the expected behaviour?

siennathesane commented 4 years ago

👈

siennathesane commented 4 years ago

Any updates on this?

cirocosta commented 4 years ago

Hey @mxplusb,

sorry for the long time to get back to you :(

> I believe DstPath should also be volume.Path() because the goal is to mount the extra certificates at the path in which they exist on the host, not overwrite certificates from a different directory. The certificate mount should be additive, not replacing.

That's a great line of thought, but I believe it can be tricky when you don't control the resource types the installation uses - the behavior for loading the certificates is very specific to the resource types that use them.

e.g., the way that a Go-based resource type loads its root certificates

// Possible certificate files; stop after finding one.
var certFiles = []string{
    "/etc/ssl/certs/ca-certificates.crt",                // Debian/Ubuntu/Gentoo etc.
    "/etc/pki/tls/certs/ca-bundle.crt",                  // Fedora/RHEL 6
    "/etc/ssl/ca-bundle.pem",                            // OpenSUSE
    "/etc/pki/tls/cacert.pem",                           // OpenELEC
    "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem", // CentOS/RHEL 7
    "/etc/ssl/cert.pem",                                 // Alpine Linux
}

https://github.com/golang/go/blob/b49d8ce2fa66df6e201a3e7e89c42003e7b7a76a/src/crypto/x509/root_linux.go#L7-L15

could be very different from the way that Java does it (no idea how), but the point is that it's hard to tell what the right place is, and we recognize that this can be improved.

it's true that hardcoding the mount at /etc/ssl/certs inside the container might not always work, but I'm afraid it's actually better than something that depends on the workers you're placed on (something the resource types should not care about).

some time ago, @vito called out that it'd be nice to get some more thought about this, asking for an RFC on it: https://github.com/concourse/rfcs/issues/9

nowadays, I think most of the movement around this is about improving certificate handling at a more fundamental level - what we're thinking of as prototypes (in the case of resources, a resource prototype): https://github.com/concourse/rfcs/pull/37/commits/4b3c95d09de0b60cf46cd49cda766bf51e2df3af

that's to say that I think there's room for improvement on certificate propagation at a more fundamental level :thinking:

(btw, I just finished looking at the BOSH PR :grin: thanks for bringing up!)

please let me know what you think about it! happy to discuss any of the points above.

thx!

siennathesane commented 4 years ago

I think the easiest course of action right now, while the long-term design is being worked out (even though it's been an open question for nearly 2 years), is to expose a configuration option so platform owners can work around the current state of things.

I could see there being some CONCOURSE_CERTS_MOUNT_PATH variable which defaults to SrcPath: volume.Path() but could be overridden to be SrcPath: config.CertsMountPath() or something similar. That way it doesn't really change the status quo, but operators can expose certificates on any given path if it's set.

I can likely put in that fix if that's a desirable path.
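As a rough sketch of what I mean, under a couple of assumptions: the override applies to the destination path inside the container, and the default keeps today's hardcoded /etc/ssl/certs so nothing changes unless the variable is set. CONCOURSE_CERTS_MOUNT_PATH does not exist today, BindMount is a local stand-in for garden.BindMount so the example compiles on its own, and certsMountPath and certsBindMount are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
)

// BindMount is a local stand-in for garden.BindMount, reduced to the
// fields this sketch needs.
type BindMount struct {
	SrcPath  string
	DstPath  string
	ReadOnly bool
}

// defaultCertsMountPath is the destination that is hardcoded today.
const defaultCertsMountPath = "/etc/ssl/certs"

// certsMountPath returns the hypothetical CONCOURSE_CERTS_MOUNT_PATH
// override if it is set, and the current default otherwise.
// getenv is injected so the lookup is easy to test.
func certsMountPath(getenv func(string) string) string {
	if p := getenv("CONCOURSE_CERTS_MOUNT_PATH"); p != "" {
		return p
	}
	return defaultCertsMountPath
}

// certsBindMount builds a read-only mount from the host certificate
// volume to the configured destination.
func certsBindMount(volumePath string, getenv func(string) string) BindMount {
	return BindMount{
		SrcPath:  volumePath,
		DstPath:  certsMountPath(getenv),
		ReadOnly: true,
	}
}

func main() {
	m := certsBindMount("/usr/local/share/ca-certificates", os.Getenv)
	fmt.Printf("%s -> %s (read-only: %t)\n", m.SrcPath, m.DstPath, m.ReadOnly)
}
```

With the variable unset, the mount is identical to the current one, so existing deployments would not see a difference.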

vito commented 4 years ago

@mxplusb It's been an open question for two years because no one has contributed a proposal. 😅 It seems like you have thoughts/experience with this, so if you'd like to see movement on it, channeling that energy towards helping - either by providing feedback on the existing proposals which touch on it, or writing a proposal of your own - would be more productive.

For example, resolving this issue is one of the goals of the prototypes RFC, but no one has provided any feedback on it. I have a rough idea of what we could do (explicitly configure certs somewhere and have them propagate to all resource types) - but without feedback I can't really know if these ideas would even help.

This is a very tricky problem to tackle with the "mounting arbitrary paths into containers" approach. Something is going to break, especially if the number of paths which may be clobbered in the container image becomes something resource type authors can't predict. Frankly I don't know if there's value in changing anything about the current approach - even with the path limited to just /etc/ssl/certs, we broke various Java-based resource types. Changing to other paths would likely be a lateral move that just breaks other images on your deployment.

siennathesane commented 4 years ago

I do agree it's a non-trivial problem, but I'm not using Concourse much these days (and I don't know when I'll use it again), so I'm not sure I can contribute beyond conversations for the foreseeable future.

I don't think prototypes are an appropriate vehicle for certificates in this context. Nearly all the use cases I can think of treat certificates as platform configuration, whereas prototypes (based on my understanding) are inherently designed around a tenancy artifact. The problem with certificates, in this context, is that they can impact not only Concourse itself but also the pipeline being executed. In my mind, that breaks the abstraction of the workflow engine and the pipeline construct. If you break the abstraction between platform configuration and the workflow engine (Concourse), then you're opening up the platform configuration to be mutable, which seems like an inherent design problem. Granted, a self-deploying Concourse technically breaks this abstraction, but that's a deployment methodology, not an abstraction problem.

I think the way Concourse handles certificates for itself and how Concourse handles certificates for pipelines (maybe prototypes would be good) are two different problems, and this context revolves around how Concourse's own configuration doesn't work as intended.

So, thinking it through, I think the right course of action would be to split the certificate problem into two separate constructs/designs: what certificates does Concourse need for itself (e.g. the AWS RDS root certificates to talk to its database), and what certificates does a pipeline need and how does it consume them (maybe prototypes)?

I might have rambled but I think I explained well, lemme know if I didn't. :)

hemna commented 3 years ago

I am seeing this as well, trying to use docker-buildx-resource

selected worker: b13b868a1bbe

resource script '/opt/resource/check []' failed: exit status 1

stderr:
failed to ping registry: 2 error(s) occurred:

* ping https: Get "https://registry-1.docker.io/v2/": x509: certificate signed by unknown authority
* ping http: Get "https://registry-1.docker.io/v2/": x509: certificate signed by unknown authority

mjenk664 commented 3 years ago

Hi all,

I ran into this exact same issue with a customer I am currently working with and I think I may have found where the bottleneck/gap is when adding your self-signed/internal CA certificates to Concourse.

Initial Steps Performed:

  1. Added the internal CA certs to the BOSH Director's trusted-certificates
  2. Performed a bosh deploy -d concourse
  3. Logged into Concourse and could still see the x509: certificate signed by unknown authority error

After reading the Concourse Documentation on certificate propagation, it states that the Worker VMs should automatically propagate all of the certs in /etc/ssl/certs to the Resource Containers. Therefore, I didn't understand why we were still facing the x509 errors.

Steps to Fix the issue:

  1. Add the internal CA certs to the BOSH Director's trusted-certificates
  2. Recreate the deployment by running: bosh deploy -d concourse --recreate
  3. Log in to Concourse and perform a check against the resource (the error should go away)

It appears that the normal bosh deploy of concourse does not "re-propagate" the certificates after the Worker VMs have already been created... I'm not sure why this is, but forcing a recreate of the deployment is what solved our issue.

I hope this helps others! Please let me know if this fixes the problem for you too.