concourse / prod

bosh/terraform config for our deployments

Move CI (prod) to run on K8s #36

Open xtreme-sameer-vohra opened 4 years ago

xtreme-sameer-vohra commented 4 years ago

We would like to migrate the CI (prod) deployment to run on K8s

TODOs v1

TODOs v2

deniseyu commented 4 years ago

Deployed a new Vault instance on the gke-hush-house-generic-<some stuff>-2k5g node; it is the only pod in the vault namespace. All prod secrets (at least, the ones we grabbed around 3pm yesterday) have been restored onto it!!

deniseyu commented 4 years ago

If anything goes wrong with this restore and we have to do it again, here's updated operating docs on backing up and restoring Vault from BOSH prod:

https://github.com/pivotal/concourse-ops/wiki/Operating-Vault#backing-up-vault-secrets

xtreme-sameer-vohra commented 4 years ago

Concourse has been deployed using https://github.com/concourse/hush-house/tree/master/deployments/with-creds/ci and is available at https://nci.concourse-ci.org/

xtreme-sameer-vohra commented 4 years ago

Plan is to move it over to ci.concourse-ci.org when we're ready for the switchover.

cirocosta commented 4 years ago

Hey, I noticed that vault is not reserving & limiting resources (https://github.com/concourse/hush-house/blob/master/deployments/with-creds/vault/values.yaml#L3) - it'd be good to add that to the list 😁

something similar to how we do for web

    resources:
      limits:   { cpu: 500m, memory: 256Mi }
      requests: { cpu: 500m, memory: 256Mi }

https://github.com/concourse/hush-house/blob/9d4c53fd7c3da2f5477be594492c38cda1b05ddf/deployments/with-creds/ci/values.yaml#L55-L57

cirocosta commented 4 years ago

Add the papertrail

currently, we use stackdriver logs for hush-house πŸ€”

it might be worth considering whether we have discounts for that (in which case, getting out of papertrail would then be considered a $ pro 😁 )


at some point, we'd need to get the main pipeline continuously redeploying this environment - it might be a thing for another issue (as there are details like which resource-type to use), but it's definitely something to think about.

cirocosta commented 4 years ago

Move metrics for CI over to K8s

that's (in theory) all automatically set up 😁 if you go to metrics-hush-house and change the values in the dropdown to point to the desired namespace, you'll see the metrics for those installations.

naturally, that's just "in theory" hahah, we never really exercised those dashboards with more than 1 deployment (hush-house)

deniseyu commented 4 years ago

I think the callback URL for login is misconfigured - I know it's not ready yet but I was chomping at the bit to try to get 5.7.1 kicked off πŸ˜‚ and tried to do GitHub login, but got an error and was redirected to Hush House.

vito commented 4 years ago

I've updated the client ID and secret in LastPass (hush-house-values-ci). Once the chart is re-deployed with that the redirect should be unborked. (I made a new OAuth application for it.)

xtreme-sameer-vohra commented 4 years ago

Update

@cirocosta, @kcmannem was mentioning that stackdriver is super slow and we'd prefer papertrail - something to discuss

xtreme-sameer-vohra commented 4 years ago

Is there a good checklist we can use to ensure vault is configured securely?

Off the top of my head, but not substantive by any means:

xtreme-sameer-vohra commented 4 years ago

Another thought, should we even use vault on K8s rather than K8s secrets?

cirocosta commented 4 years ago

vault is not reachable from outside of the cluster
vault is not reachable by any workers in the ci [...]

when it comes to reachability, I'd also add:

(to enable these ^ we'd need to enable the use of net policies in the cluster though)

however, should we really care? if we're already assuming that we face intra-network threats and should protect ourselves against them, that's pretty much already distrusting the network completely, to the point where we're protected enough by using techniques such as mTLS. Thus, should we care where in the network we are?
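
For illustration, here's a minimal NetworkPolicy sketch of that kind of restriction - the vault namespace is taken from the thread above, while the ci namespace name and its name: ci label are assumptions, and it only takes effect once a network policy provider is enabled on the cluster:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: vault-allow-ci-web-only
      namespace: vault
    spec:
      # applies to every pod in the vault namespace
      podSelector: {}
      policyTypes:
        - Ingress
      ingress:
        # only allow traffic from the (assumed) ci namespace, on the Vault API port
        - from:
            - namespaceSelector:
                matchLabels:
                  name: ci
          ports:
            - protocol: TCP
              port: 8200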

vito commented 4 years ago

Another thought, should we even use vault on K8s rather than K8s secrets?

I'd like to keep using Vault for dogfooding purposes mainly.

zoetian commented 4 years ago

Since the /vault/data/auth directory got copied directly over from the BOSH-deployed Vault server, the auth policies were preserved so it should be possible for Concourse to use the same TLS cert to authenticate as before - however, when testing this we started to see this error:

$ vault login -method=cert
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/cert/login
Code: 400. Errors:

* tls connection required

(and a similar error appears in the logs of the web pod when it is configured to use the same cert for authentication). It seems reasonable to conclude that TLS must be enabled in order to use a TLS cert for authentication. We generated a self-signed cert with vault.vault.svc.cluster.local as a Common Name (not a SAN) and were in the process of adding the vaultCaCert secret into the Concourse chart, but we got a bit stuck figuring out what the fields of the k8s secret required for the vault server's TLS configuration needed to be. If anyone can suggest a template for this secret, that would be very helpful. (@cirocosta?) We will pick this work back up tomorrow in the late morning.
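
For what it's worth, a sketch of how that secret is commonly shaped - the name is a placeholder, and the vault.crt/vault.key/vault.ca key names follow the standalone-server-with-tls example in the Vault Helm docs, so treat them as assumptions to verify against the chart in use:

    apiVersion: v1
    kind: Secret
    metadata:
      name: vault-server-tls   # placeholder name
      namespace: vault
    type: Opaque
    stringData:
      vault.crt: ""   # PEM-encoded server certificate goes here
      vault.key: ""   # PEM-encoded server private key
      vault.ca: ""    # PEM-encoded CA certificate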

In general, this work is pretty significantly slowed down by other interruptions.

pivotal-bin-ju commented 4 years ago

Would prod.concourse-ci.org make more sense than nci.concourse-ci.org? We already have ci in the domain name.


jamieklassen commented 4 years ago

learnings from yesterday:

learnings from today:

next steps:

zoetian commented 4 years ago

today we committed our changes to the chart and documented the process of rotating the vault TLS cert. tomorrow, we will work on updating https://github.com/concourse/ci/blob/master/pipelines/reconfigure.yml to include a new job, reconfigure-resource-pipelines, which will have, for each base resource type:

cirocosta commented 4 years ago

Hey,

Aside from the tasks above, there's a set of small fixes we need to apply to let the pipeline move on to further steps:

Error: found some variables supported by the Concourse binary that are missing from the helm packaging:

CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_REQUEST_TIMEOUT
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE
CONCOURSE_NEWRELIC_BATCH_DISABLE_COMPRESSION
CONCOURSE_NEWRELIC_BATCH_DURATION
CONCOURSE_NEWRELIC_BATCH_SIZE

Error: found some variables in the bosh packaging that might not be supported by the Concourse binary:

CONCOURSE_MAIN_TEAM_CONFIG
CONCOURSE_MAIN_TEAM_MICROSOFT_GROUP
CONCOURSE_MAIN_TEAM_MICROSOFT_USER
CONCOURSE_MICROSOFT_CLIENT_ID
CONCOURSE_MICROSOFT_CLIENT_SECRET
CONCOURSE_MICROSOFT_GROUPS
CONCOURSE_MICROSOFT_ONLY_SECURITY_GROUPS
CONCOURSE_MICROSOFT_TENANT

Error: found some variables supported by the Concourse binary that are missing from the bosh packaging:

CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE

thanks!

cirocosta commented 4 years ago

update: we got most of those down to a smaller set of flags, but concourse/ci#200 is still not merged, so we stopped going forward w/ the helm-related changes

vito commented 4 years ago

The Vault node got bounced last night and became sealed, so I manually went in and unsealed it using the credentials in LastPass. We should probably find a way to auto-unseal or something so this isn't a constant burden. πŸ€”

deniseyu commented 4 years ago

https://learn.hashicorp.com/vault/operations/autounseal-gcp-kms
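
A rough sketch of wiring that up, assuming the official vault-helm chart's server.standalone.config passthrough (the project, key ring, and crypto key below are placeholders, and the pod's service account needs the Cloud KMS encrypt/decrypt role):

    server:
      standalone:
        config: |
          # ...existing listener/storage stanzas stay as they are...

          # gcpckms auto-unseal, per the linked guide
          seal "gcpckms" {
            project    = "example-gcp-project"   # placeholder project
            region     = "global"
            key_ring   = "vault-unseal"          # placeholder KMS key ring
            crypto_key = "vault-unseal-key"      # placeholder crypto key
          }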

jamieklassen commented 4 years ago

@vito @deniseyu I edited @xtreme-sameer-vohra's top comment with some useful links

kcmannem commented 4 years ago

I've added a task to migrate example pipelines used in https://concourse-ci.org/examples.html to the new cluster.

though I'm not sure where the configs for these lie.

kcmannem commented 4 years ago

Took a look at the diffs between ci-house and prod configs. Here's the diff:

MISSING FROM CI-HOUSE

For untrusted workers we have to setup deny networks

garden:
    deny_networks:
    - 10.0.0.0/16

We deny our host network on the PR workers since we don't want to expose internal workloads to externally-submitted builds. I don't know the host network pool in gke we use, but we already deny a 169.x.x.x subnet so this might already be taken care of.

On the ATC:

default_task_cpu_limit: 1024
default_task_memory_limit: 5GB

x_frame_options: ""

Idk if we wanna continue using these limits.

On the worker:

volume_sweeper_max_in_flight: 3

I'm going to skip this; the default value is already 3. It was set manually because btrfs used to be unstable when we hit the driver too hard.
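
For reference, a rough sketch of how these could translate to the chart's env passthrough - the web.env/worker.env key paths and the CONCOURSE_* spellings are assumptions to double-check against the chart version in use:

    web:
      env:
        - name: CONCOURSE_DEFAULT_TASK_CPU_LIMIT
          value: "1024"
        - name: CONCOURSE_DEFAULT_TASK_MEMORY_LIMIT
          value: "5GB"
        - name: CONCOURSE_X_FRAME_OPTIONS
          value: ""
    worker:
      env:
        # untrusted (PR) workers only - deny the host network as above
        - name: CONCOURSE_GARDEN_DENY_NETWORK
          value: "10.0.0.0/16"
    # volume_sweeper_max_in_flight is skipped, since the default is already 3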

cirocosta commented 4 years ago

Hey,

I don't know the host network pool in gke

module "vpc" {
  source = "./vpc"

  name   = "${var.name}"
  region = "${var.region}"

  vms-cidr      = "10.10.0.0/16"
  pods-cidr     = "10.11.0.0/16"
  services-cidr = "10.12.0.0/16"
}

https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/terraform/cluster/main.tf#L7-L9


we use but we already deny a 169.x.x.x subnet so this might already be taken care of.

the block to 169.254.169.254/32 is only to avoid queries to GCP's metadata server.

(https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/deployments/with-creds/ci-pr/values.yaml#L33-L34)

in https://github.com/concourse/hush-house/pull/75 we tackled most of the issues w/ regards to reaching out to other workloads in the cluster, but I don't think a block on all of 10.0.0.0/8 would hurt with the current configuration (DNS would go to a 10.x.x.x address - kube-dns - but that'd originate from the concourse process through the dns forwarding that we perform).

(on https://github.com/concourse/hush-house/issues/80 I describe how we could & should protect that a bit more)

cirocosta commented 4 years ago

Idk if we wanna continue using these limits.

as long as we're using COS (IIRC, we are for nci), we'll be able to set those values

    onGke(func() {
        containerLimitsWork(COS, TaskCPULimit, TaskMemoryLimit)
        containerLimitsFail(UBUNTU, TaskCPULimit, TaskMemoryLimit)
    })

(from https://github.com/concourse/concourse/blob/a8e001f8e655b442f34ebe8909267747f897469b/topgun/k8s/container_limits_test.go#L26-L29)

but yeah, I'd personally not set them - @vito might have opinions on it? I don't think I was around when we put that on ci

kcmannem commented 4 years ago

@cirocosta thanks!

kcmannem commented 4 years ago

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

Here's a link on how to set it up, if we still want to use papertrail: https://help.papertrailapp.com/kb/configuration/configuring-centralized-logging-from-kubernetes/
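
One common shape for this (not necessarily exactly what the linked doc prescribes) is a log-forwarding DaemonSet; the sketch below uses logspout reading the Docker socket, with the Papertrail destination as a placeholder, and assumes the nodes use the Docker runtime:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: papertrail-logspout
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: papertrail-logspout
      template:
        metadata:
          labels:
            app: papertrail-logspout
        spec:
          containers:
            - name: logspout
              image: gliderlabs/logspout:latest
              # placeholder destination - use the log destination from the Papertrail account
              args: ["syslog+tls://logsN.papertrailapp.com:12345"]
              volumeMounts:
                - name: docker-socket
                  mountPath: /var/run/docker.sock
          volumes:
            - name: docker-socket
              hostPath:
                path: /var/run/docker.sock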

cirocosta commented 4 years ago

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

@kcmannem, my biggest reason for going w/ stackdriver would be to leverage logs-based metrics (which we already do for hush-house, but not for ci yet - https://github.com/concourse/hush-house/issues/78). I'm not sure if there'd be a way of doing container logs -> log aggregation service -> grafana graphing in a simple way for papertrail (if we went full datadog, that'd be possible using datadog's log parsing, etc.).

I found that, at least for hush-house, having the logs graphed meant I pretty much never had to go search for log messages - when I did, I knew exactly what to search for (and in which time range, as the dashboard would already show)