concourse / prod

bosh/terraform config for our deployments

Move CI (prod) to run on K8s #36

Open xtreme-sameer-vohra opened 4 years ago

xtreme-sameer-vohra commented 4 years ago

We would like to migrate the CI (prod) deployment to run on K8s

TODOs v1

TODOs v2

deniseyu commented 4 years ago

Deployed a new Vault instance on the gke-hush-house-generic-<some stuff>-2k5g node; it is the only pod in the vault namespace. All prod secrets (at least, the ones we grabbed around 3pm yesterday) have been restored onto it!!

deniseyu commented 4 years ago

If anything goes wrong with this restore and we have to do it again, here's updated operating docs on backing up and restoring Vault from BOSH prod:

https://github.com/pivotal/concourse-ops/wiki/Operating-Vault#backing-up-vault-secrets

xtreme-sameer-vohra commented 4 years ago

Concourse has been deployed using https://github.com/concourse/hush-house/tree/master/deployments/with-creds/ci and is available at https://nci.concourse-ci.org/

xtreme-sameer-vohra commented 4 years ago

Plan is to move it over to ci.concourse-ci.org when we're ready for the switchover.

cirocosta commented 4 years ago

Hey, I noticed that vault is not reserving & limiting resources (https://github.com/concourse/hush-house/blob/master/deployments/with-creds/vault/values.yaml#L3) - it'd be good to add that to the list 😁

something similar to how we do for web

    resources:
      limits:   { cpu: 500m, memory: 256Mi }
      requests: { cpu: 500m, memory: 256Mi }

https://github.com/concourse/hush-house/blob/9d4c53fd7c3da2f5477be594492c38cda1b05ddf/deployments/with-creds/ci/values.yaml#L55-L57

cirocosta commented 4 years ago

Add the papertrail

currently, we use stackdriver logs for hush-house πŸ€”

it might be worth considering whether we have discounts for that (in which case, getting out of papertrail would then be considered a $ pro 😁 )


at some point, we'd need to get the main pipeline continuously redeploying this environment - it might be a thing for another issue (as there are details like which resource-type to use), but it's definitely something to think about.

cirocosta commented 4 years ago

Move metrics for CI over to K8s

that's (in theory) all automatically set up 😁 if you go to metrics-hush-house and change the values in the dropdown to point to the desired namespace, you'll see the metrics for those installations.

naturally, that's just "in theory" hahah, we never really exercised those dashboards with more than 1 deployment (hush-house)

deniseyu commented 4 years ago

I think the callback URL for login is misconfigured - I know it's not ready yet but I was chomping at the bit to try to get 5.7.1 kicked off πŸ˜‚ and tried to do GitHub login, but got an error and was redirected to Hush House.

vito commented 4 years ago

I've updated the client ID and secret in LastPass (hush-house-values-ci). Once the chart is re-deployed with that the redirect should be unborked. (I made a new OAuth application for it.)

xtreme-sameer-vohra commented 4 years ago

Update

@cirocosta, @kcmannem was mentioning that stackdriver is super slow and we'd prefer papertrail - something to discuss

xtreme-sameer-vohra commented 4 years ago

Is there a good checklist we can use to ensure vault is configured securely?

Off the top of my head, but not substantive by any means:

xtreme-sameer-vohra commented 4 years ago

Another thought, should we even use vault on K8s rather than K8s secrets?

cirocosta commented 4 years ago

vault is not reachable from outside of the cluster
vault is not reachable by any workers in the ci [...]

when it comes to reachability, I'd also add:

(to enable these ^ we'd need to enable the use of net policies in the cluster though)

however, should we really care? if we're already assuming that we face intra-network threats and should protect ourselves against them, that's pretty much already distrusting the network completely, to the point where we're protected enough by using techniques such as mTLS. Thus, should we care where in the network we are?
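
For illustration, here's a minimal NetworkPolicy sketch of that kind of restriction - the vault namespace is taken from the thread above, while the ci namespace name and its name: ci label are assumptions, and it only takes effect once a network policy provider is enabled on the cluster:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: vault-allow-ci-web-only
      namespace: vault
    spec:
      # applies to every pod in the vault namespace
      podSelector: {}
      policyTypes:
        - Ingress
      ingress:
        # only allow traffic from the (assumed) ci namespace, on the Vault API port
        - from:
            - namespaceSelector:
                matchLabels:
                  name: ci
          ports:
            - protocol: TCP
              port: 8200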

vito commented 4 years ago

Another thought, should we even use vault on K8s rather than K8s secrets?

I'd like to keep using Vault for dogfooding purposes mainly.

zoetian commented 4 years ago

Since the /vault/data/auth directory got copied directly over from the BOSH-deployed Vault server, the auth policies were preserved so it should be possible for Concourse to use the same TLS cert to authenticate as before - however, when testing this we started to see this error:

$ vault login -method=cert
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/cert/login
Code: 400. Errors:

* tls connection required

(and a similar error appears in the logs of the web pod when it is configured to use the same cert for authentication). It seems reasonable to conclude that TLS must be enabled in order to use a TLS cert for authentication. We generated a self-signed cert with vault.vault.svc.cluster.local as a Common Name (not a SAN) and were in the process of adding the vaultCaCert secret into the Concourse chart, but we got a bit stuck figuring out what the fields of the k8s secret required for the vault server's TLS configuration needed to be. If anyone can suggest a template for this secret, that would be very helpful. (@cirocosta?) We will pick this work back up tomorrow in the late morning.
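
For what it's worth, a sketch of how that secret is commonly shaped - the name is a placeholder, and the vault.crt/vault.key/vault.ca key names follow the standalone-server-with-tls example in the Vault Helm docs, so treat them as assumptions to verify against the chart in use:

    apiVersion: v1
    kind: Secret
    metadata:
      name: vault-server-tls   # placeholder name
      namespace: vault
    type: Opaque
    stringData:
      vault.crt: ""   # PEM-encoded server certificate goes here
      vault.key: ""   # PEM-encoded server private key
      vault.ca: ""    # PEM-encoded CA certificate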

In general, this work is pretty significantly slowed down by other interruptions.

pivotal-bin-ju commented 4 years ago

Would prod.concourse-ci.org make more sense than nci.concourse-ci.org? We already have ci in the domain name.


jamieklassen commented 4 years ago

learnings from yesterday:

learnings from today:

next steps:

zoetian commented 4 years ago

today we committed our changes to the chart and documented the process of rotating the vault TLS cert. tomorrow, we will work on updating https://github.com/concourse/ci/blob/master/pipelines/reconfigure.yml to include a new job, reconfigure-resource-pipelines, which will have, for each base resource type:

cirocosta commented 4 years ago

Hey,

Aside from the tasks above, there's a set of small fixes we need to apply to let the pipeline move on to further steps:

Error: found some variables supported by the Concourse binary that are missing from the helm packaging:

CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_REQUEST_TIMEOUT
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE
CONCOURSE_NEWRELIC_BATCH_DISABLE_COMPRESSION
CONCOURSE_NEWRELIC_BATCH_DURATION
CONCOURSE_NEWRELIC_BATCH_SIZE

Error: found some variables in the bosh packaging that might not be supported by the Concourse binary:

CONCOURSE_MAIN_TEAM_CONFIG
CONCOURSE_MAIN_TEAM_MICROSOFT_GROUP
CONCOURSE_MAIN_TEAM_MICROSOFT_USER
CONCOURSE_MICROSOFT_CLIENT_ID
CONCOURSE_MICROSOFT_CLIENT_SECRET
CONCOURSE_MICROSOFT_GROUPS
CONCOURSE_MICROSOFT_ONLY_SECURITY_GROUPS
CONCOURSE_MICROSOFT_TENANT

Error: found some variables supported by the Concourse binary that are missing from the bosh packaging:

CONCOURSE_CONFIG_RBAC
CONCOURSE_GARDEN_USE_CONTAINERD
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_ANY_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_AUDITOR_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_DEVELOPER_ROLE
CONCOURSE_MAIN_TEAM_CF_SPACE_WITH_MANAGER_ROLE

thanks!

cirocosta commented 4 years ago

update: we got most of those down to a smaller set of flags, but concourse/ci#200 is still not merged, so we stopped going forward w/ the helm-related changes

vito commented 4 years ago

The Vault node got bounced last night and became sealed, so I manually went in and unsealed it using the credentials in LastPass. We should probably find a way to auto-unseal or something so this isn't a constant burden. πŸ€”

deniseyu commented 4 years ago

https://learn.hashicorp.com/vault/operations/autounseal-gcp-kms
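
A rough sketch of wiring that up, assuming the official vault-helm chart's server.standalone.config passthrough (the project, key ring, and crypto key below are placeholders, and the pod's service account needs the Cloud KMS encrypt/decrypt role):

    server:
      standalone:
        config: |
          # ...existing listener/storage stanzas stay as they are...

          # gcpckms auto-unseal, per the linked guide
          seal "gcpckms" {
            project    = "example-gcp-project"   # placeholder project
            region     = "global"
            key_ring   = "vault-unseal"          # placeholder KMS key ring
            crypto_key = "vault-unseal-key"      # placeholder crypto key
          }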

jamieklassen commented 4 years ago

@vito @deniseyu I edited @xtreme-sameer-vohra's top comment with some useful links

kcmannem commented 4 years ago

I've added a task to migrate example pipelines used in https://concourse-ci.org/examples.html to the new cluster.

though I'm not sure where the configs for these lie.

kcmannem commented 4 years ago

Took a look at the diffs between ci-house and prod configs. Here's the diff:

MISSING FROM CI-HOUSE

For untrusted workers we have to setup deny networks

garden:
    deny_networks:
    - 10.0.0.0/16

We deny our host network on the PR workers since we don't want to expose internal workloads to externally-submitted builds. I don't know the host network pool in gke we use, but we already deny a 169.x.x.x subnet so this might already be taken care of.

On the ATC:

default_task_cpu_limit: 1024
default_task_memory_limit: 5GB

x_frame_options: ""

Idk if we wanna continue using these limits.

On the worker:

volume_sweeper_max_in_flight: 3

I'm going to skip this; the default value is already 3. It was set manually because btrfs used to be unstable when we hit the driver too hard.
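
For reference, a rough sketch of how these could translate to the chart's env passthrough - the web.env/worker.env key paths and the CONCOURSE_* spellings are assumptions to double-check against the chart version in use:

    web:
      env:
        - name: CONCOURSE_DEFAULT_TASK_CPU_LIMIT
          value: "1024"
        - name: CONCOURSE_DEFAULT_TASK_MEMORY_LIMIT
          value: "5GB"
        - name: CONCOURSE_X_FRAME_OPTIONS
          value: ""
    worker:
      env:
        # untrusted (PR) workers only - deny the host network as above
        - name: CONCOURSE_GARDEN_DENY_NETWORK
          value: "10.0.0.0/16"
    # volume_sweeper_max_in_flight is skipped, since the default is already 3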

cirocosta commented 4 years ago

Hey,

I don't know the host network pool in gke

module "vpc" {
  source = "./vpc"

  name   = "${var.name}"
  region = "${var.region}"

  vms-cidr      = "10.10.0.0/16"
  pods-cidr     = "10.11.0.0/16"
  services-cidr = "10.12.0.0/16"
}

https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/terraform/cluster/main.tf#L7-L9


we use but we already deny a 169.x.x.x subnet so this might already be taken care of.

the block to 169.254.169.254/32 is only to avoid queries to GCP's metadata server.

(https://github.com/concourse/hush-house/blob/a14d0832ecac5753c138a9287e12a3be375cc1a5/deployments/with-creds/ci-pr/values.yaml#L33-L34)

in https://github.com/concourse/hush-house/pull/75 we tackled most of the issues w/ regards to reaching out to other workloads in the cluster, but I don't think a block on all of 10.0.0.0/8 would hurt with the current configuration (DNS would go to a 10.x.x.x address - kube-dns - but that'd originate from the concourse process through the dns forwarding that we perform).

(on https://github.com/concourse/hush-house/issues/80 I describe how we could & should protect that a bit more)

cirocosta commented 4 years ago

Idk if we wanna continue using these limits.

as long as we're using COS (IIRC, we are for nci), we'll be able to set those values

    onGke(func() {
        containerLimitsWork(COS, TaskCPULimit, TaskMemoryLimit)
        containerLimitsFail(UBUNTU, TaskCPULimit, TaskMemoryLimit)
    })

(from https://github.com/concourse/concourse/blob/a8e001f8e655b442f34ebe8909267747f897469b/topgun/k8s/container_limits_test.go#L26-L29)

but yeah, I'd personally not set them - @vito might have opinions on it? I don't think I was around when we put that on ci

kcmannem commented 4 years ago

@cirocosta thanks!

kcmannem commented 4 years ago

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

Here's a link on how to set it up, if we still want to use papertrail: https://help.papertrailapp.com/kb/configuration/configuring-centralized-logging-from-kubernetes/
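
One common shape for this (not necessarily exactly what the linked doc prescribes) is a log-forwarding DaemonSet; the sketch below uses logspout reading the Docker socket, with the Papertrail destination as a placeholder, and assumes the nodes use the Docker runtime:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: papertrail-logspout
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: papertrail-logspout
      template:
        metadata:
          labels:
            app: papertrail-logspout
        spec:
          containers:
            - name: logspout
              image: gliderlabs/logspout:latest
              # placeholder destination - use the log destination from the Papertrail account
              args: ["syslog+tls://logsN.papertrailapp.com:12345"]
              volumeMounts:
                - name: docker-socket
                  mountPath: /var/run/docker.sock
          volumes:
            - name: docker-socket
              hostPath:
                path: /var/run/docker.sock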

cirocosta commented 4 years ago

idk if this convo happened already but i'd like to keep using papertrail, i find stackdriver really slow and hard to search.

@kcmannem, my biggest reason for going w/ stackdriver would be to leverage logs-based metrics (which we already do for hush-house, but not for ci yet - https://github.com/concourse/hush-house/issues/78). I'm not sure if there'd be a way of doing container logs -> log aggregation service -> grafana graphing in a simple way for papertrail (if we went full datadog, that'd be possible using datadog's log parsing, etc.).

I found that, at least for hush-house, having the logs graphed meant I pretty much never had to go search for log messages - when I did, I knew exactly what to search for (and in which time range, as the dashboard would already show)