bluesky-social / ozone

web interface for labeling content in atproto / Bluesky
https://atproto.com

DB connection being rejected with Zalando Postgres Operator #75

Open verdverm opened 7 months ago

verdverm commented 7 months ago

Hi, first off, thanks for making this awesome service open source! I'm really liking the ATProto paradigm.

I'm having issues getting Ozone up in Kubernetes. The database connection appears to be rejected, with the error below coming from the Ozone container.

I've verified that the username and password work with the psql client.

kubectl logs -n bsky -f pod/ozone-84df5c9966-kh8vq
error: pg_hba.conf rejects connection for host "10.28.0.40", user "ts", database "ozone", no encryption
    at /usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:86974:15
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async PostgresDriver.acquireConnection (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:119099:20)
    at async RuntimeDriver.acquireConnection (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:117251:24)
    at async DefaultConnectionProvider.provideConnection (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:117174:24)
    at async DefaultQueryExecutor.executeQuery (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:114387:12)
    at async SelectQueryBuilder.execute (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:113294:20)
    at async PostgresIntrospector.getTables (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:119025:24)
    at async #doesTableExists (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:118841:20)
    at async #ensureMigrationTableExists (/usr/src/ozone/node_modules/@atproto/ozone/dist/index.js:118800:10) {
  length: 159,
  severity: 'FATAL',
  code: '28000',
  detail: undefined,
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: undefined,
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'auth.c',
  line: '468',
  routine: 'ClientAuthentication'
}

The Kubernetes manifest

apiVersion: v1
kind: List
items:
  - apiVersion: acid.zalan.do/v1
    kind: postgresql
    metadata:
      name: psql
      namespace: bsky
      labels:
        team: acid
    spec:
      teamId: acid
      volume:
        size: 2Gi
      numberOfInstances: 1
      users:
        zalando:
          - superuser
          - createdb
        ts:
          - login
      databases:
        ozone: ts
      postgresql:
        version: "15"
        parameters:
          password_encryption: scram-sha-256
  - apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ozone
      namespace: bsky
      labels:
        app: ozone
    spec:
      selector:
        matchLabels:
          app: ozone
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0
      minReadySeconds: 5
      template:
        metadata:
          labels:
            app: ozone
        spec:
          containers:
            - name: ozone
              image: ghcr.io/bluesky-social/ozone:0.1.3
              imagePullPolicy: Always
              env:
                - name: NODE_ENV
                  value: production
                - name: DATABASE_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: ts.psql.credentials.postgresql.acid.zalan.do
                      key: password
                - name: OZONE_DB_POSTGRES_URL
                  value: postgresql://ts:$(DATABASE_PASSWORD)@psql:5432/ozone
                - name: OZONE_PUBLIC_URL
                  value: https://ozone.topicalsource.com
                - name: OZONE_DID_PLC_URL
                  value: https://plc.directory
                - name: OZONE_APPVIEW_URL
                  value: https://api.bsky.app
                - name: OZONE_APPVIEW_DID
                  value: did:web:api.bsky.app
                - name: OZONE_DB_MIGRATE
                  value: "1"
                - name: LOG_ENABLED
                  value: "1"
              envFrom:
                - secretRef:
                    name: ozone-env
              resources:
                requests:
                  cpu: 200m
                  memory: 256Mi
              ports:
                - containerPort: 3000
                  protocol: TCP
          volumes:
            - name: ozone-env
              secret:
                secretName: ozone-env

ENV vars in ozone-env

OZONE_SERVER_DID=...
OZONE_MODERATOR_DIDS=...
OZONE_ADMIN_DIDS=...
OZONE_ADMIN_PASSWORD=...
OZONE_SIGNING_KEY_HEX=...

verdverm commented 7 months ago

Follow-up: I can connect with psql from within the Ozone container, so the credentials and network path are fine.

/usr/src/ozone # psql postgresql://ts:P8uvpCrEheVuCAPYYPD7Ogt5BaiVPHkO06IULHXLVP7AIl7BkQ48IGCM321VLX6T@psql:5432/ozone
psql (15.6)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

ozone=>

To get a root shell in the container, I added the following to the pod spec

          securityContext:
            runAsUser: 0
            runAsGroup: 0
            fsGroup: 0

and then exec'd in to run apk --update add postgresql-client, followed by psql.

verdverm commented 7 months ago

The config object (right before db connect) looks reasonable

https://github.com/bluesky-social/ozone/blob/main/service/index.js#L21

{
  service: {
    port: 3000,
    publicUrl: 'https://ozone.topicalsource.com',
    did: '...',
    version: '0.1.3',
    devMode: undefined
  },
  db: {
    postgresUrl: 'postgresql://ts:<...password...>@psql:5432/ozone',
    postgresSchema: 'public',
    poolSize: undefined,
    poolMaxUses: undefined,
    poolIdleTimeoutMs: undefined
  },
  appview: {
    url: 'https://api.bsky.app',
    did: 'did:web:api.bsky.app',
    pushEvents: false
  },
  pds: null,
  cdn: { paths: [] },
  identity: { plcUrl: 'https://plc.directory' },
  blobDivert: null,
  access: {
    admins: [ '...' ],
    moderators: [ '...' ],
    triage: []
  }
}

Side note: I did set OZONE_VERSION=0.1.3 in the container. The 0.1.3 container was showing 0.1.1 by default.

verdverm commented 7 months ago

Reproducing:

# Start kind cluster
kind create cluster  --name repro

# Add repo for postgres-operator
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator

# Install the postgres-operator
helm install postgres-operator postgres-operator-charts/postgres-operator

# Install the reproducer code
kubectl create namespace bsky
kubectl create secret generic -n bsky ozone-env --from-env-file reproducer.env       # extra env file from OP comment
kubectl apply -f reproducer.yaml                                                     # <--- the yaml content in OP comment

# Check the logs
kubectl get pods -n bsky
kubectl logs -n bsky <pod>

# Cleanup
kind delete cluster --name repro

verdverm commented 7 months ago

pg_hba.conf

root@psql-0:/home/postgres/pgdata/pgroot/data# cat pg_hba.conf
# Do not edit this file manually!
# It will be overwritten by Patroni!
local     all             all                           trust
hostssl   all             +zalandos   127.0.0.1/32      pam
host      all             all         127.0.0.1/32      md5
hostssl   all             +zalandos   ::1/128           pam
host      all             all         ::1/128           md5
local     replication     standby                       trust
hostssl   replication     standby     all               md5
hostnossl all             all         all               reject
hostssl   all             +zalandos   all               pam
hostssl   all             all         all               md5

bnewbold commented 7 months ago

Just to be transparent, I think this is specific enough to your setup that we (the Bluesky team) probably aren't going to jump in and help debug.

I do encourage you to post updates here though, might help others trying to do a similar deployment!

verdverm commented 7 months ago

@bnewbold understandable, but it might be a security hardening thing too, which might be in Bluesky's interest.

Looking at a pg container (docker hub postgres) that works, the pg_hba.conf is mostly set to trust, though maybe that last line is the one that matters

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust

host all all all scram-sha-256

verdverm commented 7 months ago

Or maybe this line from the Zalando pg_hba.conf above is what is blocking the Ozone connection...?

hostnossl all all all reject
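
A rough way to check this hypothesis is to model how PostgreSQL reads pg_hba.conf: records are tried in order and the first one whose type, database, user, and address match decides the outcome. The sketch below is a simplified model, not PostgreSQL's actual matcher. It uses only the three network rules from the Zalando file above, treats "all" as a wildcard, does not implement CIDR or group (+role) matching, and the function names are made up for illustration.

```javascript
// Simplified pg_hba.conf model: first matching record wins.
// Assumptions: only "all" and exact literal matches are handled;
// "+zalandos" is treated as a literal, so user "ts" never matches it.
const zalandoRules = [
  "hostnossl all all all reject",
  "hostssl all +zalandos all pam",
  "hostssl all all all md5",
];

// "host" covers both TLS and plain TCP; "hostssl" and "hostnossl" cover one each.
function typeMatches(type, usesSsl) {
  if (type === "host") return true;
  if (type === "hostssl") return usesSsl;
  if (type === "hostnossl") return !usesSsl;
  return false; // "local" applies to Unix sockets only
}

function fieldMatches(pattern, value) {
  return pattern === "all" || pattern === value;
}

function firstMatchingMethod(rules, conn) {
  for (const line of rules) {
    const [type, database, user, address, method] = line.trim().split(/\s+/);
    if (
      typeMatches(type, conn.ssl) &&
      fieldMatches(database, conn.database) &&
      fieldMatches(user, conn.user) &&
      fieldMatches(address, conn.address)
    ) {
      return method;
    }
  }
  return "reject"; // no record matched, so the connection is refused
}

// Values taken from the error message in the first comment.
const ozone = { database: "ozone", user: "ts", address: "10.28.0.40" };
console.log(firstMatchingMethod(zalandoRules, { ...ozone, ssl: false })); // reject
console.log(firstMatchingMethod(zalandoRules, { ...ozone, ssl: true })); // md5
```

Under this model the unencrypted connection stops at the hostnossl record, which is consistent with the "no encryption" wording in the error, while the same connection over TLS falls through to the md5 record, which is consistent with psql succeeding over SSL.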

verdverm commented 7 months ago

More breadcrumbs indicating this reject line may be the problematic setting in Zalando

https://github.com/zalando/postgres-operator/issues/1034#issuecomment-984760984

verdverm commented 7 months ago

@bnewbold is there any interest on the Bluesky team in supporting SSL for DB connections?

https://node-postgres.com/features/ssl
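
For reference, node-postgres accepts an ssl option alongside connectionString when a Pool or Client is constructed. The sketch below only builds such a config object. buildPoolConfig is a hypothetical helper, not something Ozone provides, and whether Ozone offers a way to pass this option through is not established in this thread.

```javascript
// Hypothetical helper: build a node-postgres config object that requests TLS.
function buildPoolConfig(postgresUrl, { verifyCertificate = true } = {}) {
  return {
    connectionString: postgresUrl,
    // This object is handed to Node's TLS layer. Setting rejectUnauthorized
    // to false still encrypts the connection but skips certificate
    // verification, which may be needed for self-signed certificates.
    ssl: { rejectUnauthorized: verifyCertificate },
  };
}

// Placeholder credentials, for illustration only.
const config = buildPoolConfig("postgresql://ts:secret@psql:5432/ozone", {
  verifyCertificate: false,
});
console.log(config.ssl);

// With the pg package installed, the object would be used like this:
//   const { Pool } = require("pg");
//   const pool = new Pool(config);
```

Skipping verification is a trade-off: it gets past the hostnossl rule but does not authenticate the server, so supplying the cluster's CA certificate would be the safer long-term setting.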

verdverm commented 7 months ago

This looks like the easiest escape hatch in Zalando. I will try it in the coming days and report back.

https://github.com/zalando/postgres-operator/blob/master/manifests/complete-postgres-manifest.yaml#L125

TimBurga commented 6 months ago

I just ran into the same problem in Azure and I think the resolution was to add this environment variable to the ozone container in the compose file:

PGSSLMODE: require
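
In the Kubernetes manifest from the first comment, the equivalent would presumably be an extra env entry named PGSSLMODE with the value require. One way to confirm that the setting actually reaches the Node process is a small startup check like the sketch below. describeTlsSettings is a hypothetical helper, and checking the sslmode query parameter on the URL is an extra assumption about where such a setting might be supplied.

```javascript
// Hypothetical diagnostic: report which TLS-related settings the process can see.
function describeTlsSettings(env) {
  let urlSslMode = null;
  if (env.OZONE_DB_POSTGRES_URL) {
    try {
      // Read an sslmode query parameter from the connection URL, if present.
      urlSslMode = new URL(env.OZONE_DB_POSTGRES_URL).searchParams.get("sslmode");
    } catch {
      urlSslMode = null; // URL could not be parsed; report nothing for it
    }
  }
  return {
    pgSslMode: env.PGSSLMODE ?? null,
    urlSslMode,
  };
}

console.log(describeTlsSettings(process.env));
```

If both fields come back null inside the container, the variable was not picked up by the deployment and the connection will most likely still be attempted without encryption.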