drakkan / sftpgo

Full-featured and highly configurable SFTP, HTTP/S, FTP/S and WebDAV server - S3, Google Cloud Storage, Azure Blob
https://sftpgo.com
GNU Affero General Public License v3.0

[Bug]: SFTPGo breaks after a few days with incorrect DB password on PostgreSQL #1752

Closed cristim closed 1 week ago

cristim commented 1 week ago

Bug description

We run SFTPGo in AWS, using the latest image in ECS Fargate and RDS PostgreSQL as database.

We noticed that after a few days of working fine, the SFTPGo service breaks and the logs only show errors like the ones below.

Looking at the DB metrics we noticed this pattern:

[image: DB metrics screenshot, not preserved]

Restarting the container fixes it for a few days, and then it breaks again.

Steps to reproduce

  1. Run SFTPGo in AWS, using Fargate and RDS PostgreSQL
  2. Wait a few days
  3. The server starts failing with incorrect-password errors and the container must be replaced to recover

Expected behavior

The service shouldn't break, or at least the load balancer health check should fail to allow us to recycle the container.

SFTPGo version

SFTPGo 2.6.2-636a1c2c-2024-06-21T17:30:20Z

Data provider

PostgreSQL, on AWS RDS

Installation method

Community Docker image

Configuration

We use the S3 backend, but that shouldn't matter for this.

Here's what our Fargate configuration looks like:

"environment": [
                {
                    "name": "SFTPGO_HTTPD__BINDINGS__0__PORT",
                    "value": "8080"
                },
                {
                    "name": "SFTPGO_DATA_PROVIDER__PORT",
                    "value": "5432"
                },
                {
                    "name": "AWS_REGION",
                    "value": "us-east-1"
                },
                {
                    "name": "SFTPGO_DEFAULT_FILESYSTEM__S3CONFIG__REGION",
                    "value": "us-east-1"
                },
                {
                    "name": "SFTPGO_LOG_HTTP_RESPONSE",
                    "value": "1"
                },
                {
                    "name": "SFTPGO_DEFAULT_FILESYSTEM__PROVIDER",
                    "value": "1"
                },
                {
                    "name": "SFTPGO_DEFAULT_FILESYSTEM__S3CONFIG__BUCKET",
                    "value": "xxxxxxxx"
                },
                {
                    "name": "SFTPGO_HTTPD__SETUP__SKIP_PATHS",
                    "value": "[\"/healthz\"]"
                },
                {
                    "name": "SFTPGO_LOG_LEVEL",
                    "value": "debug"
                },
                {
                    "name": "SFTPGO_DATA_PROVIDER__DRIVER",
                    "value": "postgresql"
                },
                {
                    "name": "SFTPGO_DEFAULT_ADMIN_USERNAME",
                    "value": "admin"
                },
                {
                    "name": "SFTPGO_DATA_PROVIDER__NAME",
                    "value": "ftp"
                },
                {
                    "name": "SFTPGO_DEFAULT_USER__HOME_DIR",
                    "value": "/srv/sftpgo/data/%username%"
                },
                {
                    "name": "SFTPGO_DATA_PROVIDER__CREATE_DEFAULT_ADMIN",
                    "value": "true"
                },
                {
                    "name": "S3_STORAGE_BUCKET",
                    "value": "xxxxxx"
                },
                {
                    "name": "SFTPGO_DEFAULT_FILESYSTEM__S3CONFIG__KEY_PREFIX",
                    "value": "/%username%/"
                },
                {
                    "name": "SFTPGO_DATA_PROVIDER__HOST",
                    "value": "xxxxxx.rds.amazonaws.com"
                },
                {
                    "name": "SFTPGO_SFTPD__HOST_KEYS",
                    "value": "/etc/ssh/ssh_host_rsa_key"
                },
                {
                    "name": "SFTPGO_LOG_HTTP_REQUEST",
                    "value": "1"
                },
                {
                    "name": "SFTPGO_SFTPD__BINDINGS__0__PORT",
                    "value": "2222"
                }
            ],

"secrets": [
                {
                    "name": "SFTPGO_DATA_PROVIDER__USERNAME",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:xxxxx:secret:rds!db-037386b2-73af-4583-b059-833022a348e5-bwXcWw:username::"
                },
                {
                    "name": "SFTPGO_DATA_PROVIDER__PASSWORD",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:xxxxx:secret:rds!db-037386b2-73af-4583-b059-833022a348e5-bwXcWw:password::"
                },
                {
                    "name": "SFTPGO_DEFAULT_ADMIN_PASSWORD",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:xxxxx:secret:/sftpgo/admin_password-IZmHvt"
                }
            ]

Relevant log output

{"level":"debug","time":"2024-08-30T16:59:39.326","sender":"sftpd","message":"failed to accept an incoming connection from ip \"10.20.101.161\": [Authentication error: could not validate password credentials: failed to connect to `user=ftp database=ftp`: 10.20.1.58:5432 (xxxxx.rds.amazonaws.com): failed SASL auth: FATAL: password authentication failed for user \"ftp\" (SQLSTATE 28P01), Authentication error: could not validate keyboard interactive credentials: failed to connect to `user=ftp database=ftp`: 10.20.1.58:5432 (xxxxx.rds.amazonaws.com): failed SASL auth: FATAL: password authentication failed for user \"ftp\" (SQLSTATE 28P01)]"}

In spite of this failure, the load balancer health checks keep returning a healthy status.

{"level":"debug","time":"2024-08-30T16:34:42.684","sender":"httpd","local_addr":"10.20.2.101:8080","method":"GET","proto":"HTTP/1.1","remote_addr":"10.20.2.209:41646","request_id":"ip-10-20-2-101.ec2.internal/pk7Xu7rs8o-290082","uri":"http://10.20.2.101:8080/healthz","user_agent":"ELB-HealthChecker/2.0","resp_status":200,"resp_size":2,"elapsed_ms":0}

What are you using SFTPGo for?

Medium business

Additional info

No response

cristim commented 1 week ago

I also noticed that failures happen at "interesting" times of the day:

The last failure was at 2024-08-26 12:00 UTC, the one before at 2024-08-11 23:00 UTC, and another at 2024-08-04 23:00 UTC.

So it seems there's some sort of time-based trigger to these failures.

cristim commented 1 week ago

I finally traced this to the automated secret rotation in RDS. We expect the issue to be resolved now that we have disabled secret rotation.

It would be nice if SFTPGo could integrate cleanly with RDS secret rotation.
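
For context, ECS resolves `secrets` entries once, when the container starts, so a running task keeps the pre-rotation password until it is replaced. A rough Go sketch of what rotation-aware connecting could look like is below. It is not an existing SFTPGo feature: `Credentials`, `ErrAuth` and `RotatingDialer` are hypothetical names, `Fetch` stands in for a lookup against the secret store (for example Secrets Manager), and `Dial` stands in for opening a PostgreSQL connection and mapping SQLSTATE 28P01 to `ErrAuth`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Credentials holds the database username and password.
type Credentials struct {
	Username string
	Password string
}

// ErrAuth is a hypothetical sentinel for a rejected password
// (SQLSTATE 28P01 in the log output of this issue).
var ErrAuth = errors.New("password authentication failed")

// RotatingDialer caches credentials and re-fetches them once
// when the database rejects the cached password.
type RotatingDialer struct {
	Fetch func() (Credentials, error) // assumed secret-store lookup
	Dial  func(Credentials) error     // assumed connection attempt

	mu     sync.Mutex
	cached *Credentials
}

// Connect dials with the cached credentials; on an auth failure it
// refreshes the credentials and retries exactly once.
func (d *RotatingDialer) Connect() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.cached == nil {
		if err := d.refresh(); err != nil {
			return err
		}
	}
	err := d.Dial(*d.cached)
	if !errors.Is(err, ErrAuth) {
		return err // nil on success, or an unrelated error
	}
	// The secret may have been rotated: fetch it again and retry.
	if rerr := d.refresh(); rerr != nil {
		return fmt.Errorf("refresh credentials after auth failure: %w", rerr)
	}
	return d.Dial(*d.cached)
}

// refresh replaces the cached credentials with freshly fetched ones.
func (d *RotatingDialer) refresh() error {
	c, err := d.Fetch()
	if err != nil {
		return err
	}
	d.cached = &c
	return nil
}
```

Retrying only once keeps a genuinely wrong password from turning into a loop of secret-store calls. A service that cannot do this would instead need its task restarted after each rotation.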