clyso / chorus

s3 multi provider data lifecycle management
Apache License 2.0

Unable to use docker compose of chorus with isSecure: false #37

Closed: LewisSDAFNI closed 1 month ago

LewisSDAFNI commented 1 month ago

While trying to use Chorus to replicate data from staging instances of Minio in one Kubernetes cluster (old cluster, oc) to another (new cluster, nc), we encountered a few issues.

We are using docker compose to build Chorus with an agent:

docker compose -f ./docker-compose/docker-compose.yml --profile agent up --force-recreate

The connection to main seems to be configured fine; however, the follower is giving us problems. We have assigned two URLs to our nc staging Minio instance: one uses http and one uses https, and both have self-signed certificates.

If we use the https URL assigned to the follower Minio instance, the composition is fine, and chorctl dash and chorctl storage work as expected, but bucket replication fails because the https route does not have a valid certificate. If we could convince Chorus to trust our self-signed certificates, this issue would be resolved; without that, we are not currently able to provide the trusted x.509 certificate it is requesting.

If we use the http URL, with the necessary flag isSecure: false, the composition fails on the worker with the following:

worker-1  | {"level":"error","error":"unable to create client for storage \"follower\": s3 is offline: The request signature we calculated 
does not match the signature you provided. Check your key and signing method.","time":"2024-07-26T09:36:40Z","message":"critical 
error. Shutdown application"}`

This Chorus instance is hosted on a virtual machine on our own hosted vSphere cluster. The connection and credentials being used have been verified by installing the Minio client locally, and we are able to list all buckets for both the nc and oc instances.

Either a solution that enables us to give Chorus a certificate authority to trust, or an explanation of how to solve the "offline s3" issue, would be appreciated.

arttor commented 1 month ago

Hi, unfortunately, self-signed certificates are not supported by chorus.

Some context: the isSecure option indicates the s3 URL scheme: http for isSecure: false and https for isSecure: true. The scheme is required to work correctly with the s3 client libraries used by chorus, such as rclone and the minio go sdk.
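
For illustration, a follower served over plain http would be configured roughly like this in the storage section (a sketch; the address is a placeholder and the provider value is an assumption):

storage:
  storages:
    follower:
      address: minio-staging.example.com:9000
      provider: Minio
      isSecure: false   # plain http endpoint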

To use self-signed certificates, we would have to provide your custom CA to these libraries or disable the CA check there. Right now there is no option like disable-ca-check or ca-cert in the chorus config, so it is not possible to propagate these options to the mentioned s3 client libraries (rclone, minio-go).

Possible solutions:

  1. Use http.
  2. Use https with a CA certificate, e.g. from letsencrypt.
  3. Implement a disable-ca-check option in chorus.

For option 3, you can adjust the minio client config with a custom transport and find a way to set the rclone --no-check-certificate option on the rclone lib in chorus.
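
As a minimal standalone sketch of the minio-go side (not chorus code; the endpoint and keys are placeholders):

package main

import (
	"crypto/tls"
	"log"
	"net/http"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Transport that skips certificate verification.
	// WARNING: this disables TLS validation entirely; only
	// suitable for self-signed test setups.
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}

	client, err := minio.New("minio-staging.example.com", &minio.Options{
		Creds:     credentials.NewStaticV4("<access-key>", "<secret-key>", ""),
		Secure:    true, // https, but verification is skipped by the transport above
		Transport: transport,
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = client // hand the client to the healthcheck / replication code
}

On the rclone side, the --no-check-certificate CLI flag maps to a global http option in the library config, which would need to be exposed through the chorus config in the same way.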

LewisSDAFNI commented 1 month ago

Hi @arttor,

I had worried this would be the case. I can see that it is not straightforward to implement, but it would be a useful feature.

For our production services, when it comes to that stage, we will be able to use option 2 with letsencrypt, and I can begin to look at option 3, as it was not something I had considered before (I will let you know how it goes).

However, option 1 is something that we had already tried. We've altered the Kubernetes HTTPRoute and Gateway to use an http address as opposed to https. When we try to use the agent docker-compose, an error is reported from the worker:

{"level":"error","error":"unable to create client for storage \"follower\": s3 is offline: The request signature we calculated does not match the signature you provided. Check your key and signing method.","time":"2024-07-30T08:01:25Z","message":"critical error. Shutdown application"}

A curl request such as curl -I http://minio-staging-s3api.flux-test.dafni.rl.ac.uk:80/minio/health/live -v is fine:

* Trying ...
* TCP_NODELAY set
* Connected to <...> port 80 (#0)
> HEAD <>health/live HTTP/1.1
> Host: <>
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< accept-ranges: bytes
< content-length: 0
< server: envoy
< strict-transport-security: max-age=31536000; includeSubDomains
< vary: Origin
< x-amz-id-2: <>
< x-amz-request-id: <>
< x-content-type-options: nosniff
< x-xss-protection: 1; mode=block
< date: Tue, 30 Jul 2024 08:02:55 GMT
< x-envoy-upstream-service-time: 0

Does this suggest that the HTTPRoute or Gateway is misconfigured for the http address, or that the Chorus configuration has been done incorrectly?

arttor commented 1 month ago

This is strange. I think the error is related to the s3 auth signature, not TLS, so the problem is that minio cannot validate the credentials. Is there a chance that minio or chorus is using the wrong domain or IP to calculate the signature?
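
For background: AWS Signature V4 includes the Host header in the signed portion of the request, so a gateway or proxy that rewrites Host (or the port) between the client and minio produces exactly this kind of mismatch. A signed request carries an Authorization header roughly like:

Authorization: AWS4-HMAC-SHA256 Credential=<access-key>/20240730/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=<hex>

If the server recomputes the signature over a different Host value than the one the client signed, the signatures differ and the request is rejected with this error.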

I have not tested chorus with minio, only with aws, ceph, and fake-s3. But perhaps you can reproduce it with a docker-compose of minio and chorus and post it here?
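
A minimal reproduction could start from something like this (a sketch; the minio credentials are placeholders, and the chorus services would come from the existing docker-compose/docker-compose.yml with the follower pointed at http://minio:9000 and isSecure: false):

services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"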

LewisSDAFNI commented 1 month ago

I can investigate this approach first (the one you linked to around "using wrong domain..."), but first let me understand the suggestion. Would it be worth searching through the code in the docker-compose folder for where the s3Client is defined and ensuring it is coded for the URL I'm using? I'm not an avid Go coder, but I should be able to attempt this.

A simple restart of the docker processes and a prune of resources (nothing else is using this VM) still fails with the following output.

[+] Running 4/2
 ✔ Network docker-compose_default     Created                                                                                                                                                                                          0.3s
 ✔ Container docker-compose-redis-1   Created                                                                                                                                                                                          0.1s
 ✔ Container docker-compose-worker-1  Created                                                                                                                                                                                          0.0s
 ✔ Container docker-compose-agent-1   Created                                                                                                                                                                                          0.0s
Attaching to agent-1, redis-1, worker-1
redis-1   | 1:C 30 Jul 2024 10:09:04.474 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis-1   | 1:C 30 Jul 2024 10:09:04.474 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis-1   | 1:C 30 Jul 2024 10:09:04.474 * Redis version=7.2.5, bits=64, commit=00000000, modified=0, pid=1, just started
redis-1   | 1:C 30 Jul 2024 10:09:04.474 * Configuration loaded
redis-1   | 1:M 30 Jul 2024 10:09:04.474 * monotonic clock: POSIX clock_gettime
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * Running mode=standalone, port=6379.
redis-1   | 1:M 30 Jul 2024 10:09:04.475 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * Server initialized
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * Reading RDB base file on AOF loading...
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * Loading RDB produced by version 7.2.5
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * RDB age 411945 seconds
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * RDB memory usage when created 0.83 Mb
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * RDB is base AOF
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * Done loading RDB, keys loaded: 0, keys expired: 0.
redis-1   | 1:M 30 Jul 2024 10:09:04.475 * DB loaded from base file appendonly.aof.1.base.rdb: 0.000 seconds
redis-1   | 1:M 30 Jul 2024 10:09:04.476 * DB loaded from incr file appendonly.aof.1.incr.aof: 0.000 seconds
redis-1   | 1:M 30 Jul 2024 10:09:04.476 * DB loaded from append only file: 0.000 seconds
redis-1   | 1:M 30 Jul 2024 10:09:04.476 * Opening AOF incr file appendonly.aof.1.incr.aof on server start
redis-1   | 1:M 30 Jul 2024 10:09:04.476 * Ready to accept connections tcp
worker-1  | {"level":"info","time":"2024-07-30T10:09:05Z","message":"app config: reading default common config"}
worker-1  | {"level":"info","time":"2024-07-30T10:09:05Z","message":"app config: override with: worker_default_cfg"}
worker-1  | {"level":"info","time":"2024-07-30T10:09:05Z","message":"app config: override with: config/config.yaml"}
worker-1  | {"level":"info","time":"2024-07-30T10:09:05Z","message":"app config: override with: config/override.yaml"}
worker-1  | 2024-07-30T10:09:05Z INF ../build/service/worker/server.go:62 > app starting... app=worker app_id=cqkbngedpdu24p8o4gbg commit="not set" version=development
worker-1  | 2024-07-30T10:09:05Z INF ../build/service/worker/server.go:80 > app redis connected app=worker app_id=cqkbngedpdu24p8o4gbg
worker-1  | {"level":"error","error":"unable to create client for storage \"follower\": s3 is offline: The request signature we calculated does not match the signature you provided. Check your key and signing method.","time":"2024-07-30T10:09:05Z","message":"critical error. Shutdown application"}
agent-1   | {"level":"info","time":"2024-07-30T10:09:05Z","message":"app config: reading default common config"}
agent-1   | {"level":"info","time":"2024-07-30T10:09:05Z","message":"app config: override with: agent_default_cfg"}
agent-1   | {"level":"info","time":"2024-07-30T10:09:05Z","message":"app config: override with: config/config.yaml"}
agent-1   | {"level":"warn","time":"2024-07-30T10:09:05Z","message":"app config: no config file \"config/override.yaml\""}
agent-1   | 2024-07-30T10:09:05Z INF ../build/service/agent/server.go:49 > app starting... app=agent app_id=cqkbng858gblg9ch8n70 commit="not set" version=development
agent-1   | 2024-07-30T10:09:05Z INF ../build/service/agent/server.go:58 > redis app pool stats app=agent app_id=cqkbng858gblg9ch8n70 redis_pool={"Hits":0,"IdleConns":0,"Misses":0,"StaleConns":0,"Timeouts":0,"TotalConns":0} redis_pool_size=20
agent-1   | 2024-07-30T10:09:05Z INF ../build/service/agent/server.go:69 > app redis connected app=agent app_id=cqkbng858gblg9ch8n70
agent-1   | 2024-07-30T10:09:05Z INF ../build/service/agent/server.go:73 > redis conf pool stats app=agent app_id=cqkbng858gblg9ch8n70 redis_pool={"Hits":0,"IdleConns":1,"Misses":1,"StaleConns":0,"Timeouts":0,"TotalConns":1} redis_pool_size=20
agent-1   | 2024-07-30T10:09:05Z INF ../build/service/agent/server.go:82 > redis queue pool stats app=agent app_id=cqkbng858gblg9ch8n70 redis_pool_size=0
agent-1   | 2024-07-30T10:09:05Z INF ../build/service/agent/server.go:104 > agent created app=agent app_id=cqkbng858gblg9ch8n70
agent-1   | 2024-07-30T10:09:05Z INF ../build/pkg/util/serve.go:60 > server: start serving 2 workers app=agent app_id=cqkbng858gblg9ch8n70
agent-1   | 2024-07-30T10:09:05Z INF ../build/pkg/util/serve.go:97 > server: start serving app=agent app_id=cqkbng858gblg9ch8n70
agent-1   | 2024-07-30T10:09:05Z INF ../build/pkg/util/serve.go:67 > server: starting worker "agent_http" app=agent app_id=cqkbng858gblg9ch8n70
agent-1   | 2024-07-30T10:09:05Z INF ../build/pkg/util/serve.go:67 > server: starting worker "agent_request_reply" app=agent app_id=cqkbng858gblg9ch8n70
worker-1 exited with code 1

I have noted there are warnings from the redis component (this has always been the case), but these didn't appear to be stopping any processes, and so we settled on addressing them when we come to the production setup (where more data will be in use).

arttor commented 1 month ago

As I can see from the log, the problem is that the chorus worker cannot execute a healthcheck against the configured s3:

worker-1 | {"level":"error","error":"unable to create client for storage \"follower\": s3 is offline: The request signature we calculated does not match the signature you provided. Check your key and signing method.","time":"2024-07-30T10:09:05Z","message":"critical error. Shutdown application"}

And the error says that the s3 auth signature is invalid.

Worker s3 credentials for docker-compose should be defined in docker-compose/s3-credentials.yaml.
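
That file mirrors the storage config structure, roughly like this (a sketch; storage names, user names, and keys are placeholders):

storage:
  storages:
    main:
      credentials:
        user1:
          accessKeyID: <access-key>
          secretAccessKey: <secret-key>
    follower:
      credentials:
        user1:
          accessKeyID: <access-key>
          secretAccessKey: <secret-key>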

If you want to log the exact values, you can add something like fmt.Println(conf) to this line.

LewisSDAFNI commented 1 month ago

Thank you very much for your help, @arttor; I've been able to diagnose the issue.

With the breadth and wonders of the internet, my typo can now be immortalised. (The issue was that I had written incorrect credentials in the s3-credentials.yaml.)

The suggestion of fmt.Println(conf) was the saviour for me; after rebuilding the image, it was able to print the configuration and show which minio instance was failing to connect and the credentials it was attempting to use. Hopefully this will be the end of my worries for now. Thanks again!

arttor commented 1 month ago

I was thinking about an option to print the config at startup, but in that case I would mask all credentials for security reasons anyway.

LewisSDAFNI commented 1 month ago

I am currently tinkering with the configuration to keep the credentials in the s3-credentials.yaml file safe. However, if only root can view the file, docker seems unable to read the configuration.

What are the current best-practice suggestions for keeping the credential files secure and not easily readable?

arttor commented 1 month ago

Please check out the config docs. In the end, you need to provide the config as a yaml file or as envars to the chorus binaries. You can override any yaml option with an envar like this: CFG_REDIS_ADDRESS='127.0.0.1:6379' - the envar name should start with CFG_ followed by the yaml property path separated by _.

If you deploy chorus to k8s with the helm chart, it will use a k8s secret to store the credentials from the configs.
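
For the docker-compose setup, one way to keep the secrets out of the yaml (a sketch; the envar path mirrors the yaml structure above, and the storage and user names are assumptions) is a .env file with restricted permissions:

# .env (chmod 600)
CFG_STORAGE_STORAGES_FOLLOWER_CREDENTIALS_USER1_ACCESSKEYID=<access-key>
CFG_STORAGE_STORAGES_FOLLOWER_CREDENTIALS_USER1_SECRETACCESSKEY=<secret-key>

referenced from docker-compose.yml:

services:
  worker:
    env_file: .env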

LewisSDAFNI commented 1 month ago

Thank you one more time, @arttor; it has again solved my issues. I've created a .env file and linked it in via the docker-compose.yaml file to bring in those envars; then I can set permissions on this .env as required.

I'm going to close this issue now as you've been able to solve all my issues.