Lucretius / vault_raft_snapshot_agent

⛔️ DEPRECATED ⛔️ An agent which provides periodic snapshotting capabilities of Vault's Raft backend
MIT License

Kubernetes #9

Open bonyfusolia opened 3 years ago

bonyfusolia commented 3 years ago

Hi,

Can we run this agent inside a Kubernetes cluster?

Lucretius commented 3 years ago

I have not actually tried doing this, so I am not sure. Feel free to give it a shot!

devops-42 commented 2 years ago

Hi @luanphantiki

Thanks for your PR.

I was able to build the container image from the Dockerfile and created a deployment configuration for a k8s cluster. Vault is configured with a corresponding AppRole, and the snapshot.json is mounted into the Pod at the expected path. When starting the Pod, the log output shows:

Not running on leader node, skipping.

Question: Is it necessary to run the container as a side-car to each existing k8s vault pod or is it possible to tell the snapshotter to detect the leader?

Thanks for your help! Cheers.

luanphantiki commented 2 years ago

@devops-42

Not running on leader node, skipping.

-> This message shows that it came from a follower pod. How many Vault pods do you have? Let's focus on the leader pod's logs.

Question: Is it necessary to run the container as a side-car to each existing k8s vault pod or is it possible to tell the snapshotter to detect the leader?

-> It's not required to run it as a sidecar; you can use a separate Kubernetes Deployment with the correct value, e.g. "addr":"http://vault-leader.svc:8200" (see the sketch below).
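
For illustration, a minimal standalone Deployment could look like the sketch below. The image name, Secret name, and mount path are placeholders rather than anything this repo ships, and the agent is assumed to take the config path as its only argument (as in the invocations later in this thread). If you deploy Vault with the official vault-helm chart, it typically exposes a <release>-active Service that always resolves to the current leader, which is a good target for addr.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vault-snapshot-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vault-snapshot-agent
  template:
    metadata:
      labels:
        app: vault-snapshot-agent
    spec:
      containers:
        - name: snapshot-agent
          image: vault-raft-snapshot-agent:latest   # placeholder; build from the repo's Dockerfile
          args: ["/etc/snapshot/snapshot.json"]     # the agent takes the config path as its argument
          volumeMounts:
            - name: snapshot-config
              mountPath: /etc/snapshot
              readOnly: true
      volumes:
        - name: snapshot-config
          secret:
            secretName: vault-snapshot-config       # holds snapshot.json, which contains secret_id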

Btw, sharing your snapshot.json file would help me understand your setup.

devops-42 commented 2 years ago

@luanphantiki

Thanks for the clarification. I changed the address to the internal service address of the leader pod (I have a cluster of 3 pods deployed). Now the snapshot agent attempts a snapshot; the logs say:

2021/10/26 08:03:41 Reading configuration...
2021/10/26 08:04:41 Unable to generate snapshot context deadline exceeded (Client.Timeout or context cancellation while reading body)

My snapshot.json is:

{
   "addr":"http://leader-adress:8200,
   "retain":72,
   "frequency":"3600s",
   "role_id": "***",
   "secret_id":"***",
   "aws_storage":{
      "access_key_id":"***",
      "secret_access_key":"***",
      "s3_region":"us-east-1",
      "s3_bucket":"**bucket**",
      "s3_endpoint":"**s3_endpoint**",
      "s3_force_path_style":true
   }
}

What could be wrong here?

luanphantiki commented 2 years ago

@devops-42 can you try updating addr from:

 "addr":"http://leader-adress:8200,

to:

 "addr":"http://leader-adress:8200",

If the issue stays the same, validate connectivity by opening a shell in the backup pod and running:

curl -Ik http://vault-leader:8200

And show me the output.

devops-42 commented 2 years ago

My bad, when cleaning up the config file I accidentally deleted the closing "

Concerning the curl call: I got a 307 status code:

HTTP/1.1 307 Temporary Redirect
Cache-Control: no-store
Content-Type: text/html; charset=utf-8
Location: /ui/
Date: Tue, 26 Oct 2021 09:07:30 GMT

luanphantiki commented 2 years ago

@devops-42 Good, the 307 redirect to /ui/ means Vault itself is reachable. Then finally, make sure that you're using Raft as the storage backend, is that correct? Can you show your Vault config?
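
For comparison, a Raft backend is declared in Vault's server config with a stanza along these lines (path and node_id are just example values):

storage "raft" {
  path    = "/vault/data"
  node_id = "vault-0"
}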

devops-42 commented 2 years ago

@luanphantiki

I do use Raft as storage; here's a redacted output of the vault status command:

Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            5
Threshold               3
Version                 1.8.1
Storage Type            raft
Cluster Name            vault-cluster-******
Cluster ID              ********-****-****-****-************
HA Enabled              true
HA Cluster              https://***********:8201
HA Mode                 active
Active Since            YYYY-MM-DDTHH:MM:SS.123456789Z
Raft Committed Index    *******
Raft Applied Index      *******

luanphantiki commented 2 years ago

@devops-42 Alright, let's rerun the backup pod. Does it work? You should tail Vault's logs to see if there is any clue.
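
Something along these lines should do; pod name and namespace are placeholders for your setup:

kubectl -n vault logs -f vault-0 | grep storage.raft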

devops-42 commented 2 years ago

@luanphantiki

It seems that the pod can connect to the Vault leader pod; the log output of the leader is as follows:

2021-10-26T09:19:43.036Z [INFO]  storage.raft: starting snapshot up to: index=*******
2021-10-26T09:19:43.045Z [INFO]  storage.raft: compacting logs: from=******* to=*******
2021-10-26T09:19:43.081Z [INFO]  storage.raft: snapshot complete up to: index=*******

But when checking the local filesystem of the Pod (which has a PVC attached), no snapshot file has been created.

Any chance to enable more debugging in the backup pod?

luanphantiki commented 2 years ago

@devops-42: Can you check your S3 bucket? Any new output from the backup pod?

devops-42 commented 2 years ago

@luanphantiki

The backup pod error message stays the same. I could successfully connect from the backup pod to the S3 endpoint (we use MinIO) via nc:

ip.add.ress.minio (ip.add.ress.minio:9000) open
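
For reference, that output came from a probe along the lines of the following; -z only checks that the port accepts connections:

nc -zv ip.add.ress.minio 9000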

So I assume that my network setup is correct.

luanphantiki commented 2 years ago

@devops-42: I haven't tried MinIO with this project and I'm not sure whether the current lib (https://github.com/aws/aws-sdk-go/tree/main/service/s3/s3manager) supports MinIO. @Lucretius, can you please confirm that?

Anyway, as a workaround, I suggest replacing the MinIO S3 config in snapshot.json with the local_storage directive so the backup pod writes snapshots to the local filesystem:

{
...
"local_storage": {
  "path": "/path/to/pvc/"
}
...
}

devops-42 commented 2 years ago

@luanphantiki

First of all, thanks for your patience :)

I started a debug pod to play around with the configuration and the binary, and tried to perform a backup using this (redacted) configuration:

{
   "addr":"http://vault-leader:8200",
   "retain":72,
   "frequency":"3600s",
   "role_id": "******",
   "secret_id":"******",
   "local_storage":{
    "path": "/tmp"
   }
}

The config file is located at /tmp/snapshot.json. I started the snapshotter:

~ $ /vault_raft_snapshot_agent /tmp/snapshot.json 
2021/10/26 09:49:23 Reading configuration...
2021/10/26 09:50:23 Unable to generate snapshot context deadline exceeded (Client.Timeout or context cancellation while reading body)

The corresponding log output of the Vault leader is:

2021-10-26T09:49:23.857Z [INFO]  storage.raft: starting snapshot up to: index=*******
2021-10-26T09:49:23.862Z [INFO]  storage.raft: compacting logs: from=******* to=*******
2021-10-26T09:49:23.870Z [INFO]  storage.raft: snapshot complete up to: index=*******

Seems to be an issue with the communication with the Vault leader; note that the error appears exactly 60 seconds after the start.

luanphantiki commented 2 years ago

@devops-42: Agreed, it failed at the snapshot step.

devops-42 commented 2 years ago

@luanphantiki Ok. Is there by any chance a way to get more information on why this step fails? It looks like a timeout: after 60 seconds the snapshot attempt aborts with that error.

luanphantiki commented 2 years ago

@devops-42 unfortunately, this error is returned by the vault-api SDK; there are no more details to see.

I have also reproduced your configuration on my side and there is no issue:

/ # vi backup.json
/ # /vault_raft_snapshot_agent backup.json
2021/10/26 11:14:07 Reading configuration...
2021/10/26 11:14:07 Successfully created local snapshot to /tmp/raft_snapshot-1635246847792284547.snap

/ # cat backup.json
{
   "addr":"http://vault-leader.svc:8200",
   "retain":72,
   "frequency":"3600s",
   "role_id": "*******",
   "secret_id":"***********",
   "local_storage":{
    "path": "/tmp"
   }
}
/ # du -sh /tmp/raft_snapshot-1635246847792284547.snap
24.0K   /tmp/raft_snapshot-1635246847792284547.snap

luanphantiki commented 2 years ago

@luanphantiki Ok. Is there by any chance a way to get more information on why this step fails? It looks like a timeout: after 60 seconds the snapshot attempt aborts with that error.

-> It seems to be simply a connectivity issue, though.

devops-42 commented 2 years ago

@luanphantiki

The problem could be related to the size of the vault.db: my vault.db file is currently over 2 GB. I checked whether there's a timeout issue when creating the snapshot via curl:

curl --header "X-Vault-Token: ..." --request GET http://vault-leader:8200/v1/sys/storage/raft/snapshot > /tmp/raft.snap
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1468M    0 1468M    0     0  8610k      0 --:--:--  0:02:54 --:--:-- 10.1M
$ ls -lh /tmp
total 1.5G
-rw-rw-rw-. 1 1000730000 root 1.5G Oct 26 11:25 raft.snap

The transfer above took almost three minutes, well over the 60-second cutoff.

@Lucretius Does the snapshot call in vault_raft_snapshot_agent have any built-in timeout?
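
For what it's worth, the 60-second cutoff matches the default client timeout in HashiCorp's Go API client: api.DefaultConfig() sets Timeout to 60 seconds. If the agent builds its client from that default, raising the timeout might look roughly like this sketch against the vault/api package (this is not the agent's actual code; newClient is a hypothetical helper):

package main

import (
	"log"
	"time"

	vaultapi "github.com/hashicorp/vault/api"
)

// newClient builds a Vault client whose client-side timeout is large
// enough for streaming multi-GB Raft snapshots.
func newClient(addr string) (*vaultapi.Client, error) {
	cfg := vaultapi.DefaultConfig() // Timeout defaults to 60s here
	cfg.Address = addr

	client, err := vaultapi.NewClient(cfg)
	if err != nil {
		return nil, err
	}
	// The 2 GB snapshot above took ~3 minutes to stream, so give it headroom.
	client.SetClientTimeout(15 * time.Minute)
	return client, nil
}

func main() {
	if _, err := newClient("http://vault-leader:8200"); err != nil {
		log.Fatal(err)
	}
}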

luanphantiki commented 2 years ago

@devops-42 that seems to be the actual issue, but we should move this conversation to #20

devops-42 commented 2 years ago

@luanphantiki You're absolutely right. Thx for your help!