bonyfusolia opened 3 years ago
I have not actually tried doing this, so I am not sure. Feel free to give it a shot!
Hi @luanphantiki,
thanks for your PR.
I could build the container image from the Dockerfile and have created a deployment configuration for a k8s cluster. Vault is configured with a corresponding AppRole, and the snapshot.json is mounted to the pod at the expected path.
When the pod starts, the log output shows:
```
Not running on leader node, skipping.
```
Question: Is it necessary to run the container as a side-car to each existing k8s vault pod or is it possible to tell the snapshotter to detect the leader?
Thanks for your help! Cheers.
@devops-42
Not running on leader node, skipping.
-> This message shows that it came from a follower pod. How many Vault pods do you have? Let's focus on the leader pod's logs.
Question: Is it necessary to run the container as a side-car to each existing k8s vault pod or is it possible to tell the snapshotter to detect the leader?
-> It's not required to run it as a side-car; you can use a separate Kubernetes deployment with the correct address, e.g. "addr":"http://vault-leader.svc:8200"
Btw, sharing your snapshot.json file would help me understand what you have.
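As a side note on detecting the leader without a side-car: Vault's unauthenticated sys/leader endpoint reports the active node's address, so a standalone deployment could discover the leader itself. This is a sketch, not something the agent does today; the hostname is a placeholder, and a canned response stands in for the live call so the parsing is self-contained:

```shell
#!/bin/sh
# Sketch: find the active Vault node via the unauthenticated sys/leader
# endpoint. On a live cluster you would fetch the JSON with e.g.:
#   response=$(curl -s http://vault-internal:8200/v1/sys/leader)
# Here we use a canned response (hostname is a placeholder) so the
# parsing step can be shown end to end.
response='{"ha_enabled":true,"is_self":false,"leader_address":"http://vault-0.vault-internal:8200"}'

# Extract leader_address without jq, using plain sed.
leader=$(printf '%s' "$response" | sed -n 's/.*"leader_address":"\([^"]*\)".*/\1/p')
echo "$leader"
```

If you deploy Vault with the official Helm chart, it's also worth checking whether its active-only service (the chart can create one that routes just to the leader) is enabled, which makes leader detection unnecessary.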
@luanphantiki
Thanks for the clarification. I changed the address to the internal svc address of the leader pod (I have a cluster of 3 pods deployed). Now the snapshot provider tries to perform something; the logs say:
```
2021/10/26 08:03:41 Reading configuration...
2021/10/26 08:04:41 Unable to generate snapshot context deadline exceeded (Client.Timeout or context cancellation while reading body)
```
My snapshot.json is:
```json
{
  "addr":"http://leader-adress:8200,
  "retain":72,
  "frequency":"3600s",
  "role_id": "***",
  "secret_id":"***",
  "aws_storage":{
    "access_key_id":"***",
    "secret_access_key":"***",
    "s3_region":"us-east-1",
    "s3_bucket":"**bucket**",
    "s3_endpoint":"**s3_endpoint**",
    "s3_force_path_style":true
  }
}
```
What could be wrong here?
@devops-42 can you try to update addr from:
```
"addr":"http://leader-adress:8200,
```
to:
```
"addr":"http://leader-adress:8200",
```
If the issue stays the same, try to validate connectivity by opening a shell in the backup pod (kubectl exec) and running:
```
curl -Ik http://vault-leader:8200
```
And show me the output.
My bad, when cleaning up the config file I accidentally deleted the closing ".
Concerning the curl call: I got a 307 status code:
```
HTTP/1.1 307 Temporary Redirect
Cache-Control: no-store
Content-Type: text/html; charset=utf-8
Location: /ui/
Date: Tue, 26 Oct 2021 09:07:30 GMT
```
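For what it's worth, a 307 to /ui/ just means Vault answered and redirected to the browser UI, so basic connectivity is fine. A more telling probe is the unauthenticated sys/health endpoint, whose HTTP status code encodes the node state. The sketch below hard-codes an example status code so the mapping runs as-is; on a real pod you would fill `code` from curl as shown in the comment:

```shell
#!/bin/sh
# Vault's GET /v1/sys/health encodes node state in the HTTP status code
# (defaults): 200 active, 429 standby, 472 DR secondary, 501 uninitialized,
# 503 sealed. On a live pod you would obtain it with:
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://vault-leader:8200/v1/sys/health)
code=200   # example value standing in for a live probe
case "$code" in
  200) state="active" ;;
  429) state="standby" ;;
  472) state="dr-secondary" ;;
  501) state="uninitialized" ;;
  503) state="sealed" ;;
  *)   state="unknown" ;;
esac
echo "$state"
```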
@devops-42 Then, finally, make sure that you're using Raft as storage. Is that correct? Can you show your Vault config?
@luanphantiki
I do use Raft as storage. Here's a redacted output of the `vault status` command:
```
Key                      Value
---                      -----
Seal Type                shamir
Initialized              true
Sealed                   false
Total Shares             5
Threshold                3
Version                  1.8.1
Storage Type             raft
Cluster Name             vault-cluster-******
Cluster ID               ********-****-****-****-************
HA Enabled               true
HA Cluster               https://***********:8201
HA Mode                  active
Active Since             YYYY-MM-DDTHH:MM:SS.123456789Z
Raft Committed Index     *******
Raft Applied Index       *******
```
@devops-42 Alright, let's rerun the backup pod. Does it work? You should tail Vault's logs to see if there is any clue.
@luanphantiki
It seems that the pod can connect to the leader pod of Vault; the log output of the leader is as follows:
```
2021-10-26T09:19:43.036Z [INFO] storage.raft: starting snapshot up to: index=*******
2021-10-26T09:19:43.045Z [INFO] storage.raft: compacting logs: from=******* to=*******
2021-10-26T09:19:43.081Z [INFO] storage.raft: snapshot complete up to: index=*******
```
But when checking the local filesystem of the pod (which has a PVC attached), no snapshot file has been created.
Any chance to configure more debugging in the backup pod?
@devops-42: Can you check your S3? Any new output from the backup pod?
@luanphantiki
The backup pod error message stays the same. I could successfully connect from the backup pod to the S3 endpoint (we use MinIO) via `nc`:
```
ip.add.ress.minio (ip.add.ress.minio:9000) open
```
So I assume that my network setup is correct.
@devops-42: I haven't tried MinIO on this project and I'm not sure whether the current lib (https://github.com/aws/aws-sdk-go/tree/main/service/s3/s3manager) supports it. @Lucretius, can you please confirm that?
Anyway, I guess you could replace the MinIO S3 config in snapshot.json with the local_storage directive, so the backup pod writes to the local file system as a workaround:
```json
{
  ...
  "local_storage": {
    "path": "/path/to/pvc/"
  }
  ...
}
```
@luanphantiki
first of all, thanks for your patience :)
I started a debug pod to play around with configuration and the binary. Tried to perform a backup using this (redacted) configuration:
```json
{
  "addr":"http://vault-leader:8200",
  "retain":72,
  "frequency":"3600s",
  "role_id": "******",
  "secret_id":"******",
  "local_storage":{
    "path": "/tmp"
  }
}
```
The config file is located at /tmp/snapshot.json. I started the snapshotter:
```
~ $ /vault_raft_snapshot_agent /tmp/snapshot.json
2021/10/26 09:49:23 Reading configuration...
2021/10/26 09:50:23 Unable to generate snapshot context deadline exceeded (Client.Timeout or context cancellation while reading body)
```
The corresponding log output of the Vault leader is:
```
2021-10-26T09:49:23.857Z [INFO] storage.raft: starting snapshot up to: index=*******
2021-10-26T09:49:23.862Z [INFO] storage.raft: compacting logs: from=******* to=*******
2021-10-26T09:49:23.870Z [INFO] storage.raft: snapshot complete up to: index=*******
```
Seems to be an issue with the communication with the Vault leader.
@devops-42: Agreed, it failed at the snapshot step.
@luanphantiki Ok. Is there by any chance a possibility to get more information about why this step fails? It seems to be a timeout: after 60 seconds the snapshot attempt aborts with that error.
@devops-42 Unfortunately, this part is returned by the vault-api SDK; there are no more details to see.
I have also reproduced your configuration on my side and there is no issue:
```
/ # vi backup.json
/ # /vault_raft_snapshot_agent backup.json
2021/10/26 11:14:07 Reading configuration...
2021/10/26 11:14:07 Successfully created local snapshot to /tmp/raft_snapshot-1635246847792284547.snap
/ # cat backup.json
{
  "addr":"http://vault-leader.svc:8200",
  "retain":72,
  "frequency":"3600s",
  "role_id": "*******",
  "secret_id":"***********",
  "local_storage":{
    "path": "/tmp"
  }
}
/ # du -sh /tmp/raft_snapshot-1635246847792284547.snap
24.0K    /tmp/raft_snapshot-1635246847792284547.snap
```
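A quick sanity check on the resulting file: to my understanding a raft snapshot is a gzip-compressed archive, so its first two bytes should be the gzip magic number 0x1f 0x8b. The sketch below fabricates a tiny gzip file so the check is self-contained; on a real pod you would point it at the actual raft_snapshot-*.snap path instead:

```shell
#!/bin/sh
# Check for the gzip magic bytes (1f 8b) at the start of a file.
# We create a throwaway gzip file here so the example runs anywhere;
# substitute your real snapshot path on the pod.
printf 'hello' | gzip > /tmp/example.snap
magic=$(od -An -tx1 -N2 /tmp/example.snap | tr -d ' ')
echo "$magic"
```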
@luanphantiki Ok. Is there by any chance a possibility to get more information why this step fails? It seems to be a timeout value, after 60 secs the snapshot attempt aborts with that error.
-> Simply a connectivity issue, though.
@luanphantiki
The problem could be related to the size of the vault.db. My vault.db file is currently over 2 GB. I checked whether there's a timeout issue when creating the snapshot via `curl`:
```
curl --header "X-Vault-Token: ..." --request GET http://vault-leader:8200/v1/sys/storage/raft/snapshot > /tmp/raft.snap
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1468M    0 1468M    0     0  8610k      0 --:--:--  0:02:54 --:--:-- 10.1M

$ ls -lh /tmp
total 1.5G
-rw-rw-rw-. 1 1000730000 root 1.5G Oct 26 11:25 raft.snap
```
@Lucretius Does the command from the vault_raft_snapshot_agent have any built-in timeout?
@devops-42 seems to be a valid issue, but we should move this conversation to #20
@luanphantiki You're absolutely right. Thx for your help!
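One more data point on the 60-second timeout, with a loud caveat: this is an assumption about the agent's internals, not verified against this repo. The Vault Go API client created via api.DefaultConfig() defaults to a 60s HTTP timeout and honors the VAULT_CLIENT_TIMEOUT environment variable. If the agent builds its client that way, raising the variable before starting it may let large snapshots finish:

```shell
#!/bin/sh
# ASSUMPTION: the agent builds its Vault client via api.DefaultConfig(),
# which reads VAULT_CLIENT_TIMEOUT (default 60s). If so, exporting a
# larger value before launch raises the snapshot download timeout.
export VAULT_CLIENT_TIMEOUT=600   # seconds; generous for a ~1.5 GB snapshot
echo "$VAULT_CLIENT_TIMEOUT"
# /vault_raft_snapshot_agent /tmp/snapshot.json
```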
Hi,
Can we run this agent inside a Kubernetes cluster?