Open stefansundin opened 5 years ago
@stefansundin I think that I have run into the same issue as you regarding the flannel issue when restoring and I haven't been able to recover my cluster manually. I like this improvement you have made here and I am probably going to switch to your branch.
You mentioned that you have made some other improvements but have not incorporated them into this PR. Would you be keen to push those improvements to a branch on your repository so that I could have a look? I think that it would benefit a lot of people.
Thanks!
"Nice" to see that someone else had the same problem. I thought that I was doing something wrong, and I think that contributed to it taking so long for me to figure out the actual problem. I don't think you can recover your cluster, start over with a proper pki backup. :)
As for my other changes, I used this project for inspiration, and integrated similar Terraform code into an existing codebase that I have. So the code that I use has never actually been compatible with this project. I backported the important changes for the purposes of this PR, but the other enhancements that I made are described in the bullet list above.
I guess some other changes I've made are:
- Instead of storing the pki backup at `s3://${s3bucket}/pki.tar.xz`, I store it at `s3://${s3bucket}/pki/${clustername}.tar.xz`, and the same with the etcd backups.
- I mount the ephemeral storage through `/etc/fstab` with a `UUID` parameter. I have not tested the m1.medium, but from what I can discern, it automatically mounts the ephemeral storage. On r5d, this does not appear to be the case, and I have to mount it manually. My code is not very portable and looks for the device based on size, so I should improve it and share it. On r5d and similar instances, the NVMe devices in `/dev/` may change order when you reboot, so it is important to use the UUID here.

There are also many things that I have not tested yet. For example, I am not using Helm yet. I am not using `--cloud-provider=aws` either, since so far I don't rely on EBS volumes for persistence (I will soon though). I am trying to start from scratch and use this project as a guide, in order to learn as much as possible. This project has been very useful and educational, so a big thanks to all the authors and contributors.
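For anyone who wants to try the two changes listed above, here is a minimal sketch of what they could look like. It is not the code from my setup: the function names, mount point, filesystem type, and device name are all hypothetical, and only the key layout and the use of a UUID come from the description above.

```shell
#!/bin/sh
# Sketch only: cluster name, mount point, and filesystem type are hypothetical.

# Per-cluster object key, e.g. pki/<clustername>.tar.xz.
backup_key() {
  printf '%s/%s.tar.xz\n' "$1" "$2"
}

# Format an /etc/fstab line that refers to the filesystem by UUID rather than
# by device name, so a reordering of /dev/nvme* across reboots does not matter.
# "nofail" lets the boot continue if the volume is missing.
fstab_entry() {
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "$3"
}

# Example (as root; the device name is an assumption):
#   uuid=$(blkid -s UUID -o value /dev/nvme1n1)
#   fstab_entry "$uuid" /mnt/ephemeral ext4 >> /etc/fstab
```

Finding the device by something more robust than its size is still left open here.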
Hi there. I have used parts of this project and modified it beyond recognition.
But I felt it necessary to contribute back some of my improvements. Mostly because of the pki issue I encountered and filed here: https://github.com/cablespaghetti/kubeadm-aws/issues/13
And also because the S3 storage costs can be lowered dramatically by compressing the backups first, and by using versioning instead of timestamps in the filename to keep older backups. This way, a user who wants to minimize costs can simply leave versioning disabled. There is no need to keep more than the latest etcd snapshot around, and no need to back up the pki data more than once.
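A rough sketch of the idea, assuming a fixed object key per backup. The bucket name, key, and paths below are hypothetical and this is not the exact code in the PR. The `AWS` variable is only there so the upload can be dry-run with `AWS=echo`.

```shell
#!/bin/sh
# Sketch only: bucket name, key, and paths are hypothetical.
AWS="${AWS:-aws}"  # override with AWS=echo to print the command instead of running it

# Compress a directory into a .tar.xz archive (requires tar with xz support).
compress_dir() {
  src=$1 archive=$2
  tar -C "$src" -cJf "$archive" .
}

# Upload one file to a fixed key. With bucket versioning enabled, S3 keeps
# earlier uploads as previous versions of the same object; with it disabled,
# each upload overwrites the last one, so only the latest backup is stored.
upload_backup() {
  file=$1 bucket=$2 key=$3
  "$AWS" s3 cp "$file" "s3://$bucket/$key"
}

# Example:
#   compress_dir /etc/kubernetes/pki /tmp/pki.tar.xz
#   upload_backup /tmp/pki.tar.xz my-bucket pki.tar.xz
# To keep older backups, versioning can be enabled on the bucket:
#   aws s3api put-bucket-versioning --bucket my-bucket \
#     --versioning-configuration Status=Enabled
```

Because the key never changes, restoring never has to search a listing for the newest object.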
And not to mention, the restoration code had a bug where an outdated snapshot would be restored if backups were taken more frequently. That is a huge bug, although it won't be encountered by most people using this project. 1000 objects are returned by default from `aws s3api list-objects`, and about 700 objects would be created when taking backups every 15 minutes and deleting them after 7 days. The most recent backup is at the very end of the list-objects API response, since it's always sorted alphabetically.

In my opinion, the `backup-enabled` variable should be removed, and then `versioned-bucket` can be used to save money. The cost of keeping a single backup around should be very low (especially now with compression). Self-healing should be a cornerstone of this project, and is impossible without a backup.

And please test these changes, since I copy-pasted them back from my changed version, and I haven't done a lot of testing on the backported version.
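To make the list-objects numbers above concrete, here is a small sketch of the arithmetic and of picking the newest timestamped key. The function names and key names are hypothetical, and it assumes the timestamps in the keys sort lexicographically.

```shell
#!/bin/sh
# Sketch only: key names are hypothetical and assume a sortable timestamp.

# Number of backups retained for a given interval (minutes) and retention (days).
objects_retained() {
  echo $(( $2 * 24 * 60 / $1 ))
}

# Read object keys on stdin, one per line, and print the one that sorts last.
latest_key() {
  LC_ALL=C sort | tail -n 1
}

# Example:
#   objects_retained 15 7   # 672, fits within a single 1000-object response
#   objects_retained 10 7   # 1008, the newest key no longer fits in the first 1000
```

With a single fixed key per backup, neither of these is needed, which is part of why I prefer that approach.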
I have made some other improvements that you may want to incorporate, but I didn't include them in this PR: