cablespaghetti / kubeadm-aws

Really cheap Kubernetes cluster on AWS with kubeadm

Backup improvements #14

Open stefansundin opened 5 years ago

stefansundin commented 5 years ago

Hi there. I have used parts of this project and modified it beyond recognition.

But I felt it necessary to contribute back some of my improvements. Mostly because of the pki issue I encountered and filed here: https://github.com/cablespaghetti/kubeadm-aws/issues/13

And also because the S3 storage costs can be lowered dramatically by compressing the backups first, and by using S3 versioning instead of timestamps in the filename to keep older backups. This way, a user who wants to minimize costs can simply leave versioning disabled. There is no need to keep more than the latest etcd snapshot around, and no need to back up the pki data more than once.
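To illustrate the idea, here is a minimal Python sketch (not the actual code in this PR): the snapshot is gzip-compressed and always uploaded to the same object key, so the bucket's versioning setting alone decides whether older copies are kept. The key name `etcd-snapshot.db.gz` and the function names are hypothetical, and the upload helper assumes boto3 and AWS credentials are available.

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical fixed object key: every backup overwrites the same key,
# and S3 versioning (if enabled on the bucket) retains the older copies.
SNAPSHOT_KEY = "etcd-snapshot.db.gz"


def compress_snapshot(src: Path, dest: Path) -> Path:
    """Gzip-compress the snapshot at src into dest and return dest."""
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def upload_snapshot(compressed: Path, bucket: str) -> None:
    """Upload the compressed snapshot to the fixed key (needs boto3)."""
    import boto3  # imported lazily so compress_snapshot works without it

    boto3.client("s3").upload_file(str(compressed), bucket, SNAPSHOT_KEY)
```

With versioning disabled, the bucket only ever holds one snapshot object. With it enabled, older versions pile up under the same key and can be expired with a lifecycle rule.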

Not to mention, the restoration code had a bug where an outdated snapshot would be restored if the backup frequency was increased. That is a huge bug, although it won't be encountered by most people using this project. 1000 objects are returned by default from aws s3api list-objects, and about 700 objects would be created when taking backups every 15 minutes and deleting them after 7 days. The most recent backup is at the very end of the list-objects API response, since it's always sorted alphabetically.
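The arithmetic and the failure mode can be sketched in Python. This is only an illustration under assumptions, not the project's restore code: it assumes timestamped keys whose alphabetical order matches their chronological order, and the function names are made up for the example. `list_pages` uses boto3's `list_objects_v2` paginator, which returns at most 1000 keys per page.

```python
from typing import Iterable, Iterator, Optional

PAGE_LIMIT = 1000  # maximum number of keys in a single listing response


def backups_retained(interval_minutes: int, retention_days: int) -> int:
    """Number of timestamped snapshots kept once retention is saturated."""
    return retention_days * 24 * 60 // interval_minutes


def latest_key(pages: Iterable[dict]) -> Optional[str]:
    """Return the alphabetically last key across every page of a listing."""
    latest = None
    for page in pages:
        for obj in page.get("Contents", []):
            if latest is None or obj["Key"] > latest:
                latest = obj["Key"]
    return latest


def list_pages(bucket: str, prefix: str) -> Iterator[dict]:
    """Yield every listing page for the prefix (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    yield from paginator.paginate(Bucket=bucket, Prefix=prefix)
```

Here `backups_retained(15, 7)` gives 672, which still fits in one page, while `backups_retained(5, 7)` gives 2016, which does not. Code that reads only the first page would then pick a snapshot that is not the newest, whereas `latest_key(list_pages(bucket, prefix))` walks every page.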

In my opinion, the backup-enabled variable should be removed, and then versioned-bucket can be used to save money. The cost of keeping a single backup around should be very low (especially now with compression). Self-healing should be a cornerstone of this project, and is impossible without a backup.

And please test these changes, since I copy-pasted them back from my changed version, and I haven't done a lot of testing on the backported version.

I have made some other improvements that you may want to incorporate, but I didn't include them in this PR:

kurtmc commented 5 years ago

@stefansundin I think I have run into the same flannel issue as you when restoring, and I haven't been able to recover my cluster manually. I like the improvement you have made here and I am probably going to switch to your branch.

You mentioned that you have made some other improvements but have not incorporated them into this PR. Would you be keen to push those improvements to a branch on your repository so that I could have a look? I think that it would benefit a lot of people.

Thanks!

stefansundin commented 5 years ago

"Nice" to see that someone else had the same problem. I thought that I was doing something wrong, and I think that contributed to it taking so long for me to figure out the actual problem. I don't think you can recover your cluster; you will have to start over with a proper pki backup. :)

As for my other changes, I used this project for inspiration, and integrated similar Terraform code into an existing codebase that I have. So the code that I use has never actually been compatible with this project. I backported the important changes for the purposes of this PR, but the other enhancements that I made are described in the bullet list above.

I guess some other changes I've made are:

There are also many things that I have not tested yet. For example, I am not using Helm yet. I am not using --cloud-provider=aws either, since so far I don't rely on EBS volumes for persistence (I will soon though). I am trying to start from scratch and use this project as a guide, in order to learn as much as possible. This project has been very useful and educational, so a big thanks to all the authors and contributors.