chef / knife-ec-backup

Backup and restore Chef Infra Server in a repository-compatible format
Apache License 2.0

Exceptions should be handled, especially where it makes sense to attempt a retry #19

Open strickra opened 10 years ago

strickra commented 10 years ago

We are seeing multiple unhandled exceptions, which make knife ec backup more fragile than it needs to be. The big one we keep hitting is that with --concurrency > 1, the backup runs for a few minutes and then aborts like so:

Created /cookbooks/rabbitmq-0.0.1
Created /cookbooks/nova-0.6.26/templates/default/dashboard.apache.erb
ERROR: internal server error
Response: #<Net::ReadAdapter:0x00000002fd3c28>

It never aborts on the same file, or after the same amount of time has elapsed, but it's usually between two and four minutes.

When we set --concurrency to 1, the backup ran fine for 22 hours, then encountered the following problem and aborted:

Created /acls/roles/build_slave.json
Created /acls/organization.json
Grabbing organization personal-darragh ...
Created /acls
Created /acls/groups
Created /acls/groups/billing-admins.json
Created /groups
Created /groups/billing-admins.json
ERROR: ArgumentError: Cannot sign the request without a client name, check that :node_name is assigned

Although there is some evidence that restarting a failed backup is supported (it seems to skip some already-downloaded content and update other objects where they have changed), the support is not complete:

Created /cookbooks/swift-0.0.19/templates/default/cron.d/swift-container-stats-log-creator.erb
Created /cookbooks/swift-0.0.19/templates/default/rsyslog.d/40-swift-object.conf.erb
Created /cookbooks/swift-0.0.19/files/default/systest/ring/account.builder
Created /cookbooks/swift-0.0.19/files/default/systest/ring/container.builder
ERROR: Errno::EEXIST: File exists - /home/strickra/projects/chef11/xfer/aw1/organizations/aw1-ops/cookbooks/icinga-0.3.8

Taken together, these failures are catastrophic.

strickra commented 10 years ago

We have resolved the "Cannot sign the request without a client name" error. It was due to an organization which had an Admins group with no users in it.

strickra commented 10 years ago

Here's a fun new iteration on the theme of busted chef orgs.

Grabbing organization testorg ...
Created /acls
Created /acls/groups
Created /acls/groups/billing-admins.json
Created /groups
Created /groups/billing-admins.json
Created /groups/admins.json
ERROR: ChefFS::FileSystem::OperationFailedError: HTTP error retrieving children: 403 "Forbidden"

Note: the 'admins' group has no ACLs, and thus no permissions.

strickra commented 10 years ago

Actually I'm not sure what to make of the above. There were other orgs that didn't have /acls/groups/admins.json but otherwise were backed up okay. In the process of using orgmapper to grant myself permission to see "testorg", we stopped being able to reproduce the error.

strickra commented 10 years ago

The Errno::EEXIST problem reported earlier can be worked around by doing this before restarting the backup:

find backup-dir/ -type d -empty -print | xargs rmdir
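
Longer term, the backup itself could presumably just tolerate directories left behind by an interrupted run. A minimal sketch of that idea in plain Ruby (this is only an illustration, not the actual knife-ec-backup code):

```ruby
# Sketch only: tolerate directories that already exist from an earlier backup attempt.
require 'fileutils'

def ensure_dir(path)
  Dir.mkdir(path)
rescue Errno::EEXIST
  # Directory is already there from a previous run; nothing to do.
end

# FileUtils.mkdir_p(path) has the same "already exists is fine" behaviour
# and also creates any missing parent directories.
```
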
stevendanna commented 10 years ago

I'm increasingly of the opinion that we should try to restore as much as possible even if a given org fails. If one org fails, we could swallow the error, move on to the next, and then print out a report at the end. Thoughts?
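
Roughly something like this; the names are made up, just to show the shape of the idea:

```ruby
# Sketch of the proposed behaviour: back up each org, collect failures,
# and report them at the end instead of aborting on the first one.
def backup_all(orgs)
  failures = {}
  orgs.each do |org|
    begin
      backup_org(org)            # backup_org is hypothetical; stands in for the real per-org work
    rescue StandardError => e
      failures[org] = e          # swallow the error and move on to the next org
    end
  end
  return if failures.empty?
  $stderr.puts 'The following organizations failed to back up:'
  failures.each { |org, e| $stderr.puts "  #{org}: #{e.class}: #{e.message}" }
end
```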

strickra commented 10 years ago

That might be helpful, though you mention restores and I was reporting problems specifically with backup. For backups that take 20+ hours to run to completion, the current workflow is definitely clunky: it takes a long time to reach an org with an error (like the ones I keep seeing where the admins group has no members for backup to switch to), the fix takes two minutes, then you wait a long time again to see whether it worked, wait longer still, and then discover another org is busted the same way and have to start over AGAIN. Having the backup move on to the next org and print a report at the end would certainly improve things here.

On the other hand, I wouldn't expect things like server timeouts or transient network transport/socket issues (like the run I had where a socket couldn't be opened locally, "cannot assign requested address") to skip the org; those should hang on and retry. Similarly with the EEXIST problem on empty directories, it seems like that should basically just be ignored.
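
Something along the lines of this retry wrapper is what I have in mind (made-up names, not the actual plugin code):

```ruby
# Sketch only: retry transient network failures with a short backoff
# instead of aborting the whole backup or skipping the org.
require 'timeout'

TRANSIENT_ERRORS = [Errno::EADDRNOTAVAIL,   # "cannot assign requested address"
                    Errno::ECONNRESET,
                    Timeout::Error].freeze

def with_retries(attempts: 5, delay: 2)
  tries = 0
  begin
    yield
  rescue *TRANSIENT_ERRORS
    tries += 1
    raise if tries >= attempts
    sleep(delay * tries)                    # back off a little more each retry
    retry
  end
end

# Usage (download_object is hypothetical):
#   with_retries { download_object(path) }
```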

jkeiser-oc commented 10 years ago

I agree. As much as possible, errors should be obvious (you should see them) but if we can move on and do more, we should.

stevendanna commented 9 years ago

This is still an issue even on the 2.0 refactor branch. I'm going to leave this issue open since I think we can probably make the situation better for the use case of long backups.