mpdude closed this issue 4 years ago
You could try running just the fsfreeze command in isolation and see if you can reproduce the issue outside of ec2-consistent-snapshot.
That works, as almost all snapshots do.
It's just one in 500 or a thousand that fails :-(
Wow, you've got some scale :) At that frequency, I wonder if there are some hiccups in the EBS storage system. It would be nice in that case if there was a more graceful solution than just hanging, though. At this point, I'm not sure what else ec2-consistent-snapshot could do to help. The Perl couldn't be freezing the file system; it must be fsfreeze getting stuck. Are you using ext4 or another filesystem?
No, ain’t that big. Just make snapshots every other hour on a dozen machines and wait a few weeks :-)
I am using xfs.
Is EBS really involved at that time? My assumption was that freezing the FS just puts a lock or something in place, nothing fancy that would even go out to EBS...
I recommend adding an unfreeze command that runs every time after the ec2-consistent-snapshot command completes. This way the file system gets thawed no matter what state it is in when the program terminates.
Looks like the command would be fsfreeze --unfreeze $mount_point. From local testing, it looks like we need to ignore the return value, as running --unfreeze on a non-frozen file system generates an error (but otherwise seems safe to do).
@ehammond were you suggesting that ec2-consistent-snapshot be patched along those lines, or that users should use this pattern in their own scripts?
What a simple yet effective solution, should have thought of that myself.
My understanding is that this needs to happen outside ec2-consistent-snapshot, as we're dealing with the (remotely possible?) case that the script fails in a way that does not call the cleanup/shutdown hooks.
As I had a wrapper script anyway that was called from cron, I simply added
/sbin/fsfreeze -u $MOUNTPOINT 2>/dev/null || true
@mpdude Looks good to me, although I'm more likely to spell out flags in scripts, like --unfreeze. Scripts are written once but read many times, and the extra clarity may help a future reader who doesn't immediately remember or recognize what the -u flag would do.
Thanks to all involved for your suggestions!
Yep. Just make sure the unfreeze gets run even if the previous program fails. For example, special care must be taken if it's a bash script with the -e flag, which exits on any error.
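The pattern discussed above can be sketched as a small wrapper function. This is only an illustration under assumptions: the function name `snapshot_then_unfreeze` and the `FSFREEZE` variable are hypothetical and not part of ec2-consistent-snapshot. The point is that the snapshot command's exit status is captured rather than allowed to abort a `set -e` script before the unfreeze runs.

```shell
#!/usr/bin/env bash
# Sketch of a cron wrapper helper: always attempt to thaw, even if the
# snapshot command fails. FSFREEZE and the function name are hypothetical.

FSFREEZE="${FSFREEZE:-/sbin/fsfreeze}"

# Usage: snapshot_then_unfreeze MOUNT_POINT COMMAND [ARGS...]
snapshot_then_unfreeze() {
    local mount_point="$1"
    shift
    local status=0

    # Capture the exit status instead of letting `set -e` end the script
    # before the unfreeze below has been attempted.
    "$@" || status=$?

    # Unfreezing a file system that is not frozen reports an error,
    # so the result is deliberately ignored.
    "$FSFREEZE" --unfreeze "$mount_point" 2>/dev/null || true

    # Pass the snapshot command's status on to the caller.
    return "$status"
}
```

A cron wrapper could then call something like `snapshot_then_unfreeze /vol ec2-consistent-snapshot ...` with its usual options. This does not help if the wrapper itself is killed or if fsfreeze hangs.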
I think we did all we can to catch the program exiting and unfreeze inside the program.
Reopening this as the problem has occurred twice since I added the additional unfreeze attempt in my invoking script :-(.
I've added an additional log message at the end of run_command so that we can hopefully see if the programs were executed successfully.
Like before, the last successful execution of ec2-consistent-snapshot terminated with no problems, including unfreezing the filesystem, unlocking MySQL etc. The next invocation logs sync and /sbin/fsfreeze -f /vol afterwards, where execution seems to stop/hang/terminate (no additional output).
Would you say that if sync worked OK it is safe to assume that the filesystem was not frozen at that point?
Or should I add some safeguards to make sure the filesystem is not frozen before I even start ec2-consistent-snapshot? https://stackoverflow.com/questions/10096183/how-do-you-determine-if-an-xfs-filesystem-is-frozen-programmatically shows a hack for how this might be accomplished.
I tried to run ec2-consistent-snapshot after manually freezing the filesystem beforehand.
Here's what happens:
ec2-consistent-snapshot: Using AWS access key: AKIAIKPVFE3YOBXAIHBQ
ec2-consistent-snapshot: Fri Nov 2 10:55:03 2018: No volume ids specified; discovering volume ids
ec2-consistent-snapshot: Fri Nov 2 10:55:03 2018: Discovering volume ids for: /vol
ec2-consistent-snapshot: Fri Nov 2 10:55:03 2018: Determining instance id
ec2-consistent-snapshot: Fri Nov 2 10:55:03 2018: create EC2 object
ec2-consistent-snapshot: Endpoint: https://ec2.eu-west-1.amazonaws.com
ec2-consistent-snapshot: Fri Nov 2 10:55:03 2018: Fetching instance description for i-d5eaca5e
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: Found EBS block devices for i-d5eaca5e:
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: vol-cea5d03d /dev/sda1
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: vol-8f36417c /dev/sdf
ec2-consistent-snapshot: Using description 'ec2-consistent-snapshot /vol on i-d5eaca5e (mafalde)' for all snapshot descriptions
ec2-consistent-snapshot: Using tag 'Name=mafalde /vol;host=mafalde;mount=/vol' for all snapshot tags
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: MySQL connect as debian-sys-maint
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: MySQL flush
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: MySQL flush & lock
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: executing 'sync'
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: executing 'fsfreeze -f /vol'
fsfreeze: /vol: freeze failed: Device or resource busy
ec2-consistent-snapshot: ERROR: fsfreeze -f /vol: failed(256)
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: create EC2 object
ec2-consistent-snapshot: Endpoint: https://ec2.eu-west-1.amazonaws.com
ec2-consistent-snapshot: volume_id: vol-8f36417c; description: ec2-consistent-snapshot /vol on i-d5eaca5e (mafalde)
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: aws ec2 create-snapshot vol-8f36417c
snap-046e08b3c10b6c86f
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: executing 'fsfreeze -u /vol'
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: MySQL unlock
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: MySQL disconnect
ec2-consistent-snapshot: Fri Nov 2 10:55:04 2018: done
So, even when the fsfreeze call fails, the script would try to continue. I don't know if this is intended or a bug in error handling. However, this still suggests that either the script terminates when trying to call fsfreeze, or the system() call never returns. 🤷♂️
I had other problems with this tool and ended up porting ec2-consistent-snapshot to bash and simplifying it. You can find the result here:
https://github.com/RideAmigosCorp/ec2-consistent-snapshot.sh
One thing you'll see in the source is that we "trap" interrupts and errors and make sure to unfreeze in those cases as well. As part of switching to this tool, I'm no longer helping to maintain the Perl version.
The Perl version depends on a stack that is not supported by AWS, while the bash version uses the official aws CLI, which is supported by AWS.
@ehammond Is the END block in Perl really powerful enough to catch interrupts and/or signals, or does it deal only with “soft” errors like exceptions?
@mpdude No need to ask @ehammond, the official docs on END are clear enough on this point:
An END code block is executed as late as possible, that is, after perl has finished running the program and just before the interpreter is being exited, even if it is exiting as a result of a die() function. (But not if it's morphing into another program via exec, or being blown out of the water by a signal--you have to trap that yourself (if you can).) You may have multiple END blocks within a file--they will execute in reverse order of definition; that is: last in, first out (LIFO). END blocks are not executed when you run perl with the -c switch, or if compilation fails.
So no, it won't catch signals, like SIGINT or SIGTERM.
The Perl script has some signal handlers installed that convert the caught signals into a die, which would be caught by the END block. Although the signal handlers work globally, they are currently only set up if the MySQL option is used, as seen here:
https://github.com/alestic/ec2-consistent-snapshot/blob/master/ec2-consistent-snapshot#L205
So the code can definitely fail to clean up and unfreeze if it receives a signal without the MySQL option being set.
I don't think the Bash port has that problem: it unconditionally sets up a signal "trap" before starting. Perhaps there are additional signals that should be trapped, though. Patches welcome.
https://github.com/RideAmigosCorp/ec2-consistent-snapshot.sh/blob/master/ec2-consistent-snapshot
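As a rough illustration of the "trap unconditionally" idea (not a copy of the linked script), the sketch below installs the handlers before any freeze happens and turns INT and TERM into a normal exit so that the EXIT handler runs. The names `install_thaw_trap`, `thaw_quietly`, `THAW_MOUNT_POINT` and `FSFREEZE` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: install the thaw handler unconditionally, before freezing.
# All names below are hypothetical and only serve the illustration.

FSFREEZE="${FSFREEZE:-/sbin/fsfreeze}"

# Best-effort thaw; errors from an already-thawed file system are ignored.
thaw_quietly() {
    "$FSFREEZE" --unfreeze "$THAW_MOUNT_POINT" 2>/dev/null || true
}

# Usage: install_thaw_trap MOUNT_POINT
install_thaw_trap() {
    THAW_MOUNT_POINT="$1"
    # EXIT covers normal termination and the `exit` calls below.
    trap 'thaw_quietly' EXIT
    # Convert common termination signals into an exit so EXIT fires.
    trap 'exit 130' INT
    trap 'exit 143' TERM
}
```

SIGKILL cannot be trapped, so a process killed that way (for example by the OOM killer) would still leave the file system frozen; an external unfreeze attempt remains useful as a second line of defence.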
Thanks for submitting this. Unfortunately, this project is no longer under development in this repo. Anybody is welcome to fork the project and continue development if there is interest.
I repeatedly had to intervene on machines because of a locked file system that caused lots of processes to get stuck in the D (waiting for disk I/O) state.
I found that before the problems started, a cron-scheduled run of ec2-consistent-snapshot left the following in the log.
This aborts in the middle of the execution, i.e. no additional messages are logged by this process afterwards.
My assumption is that something goes so terribly wrong that the program dies/exits and leaves a frozen file system behind.
I don't write Perl myself, but from looking at the code it seems that care has been taken to make sure the file system will be thawed when the program exits.
This safeguard probably does not help when the script is killed by a signal, in particular by the OOM killer. But I have checked my logs and cannot find any evidence that this happened.
Another observation is that the call to fsfreeze is actually logged before it is executed. That leaves a slight chance that the freeze never actually succeeded, maybe because fsfreeze itself was stuck forever? I don't know if this is possible, for example under heavy disk I/O, but even if so, then something else must have frozen the fs.
Any ideas what might cause this or how I could gather more details when this happens the next time?
I am using the 0.68 version as shipped in Ubuntu.