Program aborts, terminates or fails (?) and leaves frozen filesystem behind

mpdude commented 6 years ago

I repeatedly had to intervene on machines because of a locked file system that caused lots of processes stuck in the D (waiting for disk I/O) state.

I found that before the problems started, a cron-scheduled run of ec2-consistent-snapshot left the following in the log.

ec2-consistent-snapshot: Using AWS access key: ...
ec2-consistent-snapshot: Sat Mar 17 14:33:04 2018: No volume ids specified; discovering volume ids
ec2-consistent-snapshot: Sat Mar 17 14:33:04 2018: Discovering volume ids for: /vol
ec2-consistent-snapshot: Sat Mar 17 14:33:04 2018: Determining instance id
ec2-consistent-snapshot: Sat Mar 17 14:33:04 2018: create EC2 object
ec2-consistent-snapshot: Endpoint: https://ec2.eu-west-1.amazonaws.com
ec2-consistent-snapshot: Sat Mar 17 14:33:04 2018: Fetching instance description for i-...
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018: Found EBS block devices for i-...: 
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018:     vol-... /dev/sda1
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018:     vol-... /dev/sdf
ec2-consistent-snapshot: Using description '...' for all snapshot descriptions
ec2-consistent-snapshot: Using tag '...' for all snapshot tags
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018: MySQL connect as debian-sys-maint
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018: MySQL flush
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018: MySQL flush & lock
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018: sync
ec2-consistent-snapshot: Sat Mar 17 14:33:05 2018: /sbin/fsfreeze -f /vol

This aborts in the middle of the execution, i. e. no additional messages are logged by this process afterwards.

My assumption is that something goes so terribly wrong that the program dies/exits and leaves a frozen file system behind.

I am not writing Perl myself, but from looking at the code it seems that care has been taken to make sure the file system will be thawed when the program exits.

This safeguard does probably not help when the script is killed by a signal, in particular by the OOM killer. But I have checked my logs and cannot find any evidence that this happened.

Another observation is that the call to fsfreeze is actually logged before it is executed. That leaves a slight chance that the freeze never actually succeeded, maybe because fsfreeze itself was stuck forever? I don't know if this is possible, for example under heavy disk I/O – but even if so, then something else must have frozen the fs.

Any ideas what might cause this or how I could gather more details when this happens the next time?

I am using the 0.68 version as shipped in Ubuntu.

markstos commented 6 years ago

You could try running just the fsfreeze command in isolation and see if you can reproduce the issue outside of ec2-consistent-snapshot.

mpdude commented 6 years ago

That works, as almost all snapshots do.

It‘s just one in 500 or thousand that fails :-(

markstos commented 6 years ago

Wow, you've got some scale :) At that frequency, I wonder if there are some hiccups in the EBS storage system. It would be nice in that case if there was a more graceful solution than just hanging though. At this point, I'm not sure what else ec2-consistent-snapshot could do to help. The Perl couldn't be freezing the file system, it must be fsfreeze getting stuck. Are you using ext4 or another filesystem?

mpdude commented 6 years ago

No, ain’t that big. Just make snapshots every other hour on a dozen machines and wait a few weeks :-)

I am using xfs.

Is EBS really involved at that time? My assumption was that freezing the FS was putting a lock or something in place, nothing fancy that would not even go out to EBS...

ehammond commented 6 years ago

I recommend adding a command line unfreeze command that runs every time after the ec2-consistent-snapshot command completes. This way it gets freed up no matter what state it is in when the program terminates.

markstos commented 6 years ago

Looks the command would be fsfreeze --unfreeze $mount_point. From local testing, it looks we need to ignore the return value, as running --unfreeze on a non-frozen file system generates an error (but otherwise seems safe to do).

@ehammond were you suggesting that ec2-consistent-snapshot be patched along those lines, or that users should use this pattern in their own scripts?

mpdude commented 6 years ago

What a simple yet effective solution, should have thought of that myself.

My understanding is that this needs to happen outside ec2-consistent-snapshot, as we're dealing with the (remotely possible?) case that the script fails in a way that does not call the cleanup/shutdown hooks.

As I had a wrapper script anyway that was called from cron, I simply added

/sbin/fsfreeze -u $MOUNTPOINT 2>/dev/null || true

markstos commented 6 years ago

@mpdude Looks good to me, although I'm more likely to spell out flags in scripts like --unfreeze. Scripts are written once but read many times and the extra clarity may help a future reader who doesn't immediately remember or recognize what the -u flag would do.

mpdude commented 6 years ago

Thanks to all involved for your suggestions!

ehammond commented 6 years ago

Yep. Just make sure the unfreeze gets run even if the previous program fails. For example special care must be taken if it's a bash script with the -e flag that exits on any error.

I think we did all we can do to catch program exiting and unfreeze inside the program.

mpdude commented 6 years ago

Reopening this as the problem has occured twice since I added the additional unfreeze attempt in my invoking script :-(.

I've added an additional log message at the end of run_command so that we can hopefully see if the programs were executed successfully.

Like before, the last successful execution of ec2-consistent-snapshot terminated with no problems, including unfreezing the filesystem, unlocking MySQL etc. The next invocation logs sync and /sbin/fsfreeze -f /vol afterwards, where execution seems to stop/hang/terminate (no additional output).

Would you say that if sync worked OK it is safe to assume that the filesystem was not frozen at that point?

Or should I add some safeguards to make sure the filesystem is not frozen before I even start ec2-consistent-snapshot? https://stackoverflow.com/questions/10096183/how-do-you-determine-if-an-xfs-filesystem-is-frozen-programmatically shows a hack how this might be accomplished.

mpdude commented 6 years ago

I tried to run ec2-consistent-snapshot after manually freezing the filesystem before.

Here's what happens:

ec2-consistent-snapshot: Using AWS access key: AKIAIKPVFE3YOBXAIHBQ
ec2-consistent-snapshot: Fri Nov  2 10:55:03 2018: No volume ids specified; discovering volume ids
ec2-consistent-snapshot: Fri Nov  2 10:55:03 2018: Discovering volume ids for: /vol
ec2-consistent-snapshot: Fri Nov  2 10:55:03 2018: Determining instance id
ec2-consistent-snapshot: Fri Nov  2 10:55:03 2018: create EC2 object
ec2-consistent-snapshot: Endpoint: https://ec2.eu-west-1.amazonaws.com
ec2-consistent-snapshot: Fri Nov  2 10:55:03 2018: Fetching instance description for i-d5eaca5e
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: Found EBS block devices for i-d5eaca5e:
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018:     vol-cea5d03d /dev/sda1
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018:     vol-8f36417c /dev/sdf
ec2-consistent-snapshot: Using description 'ec2-consistent-snapshot /vol on i-d5eaca5e (mafalde)' for all snapshot descriptions
ec2-consistent-snapshot: Using tag 'Name=mafalde /vol;host=mafalde;mount=/vol' for all snapshot tags
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: MySQL connect as debian-sys-maint
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: MySQL flush
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: MySQL flush & lock
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: executing 'sync'
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: executing 'fsfreeze -f /vol'
fsfreeze: /vol: freeze failed: Device or resource busy
ec2-consistent-snapshot: ERROR: fsfreeze -f /vol: failed(256)
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: create EC2 object
ec2-consistent-snapshot: Endpoint: https://ec2.eu-west-1.amazonaws.com
ec2-consistent-snapshot: volume_id: vol-8f36417c; description: ec2-consistent-snapshot /vol on i-d5eaca5e (mafalde)
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: aws ec2 create-snapshot vol-8f36417c
snap-046e08b3c10b6c86f
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: executing 'fsfreeze -u /vol'
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: MySQL unlock
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: MySQL disconnect
ec2-consistent-snapshot: Fri Nov  2 10:55:04 2018: done

So, even when the fsfreeze call fails, the script would try to continue. I don't know if this is intended or a bug in error handling. However, this still suggests that either the script terminates when trying to call fsfreeze, or the system() call never returns. 🤷‍♂️

markstos commented 6 years ago

I had other problems with this tool and ended up porting ec2-consistent-snapshot to bash and simplifying it. You can find the result here:

https://github.com/RideAmigosCorp/ec2-consistent-snapshot.sh

One thing you'll see in the source is that we "trap" interrupts and errors and make sure to unfreeze in those cases as well. As part of switching to this tool, I'm not longer helping to maintain the Perl version.

The Perl version depends on a stack that is not supported by AWS, while the bash version uses the official aws CLI, which is supported by AWS.

mpdude commented 6 years ago

@ehammond Is the END block in Perl really powerful enough to catch interrupts and/or signals, or does it deal only with “soft” errors like exceptions?

markstos commented 6 years ago

@mpdude No need to ask @ehammond, the official docs on END are clear enough on this point:

An END code block is executed as late as possible, that is, after perl has finished running the program and just before the interpreter is being exited, even if it is exiting as a result of a die() function. (But not if it's morphing into another program via exec, or being blown out of the water by a signal--you have to trap that yourself (if you can).) You may have multiple END blocks within a file--they will execute in reverse order of definition; that is: last in, first out (LIFO). END blocks are not executed when you run perl with the -c switch, or if compilation fails.

So no, it won't catch signals, like SIGINT or SIGTERM.

The Perl script has some signal handlers installed that do convert the caught signals into a die which would be caught by the END block. Although the signal handlers work globally, they are only setup currently if the MySQL option is used, as seen here:

https://github.com/alestic/ec2-consistent-snapshot/blob/master/ec2-consistent-snapshot#L205

So the code definitely does have the possibility of failing to clean-up and unfreeze if it receives a signal without the MySQL option being set.

I don't think the Bash port has that problem, it unconditionally sets up a signal "trap" before starting. Perhaps there are additional signals that should be trapped, though. Patches welcome.

https://github.com/RideAmigosCorp/ec2-consistent-snapshot.sh/blob/master/ec2-consistent-snapshot

ehammond commented 4 years ago

Thanks for submitting this. Unfortunately, this project is no longer under development in this repo. Anybody is welcome to fork the project and continue development if there is interest.

alestic / ec2-consistent-snapshot

Program aborts, terminates or fails (?) and leaves frozen filesystem behind #96