EndPointCorp / end-point-blog

End Point Dev blog
https://www.endpointdev.com/blog/

Comments for PostgreSQL EC2/EBS/RAID 0 snapshot backup #272

Open phinjensen opened 6 years ago

phinjensen commented 6 years ago

Comments for https://www.endpointdev.com/blog/2010/02/postgresql-ec2-ebs-raid0-snapshot/ By Jon Jensen

To enter a comment:

  1. Log in to GitHub
  2. Leave a comment on this issue.
phinjensen commented 6 years ago
original author: Ethan Rowe
date: 2010-02-23T14:12:52-05:00

Thanks for writing it up, Jon. I'm psyched you dug into this so much (as you well know). :)

One of the things we had wondered about was how RAID would respond to inconsistencies in the underlying volumes owing to the lack of atomicity inherent in snapshots of independent EBS devices.

The choice to give it a try is informed, at least in my view, by the principle that a RAID controller that cannot deal with inconsistencies in the array members is a RAID controller that can't work in production anyway.

Thanks again.

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2010-02-23T14:17:17-05:00

Thanks, Ethan.

Not only should the RAID controller be able to deal with it, I'm not sure there's any "it" to deal with. Though of course this is software RAID, so the "controller" is just another block device layer, not an actual hardware controller.

Why would the RAID metadata ever change unless the administrator specifically does something to change it? The on-disk state of anything having to do with RAID shouldn't be volatile at all.

The data within the RAID "container" is volatile, of course, but that's operating system block device-level stuff that is a matter for the filesystem.

So the race condition, such as it is, revolves primarily around filesystem metadata (of which there is very little if atime updates are off and no files are being created or unlinked).

phinjensen commented 6 years ago
original author: ajaya
date: 2010-02-23T16:38:53-05:00

Have you looked at ec2-consistent-snapshot from Eric Hammond, which can be adapted to Postgres?

I saw a link around https://code.launchpad.net/~adler/ec2-consistent-snapshot/postgresql

phinjensen commented 6 years ago
original author: Anonymous
date: 2010-02-23T17:38:07-05:00

I have been meaning to push up some code to that launchpad project. My guess is that it would be safer to 1) keep the WAL on a separate fs, 2) checkpoint or start_backup, 3) xfs_freeze the data fs, 4) snapshot all EBS devices, and 5) unfreeze the data fs. Only lightly tested, though.
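A minimal sketch of that sequence (steps 2-5), assuming the WAL already sits on a separate, unfrozen filesystem and that psycopg2 and boto3 are available; the volume IDs, data path, and backup label below are placeholders, not anything from the post:

```python
# Hypothetical sketch only; volume IDs and paths are placeholders and
# error handling is minimal.
import subprocess

import boto3
import psycopg2

DATA_FS = "/var/lib/pgsql/data"                              # XFS fs holding the data dir
DATA_VOLUMES = ["vol-aaa", "vol-bbb", "vol-ccc", "vol-ddd"]  # RAID 0 members

ec2 = boto3.client("ec2")
conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True
cur = conn.cursor()

# 2) Tell Postgres a base backup is starting (forces a checkpoint).
cur.execute("SELECT pg_start_backup(%s)", ("ebs-snapshot",))
try:
    # 3) Freeze the data filesystem so no further writes reach the volumes.
    subprocess.check_call(["xfs_freeze", "-f", DATA_FS])
    try:
        # 4) Initiate a snapshot of every EBS device in the array.
        for vol in DATA_VOLUMES:
            ec2.create_snapshot(VolumeId=vol, Description="pg base backup")
    finally:
        # 5) Unfreeze as soon as the snapshot requests are initiated.
        subprocess.check_call(["xfs_freeze", "-u", DATA_FS])
finally:
    # End the backup; WAL written in the meantime lives on the separate fs (step 1).
    cur.execute("SELECT pg_stop_backup()")
```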

phinjensen commented 6 years ago
original author: jason
date: 2010-02-23T19:28:01-05:00

Why not use LVM on top of the RAID and use an LVM snapshot? That would be consistent.

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2010-02-23T19:32:32-05:00

Jason, we thought about using LVM and yes, it would be consistent, but it would have to be done on the host in question, and it wouldn't help offload any I/O from the already I/O-saturated host.

Unless there's some way to share the same LVM block devices from multiple hosts that I don't know about?

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2010-02-23T19:36:38-05:00

Ajaya, that project looks interesting so I'll check it out. Thanks for the link.

Adler: Doesn't xfs_freeze block all filesystem writes? That still may be better than shutting down Postgres altogether during the snapshot, but it's going to add at least a little downtime. I'd like to try it.

phinjensen commented 6 years ago
original author: Anonymous
date: 2010-02-24T07:16:02-05:00

Yes, that will block all writes, but it should only take a few seconds to initiate the snapshots and since the WAL is not blocked, the backends will continue to accept most queries in the meantime. This way you are guaranteed that your snapshots are consistent.

ec2-consistent-snapshot will time out if any one snapshot request takes more than 10 seconds (by default) to initiate. So that gives you a ceiling on what pauses may happen if EBS is not acting as expected.
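As a hypothetical illustration of that ceiling (not code from ec2-consistent-snapshot): initiate all the snapshots in parallel and wait only a bounded time for each request to be accepted. The volume IDs are placeholders and boto3 is assumed:

```python
# Hypothetical sketch; volume IDs are placeholders.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

import boto3

VOLUMES = ["vol-aaa", "vol-bbb", "vol-ccc", "vol-ddd"]
TIMEOUT = 10  # seconds; mirrors ec2-consistent-snapshot's default ceiling

ec2 = boto3.client("ec2")

with ThreadPoolExecutor(max_workers=len(VOLUMES)) as pool:
    futures = [pool.submit(ec2.create_snapshot, VolumeId=v) for v in VOLUMES]
    try:
        # Wait up to TIMEOUT seconds for each snapshot request to be accepted.
        snapshot_ids = [f.result(timeout=TIMEOUT)["SnapshotId"] for f in futures]
    except TimeoutError:
        # EBS is not acting as expected; abort (and unfreeze) rather than hang.
        raise
```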

-adler

phinjensen commented 6 years ago
original author: Robert Treat
date: 2010-02-24T11:57:36-05:00

We've been doing these kinds of experiments at OmniTI, using ZFS snapshots, for years, so it's nice to see others getting into the game.

If you're going to run these in production, you really should build it on top of the pitr facility, and your thinking is on track wrt using the pg_start/stop backup facilities and grabbing the xlogs last during that time.

We normally build these on top of running pitr instances anyway, but a simpler solution for standalone systems might be to just use /bin/false and grab the xlog dir; I'd probably need to do some experiments on that before recommending it.

Anyway, nice write up!

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2010-02-24T12:31:16-05:00

Thanks for the note, Robert.

I guess I didn't clearly state that we've been doing this in production for years using LVM2 and NetApp snapshots, depending on the client's hardware.

What I was doing here was trying the same thing on EBS with 4 atomic snapshots that together make for a non-atomic RAID 0 snapshot, which isn't theoretically pure but did work anyway.

phinjensen commented 6 years ago
original author: Greg Smith
date: 2010-02-24T12:43:59-05:00

Matching Robert's suggestion: just because you've been doing this successfully for a while doesn't make me cringe less. If you're using pg_start_backup, you really should be saving the archive segments it generates while doing the snapshot shuffle and getting a completely clean copy that goes through recovery properly. I'm sure the database comes up fine anyway most of the time. Murphy's law says the one time it doesn't will inevitably be the time you actually need that backup functional the most.

Just providing a minimal archive_command and saving its output avoids all this concern about whether your snapshot is perfectly atomic. That sidesteps the concerns that might make you think you want LVM (which is never a good thing to introduce into an already working system, due to its overhead) or want to freeze XFS (always scary and disruptive).
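As a hypothetical illustration of that approach (none of this is from the post): archive_command could point at a small script that copies each completed WAL segment somewhere durable, for example S3 via boto3. The bucket name and script path below are placeholders:

```python
#!/usr/bin/env python
# Hypothetical WAL-archiving script, e.g. invoked by Postgres as
#   archive_command = '/usr/local/bin/archive_wal.py %p %f'
# Postgres retries the segment later if this exits non-zero.
import sys

import boto3

BUCKET = "example-wal-archive"  # placeholder bucket name


def main():
    wal_path, wal_name = sys.argv[1], sys.argv[2]  # %p and %f from Postgres
    s3 = boto3.client("s3")
    s3.upload_file(wal_path, BUCKET, "wal/" + wal_name)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```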

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2010-02-24T13:22:42-05:00

Greg: Yes, as I noted in my conclusion, I would definitely use the stored WAL files during the backup if this were in production.

Our production snapshots (for other clients) with LVM have worked out very well and are fully consistent thanks to being on a single snapshottable block device, so we're very happy with those but just can't do that with the 4-device EBS setup here.

phinjensen commented 6 years ago
original author: Josh Berkus
date: 2010-02-24T18:36:06-05:00

Greg,

The problem we've had with EBS wasn't average throughput, it was minimum throughput; that is, even with RAIDed EBS, sometimes I/O would drop to nothing due to competing users of the cloud. Only for brief periods, but they were still sufficient to make database requests time out. Have you not had this experience?

phinjensen commented 6 years ago
original author: Ethan Rowe
date: 2010-02-25T14:27:32-05:00

Josh (Berkus):

If memory serves, we have experienced the throughput issues you've described, but when using the stock local storage such as it is on an EC2 instance. I do not think it's been a real problem for us since going to the 4-EBS-volume RAID0 configuration. However, that doesn't mean it can't or won't happen.

To that end, we had the RAID0 volume simply stop responding at one point, necessitating a failover to the warm standby. The source of the failure was a mystery, and could have been any one of XFS, the RAID software, or EBS itself.

phinjensen commented 6 years ago
original author: Log Buffer
date: 2010-02-26T16:20:20-05:00

[...]Jon Jensen of End Point’s Blog posts a HOWTO on PostgreSQL EC2/EBS/RAID 0 snapshot backup.[...]

Log Buffer #180

phinjensen commented 6 years ago
original author: Anonymous
date: 2010-03-01T10:23:32-05:00

For what it's worth, my experience is that the xfs fs will often not be recoverable if it is not frozen (or "quiesced", in file system parlance) during the window when the multiple underlying devices are snapshotted. You were able to get this down to ~1 second by initiating the snapshots in parallel, but there is no guarantee (that I know of) that that is good enough. You seem to understand this, but it's not spelled out in your post.

I imagine that Linux multi-disk has some tolerance for recovery from non-atomic situations, but it may just involve some luck.

It would be nice if amazon would provide enhancements to the Linux file-systems and/or the RAID drivers to help deal with this issue. Or even better, they could provide a layer on top of EBS that manages all this for you.

-adler

phinjensen commented 6 years ago
original author: Cloud computing
date: 2010-07-29T11:59:53-04:00

I guess, as with all mainstream emerging technologies, there are still bugs to iron out. Yet, as you've demonstrated, there's always a clever workaround. Thanks for posting, it's a nice, insightful bit of reading.

John.

phinjensen commented 6 years ago
original author: syrnick
date: 2012-02-21T17:17:37-05:00

Awesome research!

Do you have any follow-ups to this? Was 9.x better with these snapshots? Has it been working fine ever since? Were there any insurmountable/unforeseen challenges?

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2012-02-22T01:06:22-05:00

syrnick: Thanks for the note. Nope, no follow-ups right now. I usually try to steer people running mission-critical Postgres setups away from AWS regardless of the type of storage, since it almost always has worse I/O than a nonvirtualized system with standard direct-attached drives, or a SAN. People are still using AWS for Postgres, but I think it's more work and less reliable than makes sense as a default.

phinjensen commented 6 years ago
original author: syrnick
date: 2012-02-22T01:17:37-05:00

We're on AWS already, but I'd love to set up the backups exactly as you described. In fact, we already have a chef-based snapshotter ready for this (with EBS snapshots and S3 WAL archiving), but I haven't fully tested the recovery. So your post was quite inspiring.

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2012-02-22T01:50:48-05:00

Ah, cool. Well, I'd say test the recovery of a snapshot, say, once a week for a month to build up your confidence in both the snapshotting and the recovery strategy, and you should be good!
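A hypothetical sketch of the first step of such a drill: recreate fresh EBS volumes from each array member's most recent snapshot. The volume IDs and availability zone are placeholders; attaching the new volumes to a test instance, assembling the array, and letting Postgres recover would follow.

```python
# Hypothetical recovery-drill sketch; volume IDs and AZ are placeholders.
import boto3

SOURCE_VOLUMES = ["vol-aaa", "vol-bbb", "vol-ccc", "vol-ddd"]
AZ = "us-east-1a"

ec2 = boto3.client("ec2")

restore_volumes = []
for vol in SOURCE_VOLUMES:
    # Find the most recent snapshot taken of this array member.
    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [vol]}],
    )["Snapshots"]
    latest = max(snaps, key=lambda s: s["StartTime"])
    # Create a fresh volume from it for the test restore.
    created = ec2.create_volume(SnapshotId=latest["SnapshotId"],
                                AvailabilityZone=AZ)
    restore_volumes.append(created["VolumeId"])

print("Volumes to attach to the test instance:", restore_volumes)
```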