archiecobbs / s3backer

FUSE/NBD single file backing store via Amazon S3

Data corruption when using NBD mode #184

Closed. Nikratio closed this issue 2 years ago.

Nikratio commented 2 years ago

As far as I can tell, there is something that corrupts data when running s3backer in NBD mode.

I am running ZFS on top of the NBD devices, and I'm getting frequent errors that I think can only be explained by data corruption. For example, when running examples/create_zpool.py and then trying to import the fresh zpool, I have now several times gotten errors that vdevs had the wrong GUID, or just general "I/O errors".

When replacing NBD's s3backer backend with the file backend, the problems all seem to go away.

The errors are exactly of the sort that I would expect from eventual consistency (I think s3backer's protection mechanisms do not help when s3backer is restarted), but Amazon is pretty clear about offering strong consistency for everything: https://aws.amazon.com/s3/consistency/

Therefore, the only explanation I have is that something is not working right in the NBD-s3backer read or write path.

I was thinking about creating a simple unit test that writes data through the NBD interface (but not using the NBD server itself), reads it back, and confirms that the contents are correct. However, I strongly suspect that the problem is not that straightforward and it is something about the sequence of operations that ZFS performs.
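
A cruder but easy variant would be to round-trip random data through the full kernel NBD path from user space. A minimal sketch (the device path and sizes are assumptions, and the device must not contain anything worth keeping):

```python
# Round-trip check against an already-connected NBD device backed by s3backer.
# Writes random 4 KiB blocks at random offsets, syncs, drops the page cache,
# then reads everything back and compares.
import os
import random

DEV = "/dev/nbd0"     # assumed device path
BLOCK = 4096          # request size used for the test
COUNT = 256           # number of random blocks to exercise
SPAN = 1 << 30        # restrict the test to the first 1 GiB

fd = os.open(DEV, os.O_RDWR)
try:
    written = {}
    for _ in range(COUNT):
        off = random.randrange(0, SPAN, BLOCK)
        buf = os.urandom(BLOCK)
        os.pwrite(fd, buf, off)
        written[off] = buf
    os.fsync(fd)
    # Best-effort attempt to make the reads hit the NBD server instead of the
    # page cache (O_DIRECT with aligned buffers would be stricter).
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    for off, buf in written.items():
        if os.pread(fd, BLOCK, off) != buf:
            print(f"mismatch at offset {off}")
finally:
    os.close(fd)
```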

Is s3backer executing NBD requests synchronously, or are they deferred to background threads? Could there be an ordering issue between TRIM and WRITE requests, or something like that?

archiecobbs commented 2 years ago

How data is written back is determined by the way it's configured, e.g., --blockCacheWriteDelay. If you eliminate the block cache altogether then reads and writes should be straightforward.

To verify it's not an eventual consistency problem, you could turn on/off the ec_protect layer (via --md5CacheSize).

You can also try to run nbdkit with --threads 1 to see if that changes anything.

We implement NBD TRIM synchronously (see s3b_nbd_plugin_trim()), so assuming S3 is truly consistent then that should be OK. But if you suspect TRIM is involved, just comment out that function and see if that fixes it.
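
To illustrate what "synchronously" means here (a sketch only, not the actual C code in s3b_nbd_plugin_trim(); the block-to-object naming and the delete-based representation of trimmed blocks are assumptions): the trim callback only returns after every covered block has been handled.

```python
# Sketch of a synchronous trim over S3 (illustration only; names and the
# delete-based representation of trimmed blocks are assumptions).
import boto3

s3 = boto3.client("s3")
BUCKET = "nikratio-backup"   # placeholder
BLOCK_SIZE = 32 * 1024

def trim(offset, length):
    # Only blocks fully covered by the trimmed range are dropped; partial
    # coverage at the edges is left alone in this sketch.
    first = (offset + BLOCK_SIZE - 1) // BLOCK_SIZE
    last = (offset + length) // BLOCK_SIZE          # exclusive
    for blk in range(first, last):
        # Each delete completes before the next one starts, and the function
        # returns only after the whole range has been processed.
        s3.delete_object(Bucket=BUCKET, Key=f"block-{blk:08x}")
```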

There are ways we can gather more info. For example, create the filesystem with NBD, verify the problem, then unmount and remount with FUSE. If the corruption goes away, then it's a read problem, not a write problem.

Make sure you're not connecting multiple clients to the same server.

Try commenting out s3b_nbd_plugin_can_multi_conn()...

Nikratio commented 2 years ago

Here's what happens when trying to import directly after export (without restarting s3backer):

# nbdkit --unix /tmp/tmpr9hete6o/nbd_socket_sb --foreground --filter=exitlast --threads 16 s3backer s3b_region=eu-west-2 s3b_size=50G s3b_force=true bucket=nikratio-backup/sb s3b_blockSize=32K &
nbdkit: auto-detecting block size and total file size...
nbdkit: auto-detected block size=32k and total size=50g
2022-05-09 05:52:18 INFO: established new mount token 0x5872d5fa
2022-05-09 05:52:18 INFO: mounting nikratio-backup

# nbdkit --unix /tmp/tmpr9hete6o/nbd_socket_lb --foreground --filter=exitlast --threads 16 s3backer s3b_region=eu-west-2 s3b_size=50G s3b_force=true bucket=nikratio-backup/lb s3b_blockSize=512K &
nbdkit: auto-detecting block size and total file size...
nbdkit: auto-detected block size=512k and total size=50g
2022-05-09 05:52:18 INFO: established new mount token 0x19b9d2dc
2022-05-09 05:52:18 INFO: mounting nikratio-backup

# nbd-client -unix /tmp/tmpr9hete6o/nbd_socket_sb /dev/nbd0
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 51200MB
Connected /dev/nbd0

# nbd-client -unix /tmp/tmpr9hete6o/nbd_socket_lb /dev/nbd1
Warning: the oldstyle protocol is no longer supported.
This method now uses the newstyle protocol with a default export
Negotiation: ..size = 51200MB
Connected /dev/nbd1

# zpool create -f -R /zpools -o ashift=9 -o autotrim=on -o failmode=continue -O acltype=posixacl -O relatime=on -O xattr=sa -O compression=zstd-19 -O checksum=sha256 -O sync=disabled -O special_small_blocks=131072 -O redundant_metadata=most -O recordsize=131072 -O encryption=on -O keyformat=passphrase -O keylocation=file:///home/nikratio/lib/s3_zfs.key s3backup /dev/nbd1 special /dev/nbd0

# zfs create s3backup/vostro
# zpool export s3backup
# sync

# zpool import -R /zpools -d /dev/nbd1 -d /dev/nbd0 s3backup
cannot import 's3backup': no such pool available

2022-05-09 05:53:03 INFO: unmount nikratio-backup: initiated
2022-05-09 05:53:03 INFO: unmount nikratio-backup: shutting down filesystem
2022-05-09 05:53:03 INFO: unmount nikratio-backup: clearing mount token
2022-05-09 05:53:04 INFO: unmount nikratio-backup: completed

# dmesg | tail
May 09 05:52:18 vostro.rath.org kernel: nbd0: detected capacity change from 0 to 104857600
May 09 05:52:19 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 104857600
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 32 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 96 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 160 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 224 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 2
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 544 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 0
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd1, sector 32 op 0x0:(READ) flags 0x0 phys_seg 28 prio class 2
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 608 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 0
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 672 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 0
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd0, sector 736 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: I/O error, dev nbd1, sector 544 op 0x0:(READ) flags 0x0 phys_seg 28 prio class 0
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd1: Other side returned error (108)
May 09 05:53:03 vostro.rath.org kernel: block nbd0: Receive control failed (result -32)

Nikratio commented 2 years ago

I think the first nbdkit instance must have crashed, but it didn't leave behind a core file anywhere...

archiecobbs commented 2 years ago

So it looks like you have a reliable recipe for reproducing this - which is good. Now let's try changing variables until the problem goes away.

See earlier post for some things to try.

Here's another one to try: set the s3backer block size to 4k in all cases. This should eliminate partial writes because the current NBD code also uses a block size of 4k.

Nikratio commented 2 years ago

For those following along, I have now added a very simple Python & boto based S3 plugin to nbdkit itself (https://gitlab.com/nbdkit/nbdkit/-/tree/master/plugins/S3). This should rule out any issues with memory management, compression, encryption, etc.

I then created a debug version of the plugin (https://gitlab.com/Nikolaus2/nbdkit/-/tree/s3_debug) which maintains a local copy of the bucket and compares contents after each write(), and on each read.
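
In rough outline, the comparison works like this (hypothetical helper names, not the actual plugin code):

```python
# Mirror-and-compare sketch: every object written to the bucket is also kept
# in a local dict, and every read is checked against that mirror.
import boto3

s3 = boto3.client("s3")
BUCKET = "nikratio-backup"   # placeholder bucket name
mirror = {}                  # object key -> bytes last written

def put_block(key, data):
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    mirror[key] = data
    # Verify immediately that the write is visible as written.
    back = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    if back != data:
        raise RuntimeError(f"write verification failed for {key}")

def get_block(key):
    data = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    if key in mirror and data != mirror[key]:
        raise RuntimeError(f"read of {key} differs between local and remote")
    return data
```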

Interestingly enough, data errors continue to happen:

Running zpool create -f -R /zpools -o ashift=9 -o autotrim=on -o failmode=continue -O acltype=posixacl -O relatime=on -O xattr=sa -O compression=zstd-19 -O checksum=sha256 -O sync=disabled -O special_small_blocks=131072 -O redundant_metadata=most -O recordsize=131072 -O encryption=on -O keyformat=passphrase -O keylocation=file:///home/nikratio/lib/s3_zfs.key s3backup /dev/nbd1 special /dev/nbd0
nbdkit: python.4: error: /usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin: pread: error: Traceback (most recent call last):
  File "/usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin", line 203, in pread
    raise RuntimeError(
RuntimeError: INTERNAL ERROR[1]: read(off=16384, len=32768) differs between local and remote

nbdkit: python.11: error: /usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin: pread: error: Traceback (most recent call last):
  File "/usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin", line 203, in pread
    raise RuntimeError(
RuntimeError: INTERNAL ERROR[2]: read(off=114688, len=16384) differs between local and remote

nbdkit: python.7: error: /usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin: pread: error: Traceback (most recent call last):
  File "/usr/local/lib/nbdkit/plugins/nbdkit-S3-plugin", line 203, in pread
    raise RuntimeError(
RuntimeError: INTERNAL ERROR[3]: read(off=376832, len=16384) differs between local and remote

However, these errors are only happening in pread(). In other words, when I read the data back from S3 immediately after pwrite(), the contents are as expected, but when they are retrieved again by pread() later on, they have changed.

Work continues...

archiecobbs commented 2 years ago

I love a good bug hunt. Keep us posted.

Nikratio commented 2 years ago

I am now 99% sure that the problem is with Amazon S3, and it looks exactly like the behavior that I would expect from eventual consistency. Data is sometimes old and sometimes fresh.

I am surprised that s3backer's MD5 cache isn't detecting this. But I guess within the same process the request is silently retried until it succeeds, and if s3backer is restarted the old cache is gone, so it can't detect that it is delivering outdated data...

I guess I should ask Amazon support for help, but I'm not looking forward to the experience....

archiecobbs commented 2 years ago

> I am now 99% sure that the problem is with Amazon S3, and it looks exactly like the behavior that I would expect from eventual consistency. Data is sometimes old and sometimes fresh.

Hmmm. I'm skeptical that S3 would be so glaringly in violation of their documented behavior.

> I am surprised that s3backer's MD5 cache isn't detecting this. But I guess within the same process the request is silently retried until it succeeds, and if s3backer is restarted the old cache is gone, so it can't detect that it is delivering outdated data...

Just to clarify terminology, the MD5 cache doesn't actually detect anything. All it does is guarantee consistency if the actual convergence time is less than --md5CacheTime and --minWriteDelay.
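
Roughly speaking, the idea is this (a sketch only, not s3backer's actual code):

```python
# The expected MD5 of each recently written block is kept in memory; reads
# inside the protection window are re-fetched until they match. A restart
# empties the table, so stale data can no longer be recognized afterwards.
import hashlib
import time

MD5_CACHE_TIME = 10.0   # seconds, in the spirit of --md5CacheTime
md5_cache = {}          # block number -> (expected md5, time of write)

def record_write(block_num, data):
    md5_cache[block_num] = (hashlib.md5(data).hexdigest(), time.time())

def verified_read(block_num, fetch):
    data = fetch(block_num)
    entry = md5_cache.get(block_num)
    if entry and time.time() - entry[1] < MD5_CACHE_TIME:
        # Inside the window: retry until the expected content shows up
        # (simplified; no backoff or retry limit here).
        while hashlib.md5(data).hexdigest() != entry[0]:
            data = fetch(block_num)
    return data
```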

For the scenario you're describing to happen, the convergence time would have to be longer than the time it takes you to stop and restart s3backer.

So you're asserting that not only is S3 violating their stated consistency guarantees, they are doing so with a fairly large convergence time.

If you really believe this the next step would be to write a stripped-down test case that proves it.
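
Something as small as this would already be telling (bucket and key names are placeholders):

```python
# Minimal S3 read-after-write probe, independent of s3backer and nbdkit.
# Repeatedly overwrites one object and reports any GET that returns a
# previous generation of the data.
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "nikratio-backup"    # placeholder bucket
KEY = "consistency-probe"

for generation in range(100):
    payload = f"generation-{generation:05d}".encode()
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=payload)
    for attempt in range(10):
        got = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        if got != payload:
            print(f"stale read after writing {payload!r}: got {got!r} "
                  f"(attempt {attempt})")
        time.sleep(0.1)
```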

archiecobbs commented 2 years ago

Dumb question.. do you see the same problem when running with --test instead of over S3?

archiecobbs commented 2 years ago

It just occurred to me that the problem might be due to how s3backer handles writes of partial blocks. Partial writes are handled by reading, patching, and writing back. If there are two simultaneous partial writes that don't overlap but happen to fall within the same s3backer block, then one of them could "lose".
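
The race itself is easy to picture independently of S3 (toy illustration, not s3backer code):

```python
# Lost-update race: two writers each patch a different 4 KiB region of the
# same 32 KiB "block" via read-modify-write, with no locking. Depending on
# the interleaving, the writer that reads first but writes last clobbers the
# other writer's region with stale bytes.
import threading

BLOCK_SIZE = 32 * 1024
store = {"block0": bytes(BLOCK_SIZE)}   # stands in for one S3 object

def partial_write(offset, data):
    old = store["block0"]                                      # 1. read block
    patched = old[:offset] + data + old[offset + len(data):]   # 2. patch copy
    store["block0"] = patched                                  # 3. write back

threads = [
    threading.Thread(target=partial_write, args=(0, b"A" * 4096)),
    threading.Thread(target=partial_write, args=(4096, b"B" * 4096)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

blk = store["block0"]
print("region A intact:", blk[:4096] == b"A" * 4096)
print("region B intact:", blk[4096:8192] == b"B" * 4096)
```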

Can you test whether the problem occurs when you set the s3backer block size to 4K (same as NBD), so that there are never any partial writes?

Nikratio commented 2 years ago

Hmm, I think you are right. This is definitely something that could happen, and may well be what's happening here. You wouldn't see it in FUSE mode because FUSE forces serialized writes. And it explains why I haven't seen this happen when forcing write serialization in NBD (I gave up on that because it was so slow that it caused timeouts in the kernel block layer).

Nikratio commented 2 years ago

There's no way to set an NBD blocksize, just a *maximum* blocksize. And even that doesn't ensure that blocks are aligned, so I can't test it that way. I'll just fix the issue and see if the problem still occurs.

archiecobbs commented 2 years ago

> There's no way to set an NBD blocksize, just a *maximum* blocksize

We pass the -block-size 4096 flag to nbd-client (here). This is what I was referring to, though I admit I'm not clear exactly on what it actually does.

archiecobbs commented 2 years ago

Should hopefully be fixed by d1bce95.