Open onlyjob opened 8 years ago
@onlyjob Sharing a caching device with two dm-writeboost'd devices is invalid.
I once considered this by embedding some unique identifier of the backing device into the super block of the caching device but concluded there is no perfect solution for this.
The first reason is such perfect identifier doesn't exist (neither device name nor uuid are valid).
And the second reason is we should let the userland tools (e.g. your writeboost) to manage invalid operations. There are lot of invalid operations that is theoretically possible but managing them isn't a role of kernel module but the userland tools.
However, looping forever and you resort to rebooting bit sounds to be fixed. So I want to ask you for more details. I couldn't see what you actually did
@onlyjob
I wrote a test to reproduce the error message you saw in dmesg
class REPRO_115 extends DMTestSuite {
test("endless loop when try to use active caching device") {
Memory(Sector.M(8)) { caching => // 1
Memory(Sector.M(128)) { backing1 => // 2
Writeboost.sweepCaches(caching) // 3
Writeboost.Table(backing1, caching).create { s => // 4
Shell(s"dd if=/dev/urandom of=${s.bdev.path} bs=1M count=128") // write // 5
}
} // 6 (when moving out of block, device is removed)
Memory(Sector.M(64)) { backing2 => // 7
Writeboost.Table(backing2, caching).create { s => // 8
Shell(s"dd if=${s.bdev.path} of=/dev/null bs=1M count=128") // read // 9
}
}
}
}
}
[ 3478.223470] attempt to access beyond end of device
[ 3478.223471] loop1: rw=1, want=254400, limit=131072
[ 3478.223497] attempt to access beyond end of device
[ 3478.223498] loop1: rw=1, want=254408, limit=131072
[ 3478.223525] attempt to access beyond end of device
[ 3478.223526] loop1: rw=1, want=254416, limit=131072
[ 3478.223552] attempt to access beyond end of device
[ 3478.223553] loop1: rw=1, want=254424, limit=131072
[ 3478.223580] attempt to access beyond end of device
[ 3478.223581] loop1: rw=1, want=254432, limit=131072
[ 3478.223607] attempt to access beyond end of device
[ 3478.223609] loop1: rw=1, want=254440, limit=131072
[ 3478.223635] attempt to access beyond end of device
[ 3478.223636] loop1: rw=1, want=254448, limit=131072
[ 3478.223663] attempt to access beyond end of device
[ 3478.223664] loop1: rw=1, want=254456, limit=131072
the test code says:
caching
backing1
caching
s
s
s
backing2
(smaller than backing1
)s
without zeroing outs
(but this is irrelevant)the number is corresponding to the comments in the test code so you can understand the code. (not difficult isn't it?)
the dmesg comes from writeback thread because the dirty cache's destination address is far beyond 64MB (it's around 120MB)
Give me lsblk
I believe it would be irrelevant. Too many disks, etc.
What does "active" mean?
"Active" means in use.
I created one dm-writeboosted device (HDD1+SSD1) then I created another dm-writeboosted device (HDD2+SSD1) where caching device SSD1 was mistakenly re-used. Second writeboosted device was constructed when first one was already in use.
The first wb'd device and the second one are backed by a device of difference sizes?
Probably same size. IMHO irrelevant as you can't reliably use size to detect whether device is used.
You created the second wb'd device before removing the first wb'd device? or after?
Actively used dm-writeboost device was not stopped so yes, second device was created before stopping first one.
I'm not concerned about embedding unique ID to caching devices. Here what's important is to check whether caching device is already allocated to another device before constructing new dm-writeboosted device. Only run-time check to prevent deadlock and potential data loss. Thanks.
I wrote a test for the case but there is no error.
test("making another device while in use") {
Memory(Sector.M(8)) { caching =>
Memory(Sector.M(64)) { backing1 =>
Memory(Sector.M(64)) { backing2 =>
Writeboost.sweepCaches(caching)
Writeboost.Table(backing1, caching).create { s =>
Writeboost.Table(backing2, caching).create { s =>
}
}
}
}
}
}
Here what's important is to check whether caching device is already allocated to another device before constructing new dm-writeboosted device. Only run-time check to prevent deadlock and potential data loss. Thanks.
dm_dev has reference count internally but target code can't access it. I don't like to have my own data structure to manage that.
Well, I don't understand how your test work but I wrecked my system once when I improperly constructed writeboosted device by accidentally reusing caching disk of another writeboosted device that was already in use. It was too easy to make such mistake. Safeguard is important to implement for safety. Similar reasons why system prevents attempts to format device with mounted file system -- fool proof... May protect from accidents...
If we maintain a list of pair (backing, caching) where both values are simply the name passed that's only effective between two reboots (or hot-swapping is another issue this can't avoid the problem described here)
I will look deeper the dm code if we can use the internally managed reference count.
Let's try adding FMODE_EXCL flag and see what changes. There is no target adding this flag though.
static int consume_essential_argv(struct wb_device *wb, struct dm_arg_set *as)
{
int err = 0;
struct dm_target *ti = wb->ti;
err = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
&wb->backing_dev);
if (err) {
DMERR("Failed to get backing_dev");
return err;
}
err = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
&wb->cache_dev);
By default the mode
doesn't have FMODE_EXCL in.
static inline fmode_t get_mode(struct dm_ioctl *param)
{
fmode_t mode = FMODE_READ | FMODE_WRITE;
if (param->flags & DM_READONLY_FLAG)
mode = FMODE_READ;
return mode;
}
it's either WRITE | READ
(normal) or READ
(read only)
The idea of adding FMODE_EXCL alone doesn't work
I've managed to wreck dm-writeboost module by accidentally re-using active caching device by constructing another dm-writeboosted device using the same caching device.
dm-writeboost went into endless loop constantly reading from caching device:
I did not find the way how to interrupt it so I had to resort to emergency reboot.
Please consider introducing safeguard to prevent accidental re-use of caching device.