Closed mraaman closed 7 years ago
@mraaman
I am afraid you have terribly misunderstood the algorithm.
The code you are pointing at is:

	static void __acquire_new_seg(struct wb_device *wb, u64 id)
	{
		struct segment_header *new_seg = get_segment_header_by_id(wb, id);

		/*
		 * We wait until all requests to the new segment are consumed.
		 * The mutex taken guarantees that no new I/O to this segment is coming in.
		 */
		wait_event(wb->inflight_ios_wq,
			   !atomic_read(&new_seg->nr_inflight_ios));

		wait_for_writeback(wb, SUB_ID(id, wb->nr_segments));
		...
and let's look at SUB_ID:

	#define SUB_ID(x, y) ((x) > (y) ? (x) - (y) : 0)

This is basically a saturating subtraction, so the wait completes once wb->last_writeback_segment_id >= (id - nr_segments); that is, it waits until the segment about to be overwritten has been written back and freed for reuse.
This wait only fires if no writeback has happened during an entire one-cycle pass of logging through the caching device, so it's an exceptional case.
Why not just write it to cache and then later flush it to the backing device?
Writeboost already does exactly that. It has a background thread that writes dirty segments back from the caching device to the backing device.
@mraaman
I want to create a power saving function for dm-writeboost that will allow the backing device to be in power save mode while incoming writes go to the cache. Then, periodically or threshold-triggered, write the data back to the backing device.
I don't really see what you are saying.
Writeboost does have writeback_threshold which is even tunable runtime. Isn't this what you want?
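For reference, the runtime tuning mentioned here is done through dmsetup message, which is also how the threshold is changed later in this thread. A minimal sketch (the device name wbdev and the value 70 are just examples; check the dm-writeboost documentation for your version):

```shell
# Suppress writeback whenever the backing device's io utilization is
# above 70% (illustrative value).
dmsetup message wbdev 0 writeback_threshold 70

# Setting it back to 0 disables threshold-driven writeback again.
dmsetup message wbdev 0 writeback_threshold 0
```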
If I, for example, copy a large file while using nr_cur_batched_writeback 32, then 1/32 of the segments are written to the backing device immediately. This is because of the new-segment code above. I want 0 blocks written to the backing device (and all of it to the caching device). Then at a later point I want to write all 32 segments (and all dirty blocks) to the backing device.
Regarding the threshold: writeback_threshold is a threshold on io utilization. I have implemented a threshold on stored data instead. So I could, for example, set it to 100G, which means 100G is written to the cache before anything is written to the backing device. Once 100G is crossed, all of it is written to the backing device. That way the backing device can idle until there is a lot of work, instead of spinning up for small changes.
@mraaman
If I for example copy a large file and I am using nr_cur_batched_writeback 32 1/32 of the segments are written to the backing device immediately.
No. nr_cur_batched_writeback isn't such a knob. How big is your caching device? How large is the file you are trying to write? This example lacks essential information.
If the caching device is very large, say 128GB, and the file is 32GB, writeboost doesn't write back anything unless you specify writeback_threshold. When you set writeback_threshold to 0, writeback is disabled until there is no empty segment left at all.
@mraaman
100G is written to the cache before anything is written to the backing device.
So your idea is a threshold on "cache utilization"? writeback_threshold only looks at the io utilization of the backing device, so if it is enabled (set to some value larger than 0), there is no way to defer writeback until the caching device is filled with a certain amount of dirty segments.
The cache is 5G and the file is 1G. There are 0 dirty blocks in the cache when the copy starts. Roughly 1/32 of the blocks are written to the backing disk in 500K writes. The remaining data is kept in cache until writeback_threshold is set > 0 and the disk is idle enough, at which point all remaining dirty data is written to the backing device, as expected. My question is: why is this initial 1/32 written to the backing device?
What do you mean when you say "This wait only fires if no writeback has happened during an entire one-cycle pass of logging through the caching device, so it's an exceptional case"?
@mraaman
Sorry, I am asking you: what is your expectation?
You should share all the information needed for this discussion.
What parameters did you set for dm-writeboost? I can't even see that.
@mraaman
The wbstatus tool in
https://github.com/akiradeveloper/dm-writeboost-tools
may help you understand what is happening inside writeboost and support your idea.
Since the way writeboost works is deterministic and there are tools to analyze it, the word "Roughly" has no place in a technical discussion.
My expectation is that 100% of the file is written to cache and 0% written to the backing device. Now 100% is written to cache and 3% is written to the backing device. In my current setup all parameters are the default ones. Then I use dmsetup message to change the writeback_threshold so that 100% is written to the backing device. That part works as expected.
@mraaman
In my current setup all parameters are the default ones. Then I use dmsetup message to change the writeback_threshold so that 100% is written to the backing device.
What is the setting while you are copying the file?
@mraaman
Let's talk based on evidence;
all the output should be shared.
So this is the init, just a loopback device for testing:
dd if=/dev/zero of=/data/ssd bs=4M count=1250
5242880000 /data/ssd
losetup /dev/loop0 /data/ssd
sz=`blockdev --getsize /dev/sdc1` (1953523120)
dmsetup create wbdev --table "0 $sz writeboost /dev/sdc1 /dev/loop0"
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 4.9G 0 loop
└─wbdev 254:0 0 931.5G 0 dm
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─wbdev 254:0 0 931.5G 0 dm
wbdev: 0 1953523120 writeboost 8:33 7:0
cursor pos = writeboost
# of cache blocks = 0
# of segments = 1269746
current id = 9998
last flushed id = 1
last writeback id = 0
# of dirty cache blocks = 0
# of partial flushes = 0
write? hit? on_buffer? fullsize?
0 0 0 0 0
0 0 0 1 0
0 0 1 0 266
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 0
1 0 0 0 0
1 0 0 1 0
1 0 1 0 0
1 0 1 1 0
1 1 0 0 0
1 1 0 1 0
1 1 1 0 0
1 1 1 1 0
Now I mounted a filesystem (ext3), did a few ls, and removed some old files:
cursor pos = writeboost
# of cache blocks = 2032
# of segments = 1269746
current id = 9998
last flushed id = 17
last writeback id = 16
# of dirty cache blocks = 2
# of partial flushes = 0
write? hit? on_buffer? fullsize?
0 0 0 0 9
0 0 0 1 1
0 0 1 0 1468
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 0
1 0 0 0 0
1 0 0 1 0
1 0 1 0 13
1 0 1 1 0
1 1 0 0 0
1 1 0 1 0
1 1 1 0 3
1 1 1 1 0
Now I copied the file (1597412817 bytes):
cursor pos = writeboost
# of cache blocks = 393446
# of segments = 1269746
current id = 9998
last flushed id = 3099
last writeback id = 3098
# of dirty cache blocks = 133
# of partial flushes = 0
write? hit? on_buffer? fullsize?
0 0 0 0 376211
0 0 0 1 1
0 0 1 0 1468
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 0
1 0 0 0 0
1 0 0 1 0
1 0 1 0 390481
1 0 1 1 0
1 1 0 0 0
1 1 0 1 0
1 1 1 0 7
1 1 1 1 0
During the copy I got 112 writes to 8,32 (sdc). Most of them were 524288 bytes; a few were 4096 bytes. So about 50 MB were written to the backing device (~3%).
@mraaman oh, wbstatus is broken (I will fix this), but I think you enabled writeback. Did you send a dmsetup message to the wb device?
I also want to see the raw dmsetup status before it is piped into wbstatus.
@mraaman
I see.
	void wait_for_writeback(struct wb_device *wb, u64 id)
	{
		wb->urge_writeback = true;
		wake_up_process(wb->writeback_daemon);
		wait_event(wb->writeback_wait_queue,
			   atomic64_read(&wb->last_writeback_segment_id) >= id);
		smp_rmb();
		wb->urge_writeback = false;
	}
This code first sets urge_writeback to true and then wakes the writeback daemon up. If the daemon wakes up soon enough, should_writeback returns true and it starts a single writeback, even when there was nothing to wait for. It's a rare case and isn't critical for consistency.
If so, this is a timing bug which should be fixed in 2.2.7.
@mraaman But I still need to make sure that you didn't set up writeback at all.
I see. As for wbstatus, it's not a bug. Use

	dmsetup status wbdev | wbstatus

instead.
(I made the same mistake in my youtube video https://www.youtube.com/watch?v=0IkTA8KJ-hU what a shame.)
Ok, I'll rerun the test above and paste the raw output. I did not send any message to the wb device.
Should I also check out the master branch and give it a go? I'm currently running the 2.2.6 release.
@mraaman OK, 2.2.6 is fine.
Same setup as before and I copied the same file.
0 1953523120 writeboost 2163 798576 6288 18 17 0 46 1 682 0 0 0 0 0 0 0 46 0 0 0 6 0 0 8 10 writeback_threshold 0 nr_cur_batched_writeback 0 sync_data_interval 0 update_sb_record_interval 0 read_cache_threshold 0
wbstatus
cursor pos = 2163
# of cache blocks = 798576
# of segments = 6288
current id = 18
last flushed id = 17
last writeback id = 0
# of dirty cache blocks = 46
# of partial flushes = 8
write? hit? on_buffer? fullsize?
0 0 0 0 1
0 0 0 1 682
0 0 1 0 0
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 0
1 0 0 0 0
1 0 0 1 46
1 0 1 0 0
1 0 1 1 0
1 1 0 0 0
1 1 0 1 6
1 1 1 0 0
1 1 1 1 0
0 1953523120 writeboost 392938 798576 6288 3095 3094 96 380370 1 682 0 0 0 0 0 0 0 390445 0 0 0 6 0 0 10 10 writeback_threshold 0 nr_cur_batched_writeback 32 sync_data_interval 0 update_sb_record_interval 0 read_cache_threshold 0
wbstatus
cursor pos = 392938
# of cache blocks = 798576
# of segments = 6288
current id = 3095
last flushed id = 3094
last writeback id = 96
# of dirty cache blocks = 380370
# of partial flushes = 10
write? hit? on_buffer? fullsize?
0 0 0 0 1
0 0 0 1 682
0 0 1 0 0
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 0
1 0 0 0 0
1 0 0 1 390445
1 0 1 0 0
1 0 1 1 0
1 1 0 0 0
1 1 0 1 6
1 1 1 0 0
1 1 1 1 0
This time there were 84 writes; as before, the majority were 524288 bytes each.
@mraaman Thanks for the rerun and the report. With this evidence I conclude this is a bug.
I want to create a power saving function to dm-writeboost that will allow the backing device to be in power save mode and use the cache of incoming writes.
If you want to continue discussing the idea, please open another issue, because this one is going to be closed when I fix the bug.
@mraaman Could you please test with fix-171 branch?
My cache file was smaller this time (3146 MB) but still big enough to hold the file's data.
0 1953523120 writeboost 1653 798576 6288 14 13 0 10 1 630 0 0 0 0 0 0 0 10 0 0 0 4 0 0 6 10 writeback_threshold 0 nr_cur_batched_writeback 0 sync_data_interval 0 update_sb_record_interval 0 read_cache_threshold 0
wbstatus
cursor pos = 1653
# of cache blocks = 798576
# of segments = 6288
current id = 14
last flushed id = 13
last writeback id = 0
# of dirty cache blocks = 10
# of partial flushes = 6
write? hit? on_buffer? fullsize?
0 0 0 0 1
0 0 0 1 630
0 0 1 0 0
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 0
1 0 0 0 0
1 0 0 1 10
1 0 1 0 0
1 0 1 1 0
1 1 0 0 0
1 1 0 1 4
1 1 1 0 0
1 1 1 1 0
0 1953523120 writeboost 392811 798576 6288 3094 3093 0 390418 1 645 0 0 0 0 0 0 0 390418 0 0 0 4 0 0 10 10 writeback_threshold 0 nr_cur_batched_writeback 0 sync_data_interval 0 update_sb_record_interval 0 read_cache_threshold 0
wbstatus
cursor pos = 392811
# of cache blocks = 798576
# of segments = 6288
current id = 3094
last flushed id = 3093
last writeback id = 0
# of dirty cache blocks = 390418
# of partial flushes = 10
write? hit? on_buffer? fullsize?
0 0 0 0 1
0 0 0 1 645
0 0 1 0 0
0 0 1 1 0
0 1 0 0 0
0 1 0 1 0
0 1 1 0 0
0 1 1 1 0
1 0 0 0 0
1 0 0 1 390418
1 0 1 0 0
1 0 1 1 0
1 1 0 0 0
1 1 0 1 4
1 1 1 0 0
1 1 1 1 0
This time there were zero writes! Excellent!
However, I could also see the copy process reading metadata from the backing disk during the copy. Do you know the reason for this? It read 14 times, 4096 bytes each time.
@mraaman
This time there were zero writes! Excellent!
Great.
However, I could also see the copy process reading metadata from the backing disk during the copy. Do you know the reason for this? It read 14 times, 4096 bytes each time.
It depends on how you set up the system.
If you first make the filesystem on the backing device and then stack the cache on top, the kernel may read filesystem metadata while copying. Copying looks like pure writing from userland, but it can trigger metadata reads inside the kernel (e.g. read-modify-write).
But if you first stack the cache and then make the filesystem over the stacked wbdev, all the filesystem metadata should be on the caching device, so those reads hit the cache.
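The two stacking orders can be sketched as follows, reusing the device names from the session above (illustrative only; do not run against a device holding real data):

```shell
# Order A: mkfs BEFORE stacking. The filesystem metadata already lives
# on the backing device, so the first metadata accesses during a copy
# must be read from there (cache misses).
mkfs.ext3 /dev/sdc1
dmsetup create wbdev --table "0 $(blockdev --getsize /dev/sdc1) writeboost /dev/sdc1 /dev/loop0"

# Order B: mkfs AFTER stacking. The metadata is written through wbdev,
# so it is cached from the start and later metadata reads hit the cache.
dmsetup create wbdev --table "0 $(blockdev --getsize /dev/sdc1) writeboost /dev/sdc1 /dev/loop0"
mkfs.ext3 /dev/mapper/wbdev
```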
Ok, great. I guess we can close this issue then?
Don't close it yourself. I am going to close this issue when I merge the fix.
This is not an issue report; I am looking for information on the internals of dm-writeboost.
When a new segment is acquired (__acquire_new_seg()), wait_for_writeback() is called, and it writes a single segment (or more precisely, waits until wb->last_writeback_segment_id >= id) to the backing device. Why is this done? Why not just write to the cache and flush it to the backing device later?
I want to create a power saving function for dm-writeboost that allows the backing device to stay in power save mode while incoming writes go to the cache, and then, periodically or threshold-triggered, write the data back to the backing device. wait_for_writeback() kills this, since it wakes up the backing device every time a new segment is created.