akiradeveloper / dm-writeboost

Log-structured Caching for Linux
GNU General Public License v2.0
120 stars 19 forks source link

Details about wait_for_writeback() #171

Closed mraaman closed 7 years ago

mraaman commented 7 years ago

This is not an issue. I am looking for information on the internals of dm-writeboost.

When a new segment is created (acquire_new_seg()) wait_for_writeback() is called and it will write a single segment (or more precisely until wb->last_writeback_segment_id >= id) to the backing device. How come this is done? Why not just write it to cache and then later flush it to the backing device?

I want to create a power saving function to dm-writeboost that will allow the backing device to be in power save mode and use the cache of incoming writes. Then periodically or threshold triggered write back data to the backing device. wait_for_writeback() kills this since it will wake up the backing device every time a new segment is created.

akiradeveloper commented 7 years ago

@mraaman

I am afraid you have terribly misunderstood the algorithm.

The code you are pointing out is

static void __acquire_new_seg(struct wb_device *wb, u64 id)
{
    struct segment_header *new_seg = get_segment_header_by_id(wb, id);

    /*
     * We wait for all requests to the new segment is consumed.
     * Mutex taken guarantees that no new I/O to this segment is coming in.
     */
    wait_event(wb->inflight_ios_wq,
        !atomic_read(&new_seg->nr_inflight_ios));

    wait_for_writeback(wb, SUB_ID(id, wb->nr_segments));

and let's look at the SUB_ID

#define SUB_ID(x, y) ((x) > (y) ? (x) - (y) : 0)

this is basically a subtraction so the condition to wait is wb->last_writeback_segment_id >= (id - nr_segments) which waits for the segment is written back and then freed for the new overwrite.

This wait only powers iff no writeback has happened for the entire one-cycle of logging on the caching device so it's exceptional.

Why not just write it to cache and then later flush it to the backing device?

writeboost is just doing so. It has a background thread to write back from caching device to backing device the dirty segments.

akiradeveloper commented 7 years ago

@mraaman

I want to create a power saving function to dm-writeboost that will allow the backing device to be in power save mode and use the cache of incoming writes. Then periodically or threshold triggered write back data to the backing device.

I don't really see what you are saying.

Writeboost does have writeback_threshold which is even tunable runtime. Isn't this what you want?

mraaman commented 7 years ago

If I for example copy a large file and I am using nr_cur_batched_writeback 32 1/32 of the segments are written to the backing device immediately. This is because of the new segment code above. I want 0 blocks written to the backing device (and all of it to the caching device). Then at a later point I want to write all of the 32 segments (and all of dirty blocks) to the backing device.

Regarding the threshold writeback_threshold is a threshold for io utilization. I have implemented a threshold for the storage. So basically I could for example set it to 100G which means 100G is written to the cache before anything is written to the backing device. Then once 100G is crossed all of it is written to the backing device. That way the backing device can idle until there is alot of work instead of spinning up for small changes.

akiradeveloper commented 7 years ago

@mraaman

If I for example copy a large file and I am using nr_cur_batched_writeback 32 1/32 of the segments are written to the backing device immediately.

No. nr_cur_batched_writeback isn't such a knob. How big is your caching device? How large is are you trying to write? This example lacks essential information.

If caching device is very large, say 128GB, and the file is 32GB. Writeboost doesn't write back anything unless you specifies writeback_threshold. When you set writeback_threshold 0, writeback is disabled until there is no empty segment at all.

akiradeveloper commented 7 years ago

@mraaman

100G is written to the cache before anything is written to the backing device.

So your idea is a threshold by "cache utilization"? writeback_threshold only sees the io utilization of the backing device so if it enabled (set some value larger than 0), there is no way to defer writeback until caching device is filled with certain amount of dirty segments.

mraaman commented 7 years ago

The cache is 5G and the file is 1G. There are 0 dirty blocks in the cache when copy starts. Roughly 1/32 of the blocks are written to the backing disk in 500K writes. The remaining data is kept in cache until the writeback_threshold is > 0 and the disk is idle enough at which point all remaining dirty data is written to the backing device as expected. My question is why this initial write of 1/32 is made to the backing device?

What do you mean when you say "This wait only powers iff no writeback has happened for the entire one-cycle of logging on the caching device so it's exceptional."

akiradeveloper commented 7 years ago

@mraaman

Sorry, I am asking you "what is your expectation".

You should share all information to discuss.

What is the parameter you set for dm-writeboost? I can't even see this.

akiradeveloper commented 7 years ago

@mraaman

wbstatus in

https://github.com/akiradeveloper/dm-writeboost-tools

may help you understand what is happening under writeboost and to support your idea.

akiradeveloper commented 7 years ago

Since the way writeboost works is deterministic and there are tools to analyze, the word "Roughly" is nonsense to talk about technical stuffs.

mraaman commented 7 years ago

My expectation is that 100% of the file is written to cache and 0% written to the backing device. Now 100% is written to cache and 3% is written to the backing device. In my current setup all parameters are the default ones. Then I use dmsetup message to change the writeback_threshold so that 100% is written to the backing device. That part works as expected.

akiradeveloper commented 7 years ago

@mraaman

In my current setup all parameters are the default ones. Then I use dmsetup message to change the writeback_threshold so that 100% is written to the backing device.

What is the setting while you are copying the file?

akiradeveloper commented 7 years ago

@mraaman

Let's talk on evidence

  1. create wb device
  2. lsblk
  3. dmsetup table
  4. dmsetup status | wbstatus
  5. copy the file
  6. dmsetup status | wbstatus

all output should be shared.

mraaman commented 7 years ago

So this is the init, just a loopback device for testing:

dd if=/dev/zero of=/data/ssd bs=4M count=1250
5242880000 /data/ssd
losetup /dev/loop0 /data/ssd
sz=`blockdev --getsize /dev/sdc1` (1953523120)
dmsetup create wbdev --table "0 $sz writeboost /dev/sdc1 /dev/loop0"

lsblk

NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0       7:0    0   4.9G  0 loop 
└─wbdev   254:0    0 931.5G  0 dm   
sdc         8:32   0 931.5G  0 disk 
└─sdc1      8:33   0 931.5G  0 part 
  └─wbdev 254:0    0 931.5G  0 dm

dmsetup table

wbdev: 0 1953523120 writeboost  8:33 7:0

dmsetup status | wbstatus

cursor pos              = writeboost
# of cache blocks       = 0
# of segments           = 1269746
current id              = 9998
last flushed id         = 1
last writeback id       = 0
# of dirty cache blocks = 0
# of partial flushes    = 0
write? hit? on_buffer? fullsize?
0      0    0          0         0
0      0    0          1         0
0      0    1          0         266
0      0    1          1         0
0      1    0          0         0
0      1    0          1         0
0      1    1          0         0
0      1    1          1         0
1      0    0          0         0
1      0    0          1         0
1      0    1          0         0
1      0    1          1         0
1      1    0          0         0
1      1    0          1         0
1      1    1          0         0
1      1    1          1         0

Now I mounted filesytem (ext3) did a few ls and rm old files

cursor pos              = writeboost
# of cache blocks       = 2032
# of segments           = 1269746
current id              = 9998
last flushed id         = 17
last writeback id       = 16
# of dirty cache blocks = 2
# of partial flushes    = 0
write? hit? on_buffer? fullsize?
0      0    0          0         9
0      0    0          1         1
0      0    1          0         1468
0      0    1          1         0
0      1    0          0         0
0      1    0          1         0
0      1    1          0         0
0      1    1          1         0
1      0    0          0         0
1      0    0          1         0
1      0    1          0         13
1      0    1          1         0
1      1    0          0         0
1      1    0          1         0
1      1    1          0         3
1      1    1          1         0

Now I copied the file 1597412817 bytes

cursor pos              = writeboost
# of cache blocks       = 393446
# of segments           = 1269746
current id              = 9998
last flushed id         = 3099
last writeback id       = 3098
# of dirty cache blocks = 133
# of partial flushes    = 0
write? hit? on_buffer? fullsize?
0      0    0          0         376211
0      0    0          1         1
0      0    1          0         1468
0      0    1          1         0
0      1    0          0         0
0      1    0          1         0
0      1    1          0         0
0      1    1          1         0
1      0    0          0         0
1      0    0          1         0
1      0    1          0         390481
1      0    1          1         0
1      1    0          0         0
1      1    0          1         0
1      1    1          0         7
1      1    1          1         0

During the copy I got 112 writes to 8,32 (sdc). Most of them where 524288 bytes, a few were 4096 bytes. So this means about 50Mb were written to the backing device (~3%).

akiradeveloper commented 7 years ago

@mraaman oh wbstatus is broken (I will fix this). but I think you enabled writeback. did you send dmsetup message to wb device?

I want the dmsetup status before piping it out to wbstatus too

akiradeveloper commented 7 years ago

@mraaman

I see.

void wait_for_writeback(struct wb_device *wb, u64 id)
{
    wb->urge_writeback = true;
    wake_up_process(wb->writeback_daemon);
    wait_event(wb->writeback_wait_queue,
        atomic64_read(&wb->last_writeback_segment_id) >= id);
    smp_rmb();
    wb->urge_writeback = false;
}

this code first set true to urge_writeback and then wakes the writeback daemon up. If it wakes up soon then should_writeback returns true then it starts a single writeback. It's a rare case and isn't critical for consistency.

If so this is a timing bug which should be fixed in 2.2.7.

@mraaman but still I need to make sure that you didn't set up writeback at all

akiradeveloper commented 7 years ago

I see. As for the wbstatus it's not a bug.

dmsetup status wbdev | wbstatus instead

(I had the same mistake in my youtube video https://www.youtube.com/watch?v=0IkTA8KJ-hU what a shame)

mraaman commented 7 years ago

Ok, I'll rerun the test above and paste the raw output. I did not send any message to wb device.

Should I also check out master branch and give it a go? Now I run 2.2.6 release

akiradeveloper commented 7 years ago

@mraaman ok with 2.2.6

mraaman commented 7 years ago

Same setup as before and I copied the same file.

Before copy

0 1953523120 writeboost 2163 798576 6288 18 17 0 46 1 682 0 0 0 0 0 0 0 46 0 0 0 6 0 0 8 10 writeback_threshold 0 nr_cur_batched_writeback 0 sync_data_interval 0 update_sb_record_interval 0 read_cache_threshold 0

wbstatus

cursor pos              = 2163
# of cache blocks       = 798576
# of segments           = 6288
current id              = 18
last flushed id         = 17
last writeback id       = 0
# of dirty cache blocks = 46
# of partial flushes    = 8
write? hit? on_buffer? fullsize?
0      0    0          0         1
0      0    0          1         682
0      0    1          0         0
0      0    1          1         0
0      1    0          0         0
0      1    0          1         0
0      1    1          0         0
0      1    1          1         0
1      0    0          0         0
1      0    0          1         46
1      0    1          0         0
1      0    1          1         0
1      1    0          0         0
1      1    0          1         6
1      1    1          0         0
1      1    1          1         0

After copy

0 1953523120 writeboost 392938 798576 6288 3095 3094 96 380370 1 682 0 0 0 0 0 0 0 390445 0 0 0 6 0 0 10 10 writeback_threshold 0 nr_cur_batched_writeback 32 sync_data_interval 0 

wbstatus

update_sb_record_interval 0 read_cache_threshold 0
cursor pos              = 392938
# of cache blocks       = 798576
# of segments           = 6288
current id              = 3095
last flushed id         = 3094
last writeback id       = 96
# of dirty cache blocks = 380370
# of partial flushes    = 10
write? hit? on_buffer? fullsize?
0      0    0          0         1
0      0    0          1         682
0      0    1          0         0
0      0    1          1         0
0      1    0          0         0
0      1    0          1         0
0      1    1          0         0
0      1    1          1         0
1      0    0          0         0
1      0    0          1         390445
1      0    1          0         0
1      0    1          1         0
1      1    0          0         0
1      1    0          1         6
1      1    1          0         0
1      1    1          1         0

This time 84 writes, as before majority 524288 bytes

akiradeveloper commented 7 years ago

@mraaman Thanks for the rerun and reporting. With this evidence I concluded this is a bug.

I want to create a power saving function to dm-writeboost that will allow the backing device to be in power save mode and use the cache of incoming writes.

If you want to continue discussing about the idea, please make an another issue because this issue is going to be closed when I fix the bug.

akiradeveloper commented 7 years ago

@mraaman Could you please test with fix-171 branch?

mraaman commented 7 years ago

My cache-file was smaller this time (3146Mb) but still big enough to hold the data for the file.

Before copy

0 1953523120 writeboost 1653 798576 6288 14 13 0 10 1 630 0 0 0 0 0 0 0 10 0 0 0 4 0 0 6 10 writeback_threshold 0 nr_cur_batched_writeback 0 sync_data_interval 0 update_sb_record_interval 0 read_cache_threshold 0

wbstatus

cursor pos              = 1653
# of cache blocks       = 798576
# of segments           = 6288
current id              = 14
last flushed id         = 13
last writeback id       = 0
# of dirty cache blocks = 10
# of partial flushes    = 6
write? hit? on_buffer? fullsize?
0      0    0          0         1
0      0    0          1         630
0      0    1          0         0
0      0    1          1         0
0      1    0          0         0
0      1    0          1         0
0      1    1          0         0
0      1    1          1         0
1      0    0          0         0
1      0    0          1         10
1      0    1          0         0
1      0    1          1         0
1      1    0          0         0
1      1    0          1         4
1      1    1          0         0
1      1    1          1         0

After copy

0 1953523120 writeboost 392811 798576 6288 3094 3093 0 390418 1 645 0 0 0 0 0 0 0 390418 0 0 0 4 0 0 10 10 writeback_threshold 0 nr_cur_batched_writeback 0 sync_data_interval 0 update_sb_record_interval 0 read_cache_threshold 0

wbstatus

cursor pos              = 392811
# of cache blocks       = 798576
# of segments           = 6288
current id              = 3094
last flushed id         = 3093
last writeback id       = 0
# of dirty cache blocks = 390418
# of partial flushes    = 10
write? hit? on_buffer? fullsize?
0      0    0          0         1
0      0    0          1         645
0      0    1          0         0
0      0    1          1         0
0      1    0          0         0
0      1    0          1         0
0      1    1          0         0
0      1    1          1         0
1      0    0          0         0
1      0    0          1         390418
1      0    1          0         0
1      0    1          1         0
1      1    0          0         0
1      1    0          1         4
1      1    1          0         0
1      1    1          1         0

This time there were zero writes! Excellent!

However I could also see that the copy process was reading metadata from the backing disk during the copy. Do you know the reason for this? It read 14 times and 4096 bytes each time.

akiradeveloper commented 7 years ago

@mraaman

This time there were zero writes! Excellent!

Great.

However I could also see that the copy process was reading metadata from the backing disk during the copy. Do you know the reason for this? It read 14 times and 4096 bytes each time.

It depends on how you set up the system.

If you first make the filesystem on the backing device and stacks caching then, it may read filesystem metadata in copying. Copying is seen as only writing in the userland but it may trigger some metadata read inside the kernel. (e.g. read-modify-write)

But if you first stacks caching and then make the filesystem over the stacked wbdev then all the filesystem metadata should be on the caching device and then hit.

mraaman commented 7 years ago

Ok, great. I guess we can close this issue then?

akiradeveloper commented 7 years ago

Don't close by yourself. I am going to close this issue when I merge the fix.