akiradeveloper / dm-writeboost

Log-structured Caching for Linux
GNU General Public License v2.0

Very slow once cache completely full #239

Open mocksoul opened 1 year ago

mocksoul commented 1 year ago

In my setup:

  1. one HDD ("slow") that can write ~150 MB/s directly
  2. another HDD ("fast") that can read 250+ MB/s

I have an SSD cache (50 GiB) via dmwb in between (writeback_threshold 100, nr_max_batched_writeback 32).

During a large file copy from the fast HDD to the slow one with dmwb:

  1. it sits around 150 MB/s writing to the slow HDD while the cache fills up (due to the much faster "source" HDD)
  2. once the cache fills completely, the speed drops to around 35-40 MB/s

If I temporarily stop the copy process, all data in the cache is written back at the full 150 MB/s. If the cache fills up again (after the copy is resumed), it once again drops to ~1/3 of the HDD bandwidth.

I also tried setting writeback_threshold to 0. It fills the cache without any writes to the slow HDD (this is expected, I guess). But once the cache is filled completely, it slowly writes to the "slow" HDD at ~1/3 (45 MiB/s) of its max speed.

So I guess something very suboptimal happens when the cache is completely saturated with dirty data. Or maybe I'm doing something wrong? :)

This is the only bad case I have found so far. Great library!!

p.s. Suspending/resuming the wb dm device rapidly (100 times/sec) so the cache never fills beyond 95% works: the transfer rate sits around 120-130 MB/s in that case for me :-D
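In case it helps anyone, a rough C sketch of that suspend/resume loop using libdevmapper. The device name "wbdev" and the 10 ms period are just placeholders, and the "stay below 95% dirty" check is only stubbed out, since it would need to parse the writeboost status line:

/* Rough sketch only: cycle suspend/resume on a dm device via libdevmapper.
 * Build with: gcc -o wb-cycler wb-cycler.c -ldevmapper
 * "wbdev" and the 10 ms period are placeholders; a real version would first
 * read the writeboost status line and only suspend while dirtiness is high. */
#include <libdevmapper.h>
#include <unistd.h>

static int dm_simple_task(int task_type, const char *name)
{
    struct dm_task *dmt = dm_task_create(task_type);
    int r = 0;

    if (!dmt)
        return 0;
    if (dm_task_set_name(dmt, name))
        r = dm_task_run(dmt);
    dm_task_destroy(dmt);
    return r;
}

int main(void)
{
    for (;;) {
        /* TODO: check the dirty ratio here and skip the cycle below ~95% */
        dm_simple_task(DM_DEVICE_SUSPEND, "wbdev");
        dm_simple_task(DM_DEVICE_RESUME, "wbdev");
        usleep(10000); /* ~100 cycles per second */
    }
    return 0;
}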

mocksoul commented 1 year ago

Follow up:

Despite always setting writeback_threshold to 100 and nr_max_batched_writeback to 32, I see the following on the slow disk (iostat):

  1. 85..99% util during reads
  2. 70..72% util during sequential writes while the cache is filling
  3. 25..35% util when the cache is filled up (this is the case described in the ticket itself)

Point 1 looks fine.

About point 2: initially I thought this was due to the SSD itself being at 100% util, which is why simultaneous reads from it during writeback were quite slow. But later I saw the writeback speed doesn't change at all if I stop filling the cache until it is completely written back to the slow HDD. So I guess this is a code optimisation issue.

btw, the filesystem used in all "tests" is BTRFS with the dup metadata and single data profiles.

akiradeveloper commented 1 year ago

@mocksoul

You are doing a sequential copy from HDD1 to dmwb and you see a very slow write to HDD2 when the cache is saturated.

graph TD
  subgraph  dmwb
  HDD2(HDD2 150MB/s)
  SSD
  SSD --> HDD2
  end
  HDD1(HDD1 250MB/s) --> SSD

This should be because of https://github.com/akiradeveloper/dm-writeboost/commit/e3c98a6dca6395bbec5f0781d05972cd9b4e03ab.

You can see that the number of segments in one writeback is not constant but is adaptively changed based on the situation:

static u32 calc_nr_writeback(struct wb_device *wb)
{
    u32 nr_writeback_candidates =
        atomic64_read(&wb->last_flushed_segment_id)
        - atomic64_read(&wb->last_writeback_segment_id);

    u32 nr_max_batch = read_once(wb->nr_max_batched_writeback);
    if (wb->nr_writeback_segs != nr_max_batch)
        try_alloc_writeback_ios(wb, nr_max_batch, GFP_NOIO | __GFP_NOWARN);

    return min3(nr_writeback_candidates, wb->nr_writeback_segs, wb->nr_empty_segs + 1);
}
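To illustrate with rough numbers (not output from the driver, just plugging values into the min3 above): with nr_max_batched_writeback = 32 and plenty of dirty segments queued, a saturated cache (nr_empty_segs near 0) collapses the batch to a single segment per writeback round, while a cache with headroom keeps the full batch of 32.

/* Standalone illustration of the min3() throttle above; the numbers are
 * made up, only the formula is taken from calc_nr_writeback(). */
#include <stdio.h>

static unsigned int min3u(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a < b ? a : b;
    return m < c ? m : c;
}

int main(void)
{
    unsigned int nr_writeback_candidates = 100; /* many dirty segments queued */
    unsigned int nr_writeback_segs = 32;        /* nr_max_batched_writeback */

    /* Saturated cache: almost no empty segments left */
    printf("batch when saturated: %u\n",
           min3u(nr_writeback_candidates, nr_writeback_segs, 0 + 1));  /* -> 1 */

    /* Cache with headroom: empty segments are plentiful */
    printf("batch with headroom:  %u\n",
           min3u(nr_writeback_candidates, nr_writeback_segs, 64 + 1)); /* -> 32 */

    return 0;
}

So once the cache is full, each writeback round moves only one segment instead of a sorted batch of 32, which lines up with the throughput drop you observe on HDD2.
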
mocksoul commented 1 year ago

> You are doing a sequential copy from HDD1 to dmwb and you see a very slow write to HDD2 when the cache is saturated.

yep, this is what I see.

> This should be because of https://github.com/akiradeveloper/dm-writeboost/commit/e3c98a6dca6395bbec5f0781d05972cd9b4e03ab.

you want me to try dmwb without that commit?

> but is adaptively changed based on the situation.

I'm not sure I understood correctly - you mean this is expected behaviour? And writes to HDD2 should stagger in this case?

akiradeveloper commented 1 year ago

> you want me to try dmwb without that commit?

No.

> I'm not sure I understood correctly - you mean this is expected behaviour? And writes to HDD2 should stagger in this case?

Yes. It is an expected behavior.

The use case of writeboost is not sequential writes, which are rather artificial. The intention of the commit is to throttle the writeback when there is not enough space in the cache device, because the cache device should allocate a new empty segment as soon as possible under such saturation. If we didn't have this throttling, writes to the SSD would have to wait until all 32 segments are written back, which may cause upper-layer timeouts.

For me, 1/3 of the max throughput in such a worst case sounds good enough.

mocksoul commented 1 year ago

I see. Perhaps that could at least be made tunable. For my scenario, "the write to the SSD will wait" is completely fine, because it only happens when we write much faster than HDD2 can handle and saturate the cache completely.

My scenario is frequent small random writes plus occasional big, almost-sequential writes. Random writes do not fill the cache completely and are sped up a lot; DMWB shines here. But when big sequential writes happen, it is unusable for me in its current form.

Upper-layer timeouts are completely fine, because they are tunable in the Linux VFS.

Thanks for pointing out that piece of code, I'll try to hack on it and post results here ;).
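Just to show the kind of hack I have in mind (purely a sketch, not tested: min_batched_writeback is an invented field that does not exist in dm-writeboost; everything else mirrors the calc_nr_writeback() quoted above):

static u32 calc_nr_writeback_with_floor(struct wb_device *wb)
{
    u32 nr_writeback_candidates =
        atomic64_read(&wb->last_flushed_segment_id)
        - atomic64_read(&wb->last_writeback_segment_id);

    u32 nr_max_batch = read_once(wb->nr_max_batched_writeback);
    /* Hypothetical tunable: the smallest batch we are willing to write back */
    u32 nr_min_batch = read_once(wb->min_batched_writeback);
    u32 nr_batch, nr_floor;

    if (wb->nr_writeback_segs != nr_max_batch)
        try_alloc_writeback_ios(wb, nr_max_batch, GFP_NOIO | __GFP_NOWARN);

    /* Same adaptive throttle as upstream... */
    nr_batch = min3(nr_writeback_candidates, wb->nr_writeback_segs, wb->nr_empty_segs + 1);

    /* ...but never drop below the floor (still bounded by the dirty segments
     * queued and by the writeback ios actually allocated). */
    nr_floor = min3(nr_min_batch, wb->nr_writeback_segs, nr_writeback_candidates);
    return max(nr_batch, nr_floor);
}

With min_batched_writeback set to 0 this behaves exactly like the current code, so it could default to today's behaviour.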