facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org

trying fillrandom on db_paths vector, but only 1st nvme is used #11432

Open gaowayne opened 1 year ago

gaowayne commented 1 year ago

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

Expected behavior

If I configure two NVMe SSD devices with db_paths, RocksDB should start using the 2nd NVMe SSD after the 1st one is full.

Actual behavior

Currently, RocksDB reports a "no space left" error when the 1st NVMe SSD is full.

[root@phobos rocksdb]# ./db_bench --num=60000000 --db=/mnt/nvme7n1p4/test1  --histogram=1 --key_size=4096 --value_size=8192 --compression_type=none --benchmarks="fillrandom,stats" --statistics --stats_per_interval=1 --stats_interval_seconds=240  --threads=1 --target_file_size_multiplier=10 --write_buffer_size=134217728  --use_existing_db=0 --disable_wal=false --cache_size=536870912 --bloom_bits=10 --bloom_locality=1 --compaction_style=0 --universal_max_size_amplification_percent=500 --max_write_buffer_number=16 --max_background_flushes=16  --level0_file_num_compaction_trigger=32 --level0_slowdown_writes_trigger=160 --level0_stop_writes_trigger=288 --soft_pending_compaction_bytes_limit=549755813888   --hard_pending_compaction_bytes_limit=1099511627776 --max_background_jobs=4 --max_background_compactions=4 --subcompactions=20
Set seed to 1683484533806993 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
wgao: DB::Open: /mnt/nvme7n1p4/test1 !
wgao: DB::Open KDBPath done!: /mnt/nvme3n1/test1 !
RocksDB:    version 8.2.0
Date:       Mon May  8 02:35:36 2023
CPU:        96 * Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz
CPUCache:   36864 KB
Keys:       4096 bytes each (+ 0 bytes user-defined timestamp)
Values:     8192 bytes each (4096 bytes after compression)
Entries:    60000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    703125.0 MB (estimated)
FileSize:   468750.0 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
wgao: DB::Open: /mnt/nvme7n1p4/test1 !
wgao: DB::Open KDBPath done!: /mnt/nvme3n1/test1 !
DB path: [/mnt/nvme7n1p4/test1]
put error: IO error: No space left on device: While appending to file: /mnt/nvme3n1/test1/062488.log: No space left on device
[root@phobos rocksdb]# 

Steps to reproduce the behavior

  1. Prepare two NVMe SSDs, create one small partition on the 1st NVMe SSD, then configure db_paths as shown below:

    
    } else {
    
            //
            //wayne test L0 wal on SLC case
            //
            //
            //fprintf(stdout, "wgao: DB::Open: %s !\n", db_name.c_str());
            fprintf(stdout, "wgao: DB::Open: %s !\n", db_name.c_str());
            std::string kDBPath = "/mnt/nvme3n1/test1";
            std::string kDBPath_1 = "/mnt/nvme7n1p4/test1";
            uint64_t uiL0size =  1024 * 1024 * 1024;
            uiL0size = 512 * uiL0size; //512G
            uint64_t uiLsize =  1024 * 1024 * 1024;
            uiLsize = 3*1024 * 1024 * uiLsize; // 3T
            options.db_paths.push_back({ kDBPath, (uint64_t)uiL0size });  // <-- 512G
            options.db_paths.push_back({ kDBPath_1, uiLsize });           // <-- 3T
            //options.db_paths.push_back({ kDBPath_2, 0 });     
            s = DB::Open(options, kDBPath, &db->db);
            fprintf(stdout, "wgao: DB::Open KDBPath done!: %s !\n", kDBPath.c_str());
            //below is old code wgao
            //s = DB::Open(options, db_name, &db->db);
    }
    if (FLAGS_report_open_timing) {
      std::cout << "OpenDb:     "
                << (FLAGS_env->NowNanos() - open_start) / 1000000.0
                << " milliseconds\n";
    }

2. Then run the db_bench command below:

[root@phobos rocksdb]# ./db_bench --num=60000000 --db=/mnt/nvme7n1p4/test1 --histogram=1 --key_size=4096 --value_size=8192 --compression_type=none --benchmarks="fillrandom,stats" --statistics --stats_per_interval=1 --stats_interval_seconds=240 --threads=1 --target_file_size_multiplier=10 --write_buffer_size=134217728 --use_existing_db=0 --disable_wal=false --cache_size=536870912 --bloom_bits=10 --bloom_locality=1 --compaction_style=0 --universal_max_size_amplification_percent=500 --max_write_buffer_number=16 --max_background_flushes=16 --level0_file_num_compaction_trigger=32 --level0_slowdown_writes_trigger=160 --level0_stop_writes_trigger=288 --soft_pending_compaction_bytes_limit=549755813888 --hard_pending_compaction_bytes_limit=1099511627776 --max_background_jobs=4 --max_background_compactions=4 --subcompactions=20


3. You will see the "no space left on device" error shown above, but the 2nd NVMe device has nothing written to it.
mdcallag commented 1 year ago

From your example above, uiLsize isn't 3T, it is 1024*3T

            uint64_t uiL0size =  1024 * 1024 * 1024;
            uiL0size = 512 * uiL0size; //512G
            uint64_t uiLsize =  1024 * 1024 * 1024;
            uiLsize = 3*1024 * 1024 * uiLsize; // 3T
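
For reference, here is a standalone sketch of that arithmetic (illustrative only, not part of db_bench; the constants just mirror the snippet above):

    // Illustrative only: compares the value computed by the snippet above with
    // the intended 3 TiB.
    #include <cstdint>
    #include <cstdio>

    int main() {
      const uint64_t kGiB = 1024ULL * 1024 * 1024;
      uint64_t as_written = 3ULL * 1024 * 1024 * kGiB;  // 3 * 1024 * 1024 GiB = 1024 * 3 TiB (3 PiB)
      uint64_t intended   = 3ULL * 1024 * kGiB;         // 3 * 1024 GiB = 3 TiB
      std::printf("as written: %llu bytes\n", static_cast<unsigned long long>(as_written));
      std::printf("intended:   %llu bytes\n", static_cast<unsigned long long>(intended));
      return 0;
    }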

I can't reproduce this after building from main.

rm -rf p1; rm -rf p2; mkdir p1; mkdir p2; ./db_bench --benchmarks=fillrandom,stats --num=1000000 --value_size=1000 --compression_type=none ; du -hs p1; du -hs p2 ; du -hs /tmp/rocksdbtest-2260

Output is:

628M    p1
193M    p2
40M     /tmp/rocksdbtest-2260
diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index 19ca1b4c0..2ac4c7842 100644
--- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4844,6 +4844,13 @@ class Benchmark {
             FLAGS_secondary_update_interval, db));
       }
     } else {
+      std::cout << "Open here\n";
+      std::string p1 = "./p1";
+      std::string p2 = "./p2";
+      uint64_t s1 =  1024ULL * 1024 * 1024 * 1;
+      uint64_t s2 =  1024ULL * 1024 * 1024 * 500;
+      options.db_paths.push_back({ p1, s1 });
+      options.db_paths.push_back({ p2, s2 });
       s = DB::Open(options, db_name, &db->db);
     }
     if (FLAGS_report_open_timing) {
gaowayne commented 1 year ago


@mdcallag buddy, could you please try putting p1 and p2 on two NVMe SSDs? Or, if you do not have two, you can create two partitions. Also, I see many unit tests use small sizes; can you try with my sizes? I am guessing RocksDB does not arrange the levels correctly after the size reaches the 3T level.

Also, do you mean the latest code already fixes this issue? Here is the last git log entry of the code I am testing with.

commit 760b773f58277f9ce449389c0773a1eee2d14363 (HEAD -> main, origin/main, origin/HEAD)
Author: Andrew Kryczka <andrewkr@fb.com>
Date:   Mon Apr 10 13:59:44 2023 -0700

    fix optimization-disabled test builds with platform010 (#11361)

    Summary:
    Fixed the following failure:
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc: In function ‘bool testing::internal::StackGrowsDown()’:
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8681:24: error: ‘dummy’ may be used uninitialized [-Werror=maybe-uninitialized]
 8681 |   StackLowerThanAddress(&dummy, &result);
      |   ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8671:13: note: by argument 1 of type ‘const void*’ to ‘void testing::internal::StackLowerThanAddress(const void*, bool*)’ declared here
 8671 | static void StackLowerThanAddress(const void* ptr, bool* result) {
      |             ^~~~~~~~~~~~~~~~~~~~~
third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:8679:7: note: ‘dummy’ declared here
 8679 |   int dummy;
      |       ^~~~~

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11361

Reviewed By: cbi42

Differential Revision: D44838033

Pulled By: ajkr

fbshipit-source-id: 27d68b5a24a15723bbaaa7de45ccd70a60fe259e
mdcallag commented 1 year ago

My tests used the following commit. Assuming there is a bug, I doubt it has been fixed between the versions we used.

commit a5909f88641a1222865839e62c91e43e6ee36c03 (HEAD -> main, origin/main, origin/HEAD)
Author: Peter Dillinger <peterd@fb.com>
Date:   Thu May 4 12:41:28 2023 -0700

    Clarify io_activity (#11427)

I don't have a spare server with more than 1 SSD and I am not willing to create partitions on the single-SSD servers. What happens with your setup/test if you use two directories on the same partition?

mdcallag commented 1 year ago

Also, what happens if you fix the math so that uiLsize is really 3T? The current code shouldn't overflow on uint64_t, but the value is much larger than 3T. I am suggesting that you remove one of the multiply-by-1024 terms: `uiLsize = 3*1024 * 1024 * uiLsize; // 3T` -> `uiLsize = 3*1024 * uiLsize; // 3T`

            uint64_t uiL0size =  1024 * 1024 * 1024;
            uiL0size = 512 * uiL0size; //512G
            uint64_t uiLsize =  1024 * 1024 * 1024;
            uiLsize = 3*1024 * 1024 * uiLsize; // 3T
            options.db_paths.push_back({ kDBPath, (uint64_t)uiL0size });  // <-- 512G
            options.db_paths.push_back({ kDBPath_1, uiLsize });           // <-- 3T
gaowayne commented 1 year ago

> My tests used the following commit. Assuming there is a bug, I doubt it has been fixed between the versions we used.
>
> commit a5909f88641a1222865839e62c91e43e6ee36c03 (HEAD -> main, origin/main, origin/HEAD)
> Author: Peter Dillinger <peterd@fb.com>
> Date:   Thu May 4 12:41:28 2023 -0700
>
>     Clarify io_activity (#11427)
>
> I don't have a spare server with more than 1 SSD and I am not willing to create partitions on the single-SSD servers. What happens with your setup/test if you use two directories on the same partition?

I will take the AR to verify this on my side and update you. :) I will also verify the two-directories-on-the-same-partition case and give an update. :)

gaowayne commented 1 year ago

> Also, what happens if you fix the math so that uiLsize is really 3T? The current code shouldn't overflow on uint64_t, but the value is much larger than 3T. I am suggesting that you remove one of the multiply-by-1024 terms: `uiLsize = 3*1024 * 1024 * uiLsize; // 3T` -> `uiLsize = 3*1024 * uiLsize; // 3T`
>
>             uint64_t uiL0size =  1024 * 1024 * 1024;
>             uiL0size = 512 * uiL0size; //512G
>             uint64_t uiLsize =  1024 * 1024 * 1024;
>             uiLsize = 3*1024 * 1024 * uiLsize; // 3T
>             options.db_paths.push_back({ kDBPath, (uint64_t)uiL0size });  // <-- 512G
>             options.db_paths.push_back({ kDBPath_1, uiLsize });           // <-- 3T

Ah, yes, good catch, my poor math; that is far more than 3T. :) Let me try this first and I will give an update.

gaowayne commented 1 year ago

@mdcallag buddy you are right, this is my bad math. After I corrected the 3T issue, both NVMe SSDs now show bandwidth in iostat and the 2nd NVMe is used correctly. But the bandwidth is very low: with one NVMe SSD we can reach 1200 MB/s, but with two, each drive only gets 350 MB/s.

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme7n1p4        0.00      0.00     0.00   0.00    0.00     0.00 3149.00    357.74     0.00   0.00   10.27   116.33    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   32.34  11.44
nvme3n1       5507.40    343.86     0.00   0.00    0.06    63.93 3187.60    358.08     0.00   0.00    9.51   115.03    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   30.67  51.60
nvme6n1          0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
gaowayne commented 1 year ago

I will download the latest code to double-check this bandwidth issue.

gaowayne commented 1 year ago

./db_bench --num=60000000 --db=/mnt/nvme7n1p4/test1 --histogram=1 --key_size=4096 --value_size=8192 --compression_type=none --benchmarks="fillrandom,stats" --statistics --stats_per_interval=1 --stats_interval_seconds=240 --threads=1 --target_file_size_multiplier=10 --write_buffer_size=134217728 --use_existing_db=0 --disable_wal=false --cache_size=536870912 --bloom_bits=10 --bloom_locality=1 --compaction_style=0 --universal_max_size_amplification_percent=500 --max_write_buffer_number=16 --max_background_flushes=16 --level0_file_num_compaction_trigger=32 --level0_slowdown_writes_trigger=160 --level0_stop_writes_trigger=288 --soft_pending_compaction_bytes_limit=549755813888 --hard_pending_compaction_bytes_limit=1099511627776 --max_background_jobs=4 --max_background_compactions=4 --subcompactions=20

Sorry man, stay tuned. I found I need to test fillrandom to reproduce this; fillseq always works fine.

gaowayne commented 1 year ago

@mdcallag hello man, I confirmed: even after fixing the 3T math problem and using the latest code, the 2nd drive is still not used when I run fillrandom!

ajkr commented 1 year ago

> If I configure two NVMe SSD devices with db_paths, RocksDB should start using the 2nd NVMe SSD after the 1st one is full.

That is not quite the expected behavior. How much space does "/mnt/nvme3n1/test1/" consume when you get an out-of-space error? It looks like the expected size of the DB (FileSize: 468750.0 MB (estimated)) is a bit less than the size limit you are giving to "/mnt/nvme3n1/test1" (512GB), in which case we would expect the DB to reside in that one directory.
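
For illustration, here is a minimal sketch of a two-path setup like the one described (paths are placeholders, and the comments paraphrase my reading of the db_paths documentation in include/rocksdb/options.h, so treat the placement description as approximate):

    // Minimal sketch, not taken from this issue's db_bench hack. Files are
    // placed in the earliest path whose target size still has room, based on
    // best-effort size estimation, so a DB whose total size stays under the
    // first target is expected to live entirely in the first path.
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.db_paths.push_back({"/mnt/fast_ssd/db", 512ULL << 30});  // 512 GiB target
      options.db_paths.push_back({"/mnt/big_ssd/db", 3ULL << 40});     // 3 TiB target

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/mnt/fast_ssd/db", &db);
      if (s.ok()) {
        delete db;
      }
      return 0;
    }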

gaowayne commented 1 year ago


Hello, here is the actual consumed size on nvme3n1:

/dev/nvme3n1                    745G  475G  271G  64% /mnt/nvme3n1
/dev/nvme7n1p4                  3.5T   25G  3.5T   1% /mnt/nvme7n1p4
[root@phobos mnt]# cd nvme3n1
[root@phobos nvme3n1]# ls
test1
[root@phobos nvme3n1]# du test1
491708712   test1
[root@phobos nvme3n1]# du test1 -h
469G    test1
[root@phobos nvme3n1]# 

I am pretty sure that with the same db_bench command line I can ingest at least 2T of data if I do not hack db_paths. I also feel the size estimate is not correct given the two db_paths I configured.

ajkr commented 1 year ago
> du test1 -h
> 469G    test1

This is still below the configured target (512GB) though. So RocksDB has respected the config, at least up until this point.

What does it look like when you run out of space? It looks like you have 271GB space available, so this doesn't look like the problematic scenario described earlier.

gaowayne commented 1 year ago

@ajkr I tested again, this time increasing the number of key-value pairs:

./db_bench --num=120000000 --db=/mnt/nvme7n1p4/test1  --histogram=1 --key_size=4096 --value_size=8192 --compression_type=none --benchmarks="fillrandom,stats" --statistics --stats_per_interval=1 --stats_interval_seconds=60  --threads=1 --target_file_size_multiplier=10 --write_buffer_size=134217728  --use_existing_db=0 --disable_wal=false --cache_size=536870912 --bloom_bits=10 --bloom_locality=1 --compaction_style=0 --universal_max_size_amplification_percent=500 --max_write_buffer_number=16 --max_background_flushes=16  --level0_file_num_compaction_trigger=32 --level0_slowdown_writes_trigger=160 --level0_stop_writes_trigger=288 --soft_pending_compaction_bytes_limit=549755813888   --hard_pending_compaction_bytes_limit=1099511627776 --max_background_jobs=4 --max_background_compactions=4 --subcompactions=20

It is now above the 512G I specified in the code.

[root@phobos nvme3n1]# du -h test1/
688G    test1/
[root@phobos nvme3n1]# ls
test1
[root@phobos nvme3n1]# cd /mnt/nvme7n1p4
[root@phobos nvme7n1p4]# ls
test1
[root@phobos nvme7n1p4]# cd test1
[root@phobos test1]# ls
[root@phobos test1]# du -h test1/
du: cannot access 'test1/': No such file or directory
[root@phobos test1]# cd ..
[root@phobos nvme7n1p4]# du -h test1/
0   test1/
[root@phobos nvme7n1p4]# 

I still see it report no space on nvme3n1 while nvme7n1 is empty. Why can't it automatically start using the 2nd NVMe drive I specified for the random workload? The sequential workload works fine.

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   383.20 MB   1.5    104.0     0.0    104.0     667.7    563.6       0.0   1.2     78.1    501.0   1364.65            610.06      4659    0.293   9055K   2811       0.0       0.0
  L1     37/0    2.57 GB  10.3   1018.7   563.3    455.5    1009.6    554.2       0.0   1.8   3084.4   3056.8    338.22           1442.45        52    6.504     88M   795K       0.0       0.0
  L2    141/0   26.73 GB  10.7   1351.5   498.4    853.1    1333.4    480.3      53.2   2.7    490.5    484.0   2821.24           1686.42      5419    0.521    117M  1578K       0.0       0.0
  L3    161/4   279.64 GB  10.5   1370.4   500.6    869.8    1310.0    440.2       6.1   2.6    372.0    355.6   3772.55           1758.80       351   10.748    119M  5259K       0.0       0.0
  L4     78/0   166.71 GB   0.7    143.3   143.3      0.0     143.3    143.3      23.4   1.0    421.0    421.0    348.63            189.05        27   12.912     12M      0       0.0       0.0
 Sum    420/4   476.02 GB   0.0   3988.0  1705.7   2282.4    4464.1   2181.7      82.7   7.9    472.4    528.8   8645.30           5686.78     10508    0.823    347M  7636K       0.0       0.0
 Int      0/0    0.00 KB   0.0    425.3   221.6    203.7     431.9    228.2      25.7  28.9    457.9    465.0    951.14            572.52       473    2.011     37M   725K       0.0       0.0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0   3988.0  1705.7   2282.4    3900.4   1618.0       0.0   0.0    548.6    536.5   7444.23           5193.13      5989    1.243    347M  7636K       0.0       0.0
High      0/0    0.00 KB   0.0      0.0     0.0      0.0     563.7    563.7       0.0   0.0      0.0    480.6   1201.06            493.65      4519    0.266       0      0       0.0       0.0

Blob file count: 0, total size: 0.0 GB, garbage size: 0.0 GB, space amp: 0.0

Uptime(secs): 2036.4 total, 233.3 interval
Flush(GB): cumulative 563.667, interval 14.968
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 4464.07 GB write, 2244.80 MB/s write, 3988.05 GB read, 2005.43 MB/s read, 8645.3 seconds
Interval compaction: 431.92 GB write, 1895.52 MB/s write, 425.28 GB read, 1866.38 MB/s read, 951.1 seconds
Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 277, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 1298, l0-file-count-limit-stops: 0, memtable-limit-delays: 10, memtable-limit-stops: 0, pending-compaction-bytes-delays: 3353, pending-compaction-bytes-stops: 0, total-delays: 4661, total-stops: 0, interval: 459 total count
Block cache LRUCache@0x29b3a40#925622 capacity: 512.00 MB usage: 85.48 KB table_size: 8192 occupancy: 8 collections: 4 last_copies: 0 last_secs: 0.0003 secs_since: 233

** Level 1 read latency histogram (micros):
Count: 87444911 Average: 2.7148  StdDev: 20.32
Min: 0  Median: 1.8997  Max: 30068
Percentiles: P50: 1.90 P75: 2.47 P99: 3.82 P99.9: 5.75 P99.99: 6.64
------------------------------------------------------
[       0,       1 ]   918170   1.050%   1.050% 
(       1,       2 ] 47575913  54.407%  55.457% ###########
(       2,       3 ] 36373723  41.596%  97.053% ########
(       3,       4 ]  2079374   2.378%  99.431% 
(       4,       6 ]   469800   0.537%  99.968% 
(       6,      10 ]   120701   0.138% 100.106% 
(      10,      15 ]     7193   0.008% 100.114% 
(      15,      22 ]     7498   0.009% 100.123% 
(      22,      34 ]    11944   0.014% 100.137% 
(      34,      51 ]     5981   0.007% 100.143% 
(      51,      76 ]     5388   0.006% 100.150% 
(      76,     110 ]     2889   0.003% 100.153% 
(     110,     170 ]     3336   0.004% 100.157% 
(     170,     250 ]     2878   0.003% 100.160% 
(     250,     380 ]     3887   0.004% 100.164% 
(     380,     580 ]     5375   0.006% 100.171% 
(     580,     870 ]     6682   0.008% 100.178% 
(     870,    1300 ]     4750   0.005% 100.184% 
(    1300,    1900 ]     1552   0.002% 100.185% 
(    1900,    2900 ]      446   0.001% 100.186% 
(    2900,    4400 ]      130   0.000% 100.186% 
(    4400,    6600 ]       50   0.000% 100.186% 
(    6600,    9900 ]       16   0.000% 100.186% 
(    9900,   14000 ]       20   0.000% 100.186% 
(   14000,   22000 ]       32   0.000% 100.186% 
(   22000,   33000 ]        5   0.000% 100.186% 

** Level 2 read latency histogram (micros):
Count: 113597193 Average: 8.4339  StdDev: 111.07
Min: 0  Median: 1.8722  Max: 415425
Percentiles: P50: 1.87 P75: 2.52 P99: 111.74 P99.9: 735.87 P99.99: 872.61
------------------------------------------------------
[       0,       1 ]  7279171   6.408%   6.408% #
(       1,       2 ] 56775984  49.980%  56.388% ##########
(       2,       3 ] 40709097  35.836%  92.224% #######
(       3,       4 ]  3596922   3.166%  95.391% #
(       4,       6 ]  1089665   0.959%  96.350% 
(       6,      10 ]   256388   0.226%  96.576% 
(      10,      15 ]    31991   0.028%  96.604% 
(      15,      22 ]   612694   0.539%  97.143% 
(      22,      34 ]  1019097   0.897%  98.040% 
(      34,      51 ]   341445   0.301%  98.341% 
(      51,      76 ]   335042   0.295%  98.636% 
(      76,     110 ]   403352   0.355%  98.991% 
(     110,     170 ]   358376   0.315%  99.306% 
(     170,     250 ]   214442   0.189%  99.495% 
(     250,     380 ]   156578   0.138%  99.633% 
(     380,     580 ]   185371   0.163%  99.796% 
(     580,     870 ]   219503  
(       4,       6 ]   918044   1.037%  91.307% 
(       6,      10 ]   147145   0.166%  91.473% 
(      10,      15 ]    18311   0.021%  91.494% 
(      15,      22 ]  1019198   1.151%  92.645% 
(      22,      34 ]  1850674   2.090%  94.736% 
(      34,      51 ]   551989   0.624%  95.359% 
(      51,      76 ]   592078   0.669%  96.028% 
(      76,     110 ]   913079   1.031%  97.059% 
(     110,     170 ]   852107   0.963%  98.022% 
(     170,     250 ]   438942   0.496%  98.518% 
(     250,     380 ]   291378   0.329%  98.847% 
(     380,     580 ]   363353   0.410%  99.257% 
(     580,     870 ]   429572   0.485%  99.742% 
(     870,    1300 ]   236889   0.268% 100.010% 
(    1300,    1900 ]    55465   0.063% 100.073% 
(    1900,    2900 ]     9113   0.010% 100.083% 
(    2900,    4400 ]     1826   0.002% 100.085% 
(    4400,    6600 ]      866   0.001% 100.086% 
(    6600,    9900 ]      708   0.001% 100.087% 
(    9900,   14000 ]      823   0.001% 100.088% 
(   14000,   22000 ]     1923   0.002% 100.090% 
(   22000,   33000 ]      124   0.000% 100.090% 
(   33000,   50000 ]        6   0.000% 100.090% 
(   50000,   75000 ]       38   0.000% 100.090% 
(   75000,  110000 ]       27   0.000% 100.090% 
(  110000,  170000 ]       22   0.000% 100.090% 
(  170000,  250000 ]        1   0.000% 100.090% 

** Level 4 read latency histogram (micros):
Count: 135 Average: 488.1185  StdDev: 1009.79
Min: 0  Median: 4.3571  Max: 6165
Percentiles: P50: 4.36 P75: 158.16 P99: 3875.00 P99.9: 6165.00 P99.99: 6165.00
------------------------------------------------------
[       0,       1 ]       47  34.815%  34.815% #######
(       1,       2 ]        7   5.185%  40.000% #
(       2,       3 ]        1   0.741%  40.741% 
(       3,       4 ]       10   7.407%  48.148% #
(       4,       6 ]       14  10.370%  58.519% ##
(       6,      10 ]        1   0.741%  59.259% 
(      10,      15 ]        1   0.741%  60.000% 
(      34,      51 ]        1   0.741%  60.741% 
(      51,      76 ]        4   2.963%  63.704% #
(     110,     170 ]       19  14.074%  77.778% ###
(   50000,   75000 ]       57   0.000% 100.065% 
(   75000,  110000 ]       54   0.000% 100.065% 
(  110000,  170000 ]       34   0.000% 100.065% 
(  170000,  250000 ]        5   0.000% 100.065% 
(  250000,  380000 ]        1   0.000% 100.065% 

** Level 4 read latency histogram (micros):
Count: 66253371 Average: 19.2299  StdDev: 205.26
Min: 0  Median: 1.4686  Max: 287704
Percentiles: P50: 1.47 P75: 1.87 P99: 530.47 P99.9: 1221.65 P99.99: 1718.20
------------------------------------------------------
[       0,       1 ] 13621811  20.560%  20.560% ####
(       1,       2 ] 41625941  62.828%  83.389% #############
(       2,       3 ]  3053053   4.608%  87.997% #
(       3,       4 ]   818512   1.235%  89.232% 
(       4,       6 ]   796751   1.203%  90.435% 
(       6,      10 ]   100794   0.152%  90.587% 
(      10,      15 ]    12957   0.020%  90.606% 
(      15,      22 ]   915991   1.383%  91.989% 
(      22,      34 ]  1423245   2.148%  94.137% 
(      34,      51 ]   440064   0.664%  94.801% 
(      51,      76 ]   547668   0.827%  95.628% 
(      76,     110 ]   740871   1.118%  96.746% 
(     110,     170 ]   591639   0.893%  97.639% 
(     170,     250 ]   370863   0.560%  98.199% 
(     250,     380 ]   290045   0.438%  98.637% 
(     380,     580 ]   319832   0.483%  99.120% 
(     580,     870 ]   363842   0.549%  99.669% 
(     870,    1300 ]   187383   0.283%  99.952% 
(    1300,    1900 ]    36562   0.055% 100.007% 
(    1900,    2900 ]     4626   0.007% 100.014% 
(    2900,    4400 ]     1041   0.002% 100.015% 
(    4400,    6600 ]      441   0.001% 100.016% 
(    6600,    9900 ]      391   0.001% 100.017% 
(    9900,   14000 ]      695   0.001% 100.018% 
(   14000,   22000 ]     1733   0.003% 100.020% 
(   22000,   33000 ]      252   0.000% 100.021% 
(   33000,   50000 ]       54   0.000% 100.021% 
(   50000,   75000 ]       43   0.000% 100.021% 
(   75000,  110000 ]       16   0.000% 100.021% 
(  110000,  170000 ]       17   0.000% 100.021% 
(  170000,  250000 ]        7   0.000% 100.021% 
(  250000,  380000 ]        5   0.000% 100.021% 

** DB Stats **
Uptime(secs): 4466.6 total, 60.0 interval
Cumulative writes: 76M writes, 76M keys, 76M commit groups, 1.0 writes per commit group, ingest: 877.86 GB, 201.25 MB/s
Cumulative WAL: 76M writes, 0 syncs, 76603000.00 writes per sync, written: 877.86 GB, 201.25 MB/s
Cumulative stall: 01:00:55.317 H:M:S, 81.8 percent
Interval writes: 629K writes, 629K keys, 629K commit groups, 1.0 writes per commit group, ingest: 7381.29 MB, 123.01 MB/s
Interval WAL: 629K writes, 0 syncs, 629000.00 writes per sync, written: 7.21 GB, 123.01 MB/s
Interval stall: 00:00:52.258 H:M:S, 87.1 percent
Write Stall (count): write-buffer-manager-limit-stops: 0, num-running-compactions: 4
num-running-flushes: 2

put error: IO error: No space left on device: While appending to file: /mnt/nvme3n1/test1/062914.log: No space left on device
ajkr commented 1 year ago

Sorry I have not had time to look more. Have you checked whether the feature is completely broken for you? For example, if you configure options.db_paths[0] to be very small, say 1GB, will fillrandom make use of both drives?
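
Concretely, a sketch of that experiment expressed against the db_bench hack from the issue description (it reuses `options`, `kDBPath`, `kDBPath_1`, `s`, and `db` from that snippet and only changes the target sizes):

    // Sketch only: shrink the first path's target to 1 GiB so nearly all SST
    // files should spill over to the second path.
    options.db_paths.push_back({ kDBPath, 1ULL << 30 });    // 1 GiB target
    options.db_paths.push_back({ kDBPath_1, 3ULL << 40 });  // 3 TiB target
    s = DB::Open(options, kDBPath, &db->db);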

gaowayne commented 1 year ago

> Sorry I have not had time to look more. Have you checked whether the feature is completely broken for you? For example, if you configure options.db_paths[0] to be very small, say 1GB, will fillrandom make use of both drives?

thank you so much, I will try this and update you. :)

gaowayne commented 1 year ago

@ajkr I tried making db_paths[0] 1GB, and it is interesting: the SLC NVMe device and the QLC NVMe device both reach about 1100 MB/s of bandwidth. The SST/log files shown below are all written to the SLC drive, and the other SST files are saved on the QLC NVMe SSD. This matches my expectation. But why does it not work when db_paths[0] has a bigger size? Snapshot of the files on the SLC drive:

[root@phobos test1]# ls -l
total 2236468
-rw-r--r-- 1 root root 133949409 May 26 01:43 028243.sst
-rw-r--r-- 1 root root 133949409 May 26 01:43 028246.sst
-rw-r--r-- 1 root root 133949408 May 26 01:43 028249.sst
-rw-r--r-- 1 root root 133949408 May 26 01:43 028252.sst
-rw-r--r-- 1 root root 133949408 May 26 01:43 028255.sst
-rw-r--r-- 1 root root 133949409 May 26 01:43 028258.sst
-rw-r--r-- 1 root root 133949409 May 26 01:43 028261.sst
-rw-r--r-- 1 root root 133724568 May 26 01:43 028263.log
-rw-r--r-- 1 root root 133949408 May 26 01:43 028264.sst
-rw-r--r-- 1 root root 133724568 May 26 01:43 028266.log
-rw-r--r-- 1 root root 133949409 May 26 01:43 028267.sst
-rw-r--r-- 1 root root  34579481 May 26 01:43 028269.log
-rw-r--r-- 1 root root  58614640 May 26 01:43 028270.sst
-rw-r--r-- 1 root root        16 May 26 01:25 CURRENT
-rw-r--r-- 1 root root        36 May 26 01:25 IDENTITY
-rw-r--r-- 1 root root         0 May 26 01:25 LOCK
-rw-r--r-- 1 root root  97012782 May 26 01:43 LOG
-rw-r--r-- 1 root root     36449 May 26 01:25 LOG.old.1685035525334064
-rw-r--r-- 1 root root 438525404 May 26 01:43 MANIFEST-000009
-rw-r--r-- 1 root root      7090 May 26 01:25 OPTIONS-000007
-rw-r--r-- 1 root root      7111 May 26 01:25 OPTIONS-000011
[root@phobos test1]# 

iostat output (nvme3n1 is the SLC drive, nvme7n1 is the QLC drive):

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util

nvme7n1          0.00      0.00     0.00   0.00    0.00     0.00 8851.20   1022.08     0.00   0.00   10.28   118.25    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   91.00  30.94

nvme7n1p4        0.00      0.00     0.00   0.00    0.00     0.00 8851.20   1022.08     0.00   0.00   10.28   118.25    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   91.00  30.94

nvme3n1          0.00      0.00     0.00   0.00    0.00     0.00 8905.00   1022.68     0.00   0.00    9.72   117.60    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   86.52  30.30
ajkr commented 1 year ago

I see. Then we probably just do not adhere strictly enough to the configured limits. We should take a look and see if we can improve it for users who set limits close to their available space. I don't think it's something we'll get to in the near-term, but if you are interested, please feel free to see if there's any way to improve the db_paths/cf_paths heuristics.