SpanDB

A Fast, Cost-Effective LSM-tree Based KV Store on Hybrid Storage
GNU General Public License v2.0

Stuck at warmup with 100% CPU utilization #6

Open · seekstar opened this issue 2 years ago

seekstar commented 2 years ago

The configurations are as follows.

Environment

OS: Debian GNU/Linux 11 (bullseye) x86_64 
Host: NUC11PAHi7 M15513-302 
Kernel: 5.10.0-11-amd64 
CPU: 11th Gen Intel i7-1165G7 (8) @ 4.700GHz 
GPU: Intel TigerLake GT2 [Iris Xe Graphics] 
Memory: 31737MiB 

SPDK

I installed SPDK v20.01.2 following SpanDB's guide, and reserved 3 GB of huge pages (HUGEMEM is given in MB):

sudo HUGEMEM=3072 scripts/setup.sh

Output:

0000:01:00.0 (144d a80a): nvme -> vfio-pci

Current user memlock limit: 3967 MB

This is the maximum amount of memory you will be
able to use with DPDK and VFIO if run as current user.
To change this, please adjust limits.conf memlock limit for current user.

Workload

The default workloada is too large, so I created a smaller variant and saved it as workloada_1e5_2e4.spec:

recordcount=100000
operationcount=20000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0

requestdistribution=zipfian

fieldlength=1024

fieldcount=1

readallfields=true
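
For scale: 100000 records × 1 field × 1024 B ≈ 102 MB of raw values, which lines up with the "ingest: 0.10 GB" stat in the load output and the ~102 MB database directory reported by du further below.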

I generated the DB with the following commands:

./test ../workloads/workloada_1e5_2e4.spec 40 /tmp/rocksdb /tmp/rocksdb_log 1 rocksdb /home/searchstar/db_bak/rocksdb_workloada_tmp/
mv /tmp/rocksdb /home/searchstar/db_bak/rocksdb_workloada_1e5_2e4_40

Output:

==================Workload=================
../workloads/workloada_1e5_2e4.spec
recordcount=100000
operationcount=20000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
fieldlength=1024
fieldcount=1
readallfields=true
==========================================
empty the existing data folder
rm: cannot remove '/tmp/rocksdb/*': No such file or directory
empty the existing log folder
dbname: rocksdb
--------------memory usage----------------
               total        used        free      shared  buff/cache   available
Mem:            30Gi        21Gi       9.7Gi       2.0Mi       173Mi       9.6Gi
Swap:          976Mi        13Mi       963Mi
------------------------------------------
==================================================================
-----------configuration------------
Clients: 40
Auto configure: false
Dynamic moving: true
Max read queue: 1
WAL: 0, fsync: 0
Max level: 0
Data: /tmp/rocksdb
WAL: nologging
Max_write_buffer_number: 4
Max_background_jobs: 8
High-priority backgd threds: 2
Low-priority backgd threds: 6
Max_subcompactions: 4
Write_buffer_size: 1.000 GB
-------------------------------------
write done by self: 9151
WAL sync num: 0
WAL write: 0.000 GB
compaction read: 0.000 GB, compaction write: 0.000 GB
flush write: 0.000 GB
flush time: 0.000 s
Load 100000 requests in 0.493 seconds.
==================================================================

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sum      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Uptime(secs): 180.5 total, 180.5 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **

** DB Stats **
Uptime(secs): 180.5 total, 180.5 interval
Cumulative writes: 100K writes, 100K keys, 9151 commit groups, 10.9 writes per commit group, ingest: 0.10 GB, 0.56 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 100K writes, 100K keys, 9151 commit groups, 10.9 writes per commit group, ingest: 100.37 MB, 0.56 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

compaction call: 0
flush call: 0
Low priority thrd 3 call: 0
Low priority thrd 1 call: 0
Low priority thrd 4 call: 0
Low priority thrd 2 call: 0
Low priority thrd 0 call: 0
Low priority thrd 5 call: 0
High priority thrd 1 call: 1
High priority thrd 0 call: 1

TopFS cache size

By default, the YCSB tester requires 90 GB of cache, which exceeds my machine's physical memory, so I modified ycsb/src/test.cc:

options.topfs_cache_size = 1;

Then I ran make to generate a new test binary.
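
For context, a minimal sketch of where that line sits (everything except `options.topfs_cache_size = 1;` is my assumption, not copied from SpanDB; judging from the later "SPDK memory allocation starts (1.00 GB)" line, the value appears to be in GB):

#include "rocksdb/options.h"

// Sketch only: the surrounding code is assumed, not SpanDB's actual test.cc.
void ConfigureCache(rocksdb::Options& options) {
    // Default is 90 (GB), which exceeds my 31 GB of RAM; the
    // "SPDK memory allocation starts (1.00 GB)" line in the run below
    // suggests the unit is GB, so 1 means a 1 GB TopFS cache.
    options.topfs_cache_size = 1;
}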

Stuck at warmup

I ran the test as root:

mkdir /tmp/spandb
ulimit -n 100000
./test ../workloads/workloada_1e5_2e4.spec 8 /tmp/spandb/ 01:00.0 0 spandb /home/searchstar/db_bak/rocksdb_workloada_1e5_2e4_40/

Output:

==================Workload=================
../workloads/workloada_1e5_2e4.spec
recordcount=100000
operationcount=20000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
fieldlength=1024
fieldcount=1
readallfields=true
==========================================
empty the existing data folder
rm: cannot remove '/tmp/spandb//*': No such file or directory
loading database from /home/searchstar/db_bak/rocksdb_workloada_1e5_2e4_40/ to /tmp/spandb/ 
loading finished
dbname: spandb
Starting SPDK v20.01.2 git sha1 b2808069e / DPDK 19.11.2 initialization...
[ DPDK EAL parameters: spdk -c 0x1 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --iova-mode=pa --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk0 --proc-type=auto ]
EAL: No available hugepages reported in hugepages-1048576kB
EAL: VFIO support initialized
nvme.c:1028:spdk_nvme_transport_id_parse: *ERROR*: Unknown transport ID key '01'
EAL:   using IOMMU type 1 (Type 1)
Attaching to 0000:01:00.0
Attached to 0000:01:00.0
num_io_queues: 129
Using controller Samsung SSD 980 PRO  (S5GXNS0RC11989E     ) with 1 namespaces.
Namespace ID: 1 size: 1000GB
Controller IO queue size: 512
total queue: 129, topfs queue: 20
SPDK memory allocation starts (1.00 GB)
...0.0%...18.8%...37.5%...56.2%...75.0%...93.8%...100%
SPDK memory allocation finished.
Multiple page-level cache:32
start migration...
Moved 101.46 MB
RocksDB recovery: 0.00 s
/home/searchstar/git/others/SpanDB/file/delete_scheduler.cc:74: delete /tmp/spandb/000009.log failed
--------------memory usage----------------
               total        used        free      shared  buff/cache   available
Mem:            30Gi       3.3Gi        27Gi       4.0Mi       115Mi        27Gi
Swap:          976Mi          0B       976Mi
------------------------------------------
loading workload (26001)...
10%...20%...30%...40%...50%...60%...70%...80%...90%...
loading workload finished in 0.018 s
Start warmup (6000)...
warmup start: 2022/02/14-13:14:27.444617
client coreid: 5
client coreid: 6
client coreid: 7
client coreid: 4
client coreid: 3
client coreid: 2
client coreid: 1
client coreid: 0

Then it gets stuck here, with 100% CPU usage.

What's strange is that the database directory seems to be deleted:

$ du -sh /tmp/spandb/
4.4M    /tmp/spandb/

For comparison:

$ du -sh /home/searchstar/db_bak/rocksdb_workloada_1e5_2e4_40/
102M    /home/searchstar/db_bak/rocksdb_workloada_1e5_2e4_40/

And there is also an error message in the test output above:

/home/searchstar/git/others/SpanDB/file/delete_scheduler.cc:74: delete /tmp/spandb/000009.log failed

I don't know whether it matters.

cighao commented 2 years ago

Hi hi,

Is that fixed now? Could you try using gdb or pstack to find where it gets stuck?


"What's strange is that the database directory seems to be deleted:" This is because the data is moved to the raw NVMe SSD with SPDK, and removed from the native file system. From the output, you can find "start migration... Moved 101.46 MB", which means it's moving data from native FS to the raw NVMe SSD with SPDK.


"/home/searchstar/git/others/SpanDB/file/delete_scheduler.cc:74: delete /tmp/spandb/000009.log failed" This error could be ignored temporarily.

F280319492 commented 2 years ago

I'm also troubled by this problem. My initial guess is that it is caused by RequestScheduler::WorkerThread not processing batch_write_queue, but I don't understand how the queue weights are calculated.

F280319492 commented 2 years ago

> I'm also troubled by this problem. My initial guess is that it is caused by RequestScheduler::WorkerThread not processing batch_write_queue, but I don't understand how the queue weights are calculated.

I also encountered the same problem with the read_queue during my testing.
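
As an illustration of the "queue weights" idea (all names here are my assumptions, not SpanDB's actual implementation), a worker thread might drain each request queue in proportion to a weight, so that batch_write_queue and read_queue both make progress:

// Hypothetical weighted polling loop; names are assumptions only.
#include <atomic>
#include <deque>
#include <mutex>
#include <vector>

struct Request { /* ... payload ... */ };

struct WeightedQueue {
    std::deque<Request> q;
    std::mutex mu;
    int weight;  // relative share of the worker's attention per pass
};

// One pass over all queues: pop up to `weight` requests from each, so a
// higher-weight queue (e.g. batch_write_queue) is served more often. If
// a queue's effective weight were zero, its requests would never be
// processed and clients blocking on them would spin, which is one way a
// hang like this could arise.
void WorkerThreadLoop(std::vector<WeightedQueue*>& queues,
                      std::atomic<bool>& stop) {
    while (!stop.load()) {
        for (WeightedQueue* wq : queues) {
            std::lock_guard<std::mutex> lk(wq->mu);
            for (int i = 0; i < wq->weight && !wq->q.empty(); ++i) {
                Request r = wq->q.front();
                wq->q.pop_front();
                (void)r;  // ... process r: apply the write or serve the read ...
            }
        }
    }
}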

CCrainys commented 2 years ago

Hi,

  1. Can you use gdb or pstack to print the stack when your program hangs? The information you have provided so far is limited.
  2. From seekstar's output, I noticed an nvme error. Do you encounter the same problem? If so, I suggest checking the nvme and SPDK initialization process.

seekstar commented 2 years ago

Sorry for my late response. In case I missed something, I have pasted the full output here:

https://gist.github.com/seekstar/2b7171b6904c22a24799261498583982

seekstar commented 2 years ago

I have no idea about this error:

nvme.c:1028:spdk_nvme_transport_id_parse: *ERROR*: Unknown transport ID key '01'

Here is the output of lspci:

01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/980PRO
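
If it helps, my reading (an assumption, not confirmed against SpanDB) is that spdk_nvme_transport_id_parse expects whitespace-separated key:value pairs such as trtype:PCIe traddr:0000:01:00.0, so a bare PCI address like the 01:00.0 passed on the command line would be tokenized as a key '01'. A minimal sketch against the SPDK API (build and link against SPDK; error handling abbreviated):

#include <cstdio>
#include "spdk/nvme.h"

int main() {
    struct spdk_nvme_transport_id trid = {};
    // Key:value pairs; passing a bare "01:00.0" here would trigger
    // "Unknown transport ID key '01'".
    const char* str = "trtype:PCIe traddr:0000:01:00.0";
    if (spdk_nvme_transport_id_parse(&trid, str) != 0) {
        fprintf(stderr, "failed to parse transport ID\n");
        return 1;
    }
    printf("parsed traddr: %s\n", trid.traddr);
    return 0;
}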