earlephilhower / ezfio

Simple NVME/SAS/SATA SSD test framework for Linux and Windows
GNU General Public License v2.0

error in test of many nodes and nvme #47

Closed JosephNew closed 4 years ago

JosephNew commented 4 years ago

```
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=512  ERROR  ERROR  ERROR
ERROR DETECTED, ABORTING TEST RUN.
```

earlephilhower commented 4 years ago

I'm sorry, but I don't quite understand your issue.

I changed ezfio at my last employer, and we used cluster mode extensively (it was an NVMeoF array company and we were testing the performance of tens of clients) without incident. Make sure you have started the fio server on the remote nodes, or ezfio won't connect (it can't start them by itself).
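For reference, cluster mode drives fio's own network server on each remote, so those servers must be running before ezfio starts. A minimal sketch using standard fio flags (the hostnames and PID path are placeholders; 8765 is fio's default server port):

```shell
# On each remote node (node31, node32, node33): start fio in server mode.
# --daemonize backgrounds the server and writes its PID to the given file.
fio --server --daemonize=/tmp/fio-server.pid

# From the controlling host, confirm each server is reachable before
# launching ezfio (8765 is fio's default --server port):
for n in node31 node32 node33; do
    nc -z "$n" 8765 && echo "$n: fio server up" || echo "$n: NOT reachable"
done
```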

Also, your last message says your SSDs don't support 512-byte accesses. The script checks the Linux /sys filesystem for the reported minimum IO size, and if the test block size is smaller than that, it skips the test. So I'd check that your product reports the proper minimum IO size and not the default 512b.
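You can read the reported geometry straight from sysfs to confirm what the kernel sees. A small sketch (the `minimum_io_size` and `logical_block_size` queue attributes are standard sysfs entries; exactly which one ezfio consults is an assumption here):

```python
def min_io_size(dev="nvme0n1"):
    """Return the minimum I/O size (bytes) the kernel reports for a block
    device via sysfs, or None if the entry is missing/unreadable. This
    mirrors the kind of check ezfio performs before running small block
    sizes; ezfio's exact logic may differ."""
    path = "/sys/block/{}/queue/minimum_io_size".format(dev)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    for dev in ("nvme0n1", "nvme1n1"):
        print(dev, min_io_size(dev))
```

If this reports 4096 while the run tries BS=512, the 512-byte pass will be skipped (or error out) for that device.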

JosephNew commented 4 years ago

This is the passing command:

```
python ezfio.py --cluster -d node31:/dev/nvme2n1,node32:/dev/nvme3n1,node33:/dev/nvme2n1
```

This also passes on a single client:

```
python ezfio.py -d node31:/dev/nvme2n1,/dev/nvme3n1,/dev/nvme4n1
```

This is the failing command:

JosephNew commented 4 years ago

ezFio test parameters:

```
           Drive: node31:/dev/nvme1n1,node32:/dev/nvme1n1,node33:/dev/nvme0n1
           Model: SUZAKU
          Serial: DIR0103000
   AvailCapacity: 1024 GiB
  TestedCapacity: 1024 GiB
    TestedOffset: 0 GiB
             CPU: Intel Xeon CPU E5-2650 v2 @ 2.60GHz
           Cores: 16
       Frequency: 2600
     FIO Version: fio-3.20-38-g14060-dirty
```

```
Test Description                                                        BW(MB/s)       IOPS   Lat(us)

---Sequential Preconditioning---
Sequential Preconditioning Pass 1                                           DONE       DONE      DONE
Sequential Preconditioning Pass 2                                           DONE       DONE      DONE

---Sustained Multi-Threaded Sequential Read Tests by Block Size---
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=512       72.40    148,268    5153.2
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=1024     200.06    204,860    3746.7
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=2048     558.85    286,130    2680.1
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=4096   1,502.53    384,647    1922.3
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=8192   3,107.85    397,805    1902.8
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=16384  5,882.61    376,487    1885.9
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=32768  6,203.94    198,526    3868.6
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=65536  6,229.46     99,671    6438.2
Sustained Multi-Threaded Sequential Read Tests by Block Size, BS=131072 6,246.78     49,974   12840.8

---Sustained Multi-Threaded Random Read Tests by Block Size---
Sustained Multi-Threaded Random Read Tests by Block Size, BS=512          950.00  1,945,607     433.4
Sustained Multi-Threaded Random Read Tests by Block Size, BS=1024       2,032.06  2,080,834     371.0
Sustained Multi-Threaded Random Read Tests by Block Size, BS=2048       4,029.06  2,062,880     373.4
Sustained Multi-Threaded Random Read Tests by Block Size, BS=4096       5,981.07  1,531,155     498.4
Sustained Multi-Threaded Random Read Tests by Block Size, BS=8192       6,109.33    781,994     991.7
Sustained Multi-Threaded Random Read Tests by Block Size, BS=16384      6,190.74    396,208    1941.7
Sustained Multi-Threaded Random Read Tests by Block Size, BS=32768      6,226.39    199,245    3863.3
Sustained Multi-Threaded Random Read Tests by Block Size, BS=65536      00:01:18
```


- I'll do the multi-disk and multi-node test later, and update the results
JosephNew commented 4 years ago

STDERR:

```
Exception in thread Thread-35:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "ezfio.py", line 1120, in JobWrapper
    val = o['cmdline']
  File "ezfio.py", line 974
    ... o['runtime'])})
  File "ezfio.py", line 862, in RunTest
    raise FIOError(" ".join(cmdline), code, err, out)
FIOError
```

```
Sustained 4KB Random Read Tests by Number of Threads, Threads=8  ERROR  ERROR  ERROR
ERROR DETECTED, ABORTING TEST RUN.
```

earlephilhower commented 4 years ago

FIO crashed for some reason. If the drive dropped offline during the run, or there was a network/HW error, it can do that. Check the FIO logs to get the exact message from FIO.

JosephNew commented 4 years ago

> FIO crashed for some reason. If the drive dropped offline during the run, or there was a network/HW error, it can do that. Check the FIO logs to get the exact message from FIO.

earlephilhower commented 4 years ago

> python3 ezfio.py --cluster -d node31:/dev/nvme2n1,/dev/nvme3n1,node32:/dev/nvme2n1,/dev/nvme3n1,node33:/dev/nvme2n1,/dev/nvme3n1

You need node names before each devnode: node31:/dev/a,node31:/dev/b,node32:/dev/a,node32:/dev/b...
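To illustrate why the bare `/dev/...` entries misbehave, here is a minimal sketch of a cluster drive-list validator (a hypothetical helper, not ezfio's actual parser) that rejects any entry missing its `host:` prefix instead of running it on the wrong node:

```python
def normalize_cluster_drives(spec):
    """Split a --cluster -d drive list into (host, device) pairs.
    Hypothetical illustration of the required syntax: every
    comma-separated entry must carry its own 'host:' prefix."""
    drives = []
    for entry in spec.split(","):
        if ":" not in entry:
            raise ValueError("missing hostname before device: %r" % entry)
        host, dev = entry.split(":", 1)
        drives.append((host, dev))
    return drives

if __name__ == "__main__":
    # Correct form: one host prefix per device.
    print(normalize_cluster_drives("node31:/dev/a,node31:/dev/b,node32:/dev/a"))
    # The failing form raises, because "/dev/b" names no host.
    try:
        normalize_cluster_drives("node31:/dev/a,/dev/b")
    except ValueError as e:
        print("rejected:", e)
```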

JosephNew commented 4 years ago

> python3 ezfio.py --cluster -d node31:/dev/nvme2n1,/dev/nvme3n1,node32:/dev/nvme2n1,/dev/nvme3n1,node33:/dev/nvme2n1,/dev/nvme3n1
>
> You need node names before each devnode: node31:/dev/a,node31:/dev/b,node32:/dev/a,node32:/dev/b...

```bash
Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00     1.00     0.00       4.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1           0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme1n1           0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme2n1           0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme3n1           0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme4n1           0.00     0.00    0.00 22259.00     0.00 2849152.00   256.00    62.11    2.79    0.00    2.79   0.04 100.00
nvme5n1           0.00     0.00    0.00     0.00     0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
```

- For another comparison, `python3 ezfio.py --drive /dev/nvme1n1,/dev/nvme2n1` can push I/O to all the disks
- But only on one node:
```bash
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.76    0.00    4.71    0.00    0.00   91.53

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme1n1           0.00     0.00    0.00 11552.00     0.00 1478656.00   256.00    63.54    5.49    0.00    5.49   0.09 100.00
nvme2n1           0.00     0.00    0.00 11652.00     0.00 1491456.00   256.00    63.25    5.43    0.00    5.43   0.09 100.00
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme4n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme5n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme6n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme7n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme8n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme9n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme10n1          0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme11n1          0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme12n1          0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
```
JosephNew commented 4 years ago

> FIO crashed for some reason. If the drive dropped offline during the run, or there was a network/HW error, it can do that. Check the FIO logs to get the exact message from FIO.

```diff
-#define MAX_FILELOCKS	1024
+#define MAX_FILELOCKS	8192
 
 static struct filelock_data {
 	struct flist_head list;
diff --git a/fio.h b/fio.h
index 8045c32..4ad19ba 100644
--- a/fio.h
+++ b/fio.h
@@ -556,7 +556,7 @@ static inline void fio_ro_check(const struct thre
 	       !(io_u->ddir == DDIR_TRIM && !td_trim(td)));
 }
 
-#define REAL_MAX_JOBS	4096
+#define REAL_MAX_JOBS	8192
 
 static inline bool should_fsync(struct thread_data *td)
 {
diff --git a/os/os.h b/os/os.h
index 9a280e5..e31b30c 100644
--- a/os/os.h
+++ b/os/os.h
@@ -173,7 +173,7 @@ extern int fio_cpus_split(os_cpu_mask_t *mask, un
 #endif
 
 #ifndef FIO_MAX_JOBS
-#define FIO_MAX_JOBS	4096
+#define FIO_MAX_JOBS	8192
 #endif
 
 #ifndef CONFIG_SOCKLEN_T
```

earlephilhower commented 4 years ago

Hey, that's very good debugging! Maybe you can make a Pull Request to the FIO repository with the change? I also once found an issue there (not a bug, just a default too small for large storage systems) and submitted a PR that was quickly accepted by the author.