cloudera / partner-engineering

Cloudera Partner Engineering Tools

Storage Acceptance Test fails even on bare metal #2

Open hadoopch opened 2 years ago

hadoopch commented 2 years ago

When you run your CDP Private Cloud on non-local storage, Cloudera recommends running its Storage Acceptance Test:

Cloudera Enterprise Storage Device Acceptance Criteria Guide

That document links to this GitHub page.

What I have done:

I installed Cloudera CDP Private Cloud on our cloud platform (OpenStack) and ran various microbenchmark tests with local and non-local storage. So far I have run the tests only on the worker nodes.

Even with local storage, some of the tests failed. Finally, I ran the tests on bare-metal servers with local disks; some of the tests failed there as well.

1) Bare metal and cloud with local disks

seqread and seqwrite succeeded; randrw failed:

FAIL[ READ_LATENCY_99>200 WRITE_LATENCY_99>200 READ_LATENCY_MEDIAN>50 WRITE_LATENCY_MEDIAN>50],10.0.1.2, results/randrw_async_32k_worker.json, rand_rw.data.disk1, randrw, /data/disk1/fio, 52.6909, 308.2813, 841.8264, 8975, 280.491663, 212.8609, 876.6095, 2534.0043, 8941, 279.419355
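
For context, the randrw_async_32k test boils down to an fio job roughly like the sketch below; the exact options live in the suite's test config, so the iodepth, file size, and runtime here are illustrative assumptions (only the job name, path, I/O pattern, and block size are taken from the output above):

```sh
# Approximation of the randrw_async_32k job (illustrative values)
fio --name=rand_rw.data.disk1 \
    --directory=/data/disk1/fio \
    --rw=randrw --bs=32k \
    --ioengine=libaio --direct=1 \
    --iodepth=64 \
    --size=4G --runtime=60 --time_based \
    --output-format=json --output=randrw_async_32k_worker.json
```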

2) VM with non-local disks

All latency tests fail:

FAIL[ WRITE_LATENCY_99>200],10.0.10.183, results/randrw_async_32k_worker.json, rand_rw.data1, randrw, /data1/fio, 3.1293, 44.8266, 680.2252, 63828, 1994.643405, 4.5547, 541.0652, 680.8039, 63755, 1992.366768
FAIL[ READ_LATENCY_99>200],10.0.10.183, results/seqread_async_32k_worker.json, seq_reader.data1, read, /data1/fio, 2.8344, 725.6145, 779.9714, 128147, 4004.606621, 0, 0, 0, 0, 0
FAIL[ WRITE_LATENCY_99>200],10.0.10.183, results/seqwrite_async_32k_worker.json, seq_writer.data1, write, /data1/fio, 0, 0, 0, 0, 0, 2.8672, 692.0601, 795.1779, 128241, 4007.548978

Because the tests failed even on bare metal with local disks, I have some doubts about the acceptance tests. Has anyone ever run these tests successfully?

hadoopch commented 2 years ago

Here is some additional info:

OS: CentOS Linux release 7.9.2009
Local HDD: SEAGATE ST2000NX0273 and SEAGATE ST1800MM0159, respectively
fio version: fio-3.7
iperf version: iperf 3.1.7
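
For reproducibility, these values can be collected with something like this (the device list will differ per node):

```sh
cat /etc/redhat-release          # OS release
lsblk -d -o NAME,MODEL,ROTA      # disk models (ROTA=1 means rotational HDD)
fio --version
iperf3 --version
```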

I used the standard thresholds for the test.

gerardovazquez commented 1 year ago

Hi @hadoopch ,

I have successfully run these tests on workers with 7.2k rpm HDDs and on masters with 10k and 15k rpm HDDs. You need to tune the disks; there is an example of how to do this in the HPE Reference Architecture for Cloudera Data Platform, and further info in Performance Tuning on Linux — Disk I/O.
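
The knobs those guides touch are along these lines, a sketch assuming the deadline scheduler and a data disk at /dev/sdb (take the actual values from the documents; these settings do not persist across reboots):

```sh
# Run as root; /dev/sdb and the values are examples, not a recommendation
echo deadline > /sys/block/sdb/queue/scheduler      # use the deadline I/O scheduler
echo 4096     > /sys/block/sdb/queue/read_ahead_kb  # larger read-ahead for HDDs
```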

Kind regards

hadoopch commented 1 year ago

Hi all,

In my opinion, some of the tests make no sense, especially with the default value of osqd (= iodepth).

Analyzing disk latency with a high iodepth makes no sense at all.
A high iodepth is used in the async tests, and at the same time latency is checked against a threshold for those same async tests.

If you drive the device to its maximum IOPS and bandwidth, the measured latency is dominated by queueing and does not reflect the disk latency.
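
With a deep queue, the mean completion latency follows Little's law, latency ≈ iodepth / IOPS, so a saturated device will always report high latencies no matter how fast the disk itself is. With assumed illustrative numbers:

```sh
# Little's law: mean latency [ms] ≈ iodepth * 1000 / IOPS
echo "scale=1; 64 * 1000 / 2000" | bc   # iodepth=64 at 2000 IOPS -> 32.0 ms
```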

You can tune the fio iodepth via the osqd parameter in the test config.
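
The pure-fio equivalent of lowering osqd would be something like this sketch (path and sizes are illustrative); with iodepth=1 the reported latency is the per-request device latency:

```sh
fio --name=lat_check --directory=/data1/fio \
    --rw=randrw --bs=32k --ioengine=libaio --direct=1 \
    --iodepth=1 --size=1G --runtime=60 --time_based
```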

See also, e.g.

gerardovazquez commented 1 year ago

Hi, have you tried setting nr_requests and fifo_batch to [12-16] (on the workers) and adjusting osqd?
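
On each worker that would be something like this sketch (run as root; sdb stands for your data disk):

```sh
# fifo_batch exists only when the deadline scheduler is active
for q in nr_requests iosched/fifo_batch; do
    echo 16 > /sys/block/sdb/queue/$q   # pick a value in the 12-16 range
done
```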

KR

hadoopch commented 1 year ago

Hi Gerardo,

The storage device acceptance test was designed especially for non-local storage solutions. I ran my tests on Huawei Cloud and Open Telekom Cloud, using virtio block devices.

Have you ever tried to run the test on nodes with non-local storage, e.g. VMware?

For me, the test is doubtful:

osqd is not mentioned at all in the manual. Furthermore, in my opinion it makes no sense to measure disk latency with the same parameters that are used to measure IOPS and bandwidth.

Uli