clearcontainers / runtime

OCI (Open Containers Initiative) compatible runtime using Virtual Machines
Apache License 2.0

[DNM] test: Check IO metrics with virtio-blk #1046

Closed · sboeuf closed this 5 years ago

sboeuf commented 6 years ago

Test PR to check what results we get with virtio-blk.

Fixes #1045

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>

sboeuf commented 6 years ago

Bummer! The metrics build could not complete...

grahamwhaley commented 6 years ago

It failed on a timeout - possibly a network timeout on a docker pull. I've pressed the rebuild button for you.

devimc commented 6 years ago

lgtm

Approved with PullApprove

grahamwhaley commented 6 years ago

OK, this hung up again. Here is the tail of the log:

16:09:42 manage_ctr_mgr.sh - INFO: docker_info: version: 
16:09:42 manage_ctr_mgr.sh - INFO: docker_info: default runtime: cc-runtime
16:09:42 manage_ctr_mgr.sh - INFO: docker_info: package name: docker-ce
16:09:42 <13>Mar  2 16:08:30 manage_ctr_mgr.sh: configure_docker: configuring runc as default docker runtime
16:09:42 [Service]
16:09:42 Environment="HTTP_PROXY=http://172.16.22.1:911"
16:09:42 Environment="HTTPS_PROXY=http://172.16.22.1:911"
16:09:42 ExecStart=
16:09:42 ExecStart=/usr/bin/dockerd -D
16:09:42 Restart docker service
16:39:27 Build timed out (after 30 minutes). Marking the build as aborted.

It looks like we may have broken Docker on the metrics machines. The way I would normally expect this to happen is if we'd changed the graph driver config in Docker without doing a full clean remove/install - iirc, as we have seen before, if you swap graph drivers but don't completely clean out /var/lib/docker/ then bad stuff happens ™️.
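For reference, a full clean swap looks roughly like the following (a minimal sketch, not our actual procedure; it assumes the new storage driver has already been set in the systemd drop-in or daemon.json, and it wipes all local images and containers):

# Stop the daemon and remove all state written by the old graph driver.
sudo systemctl stop docker
sudo rm -rf /var/lib/docker
# Restart; the daemon recreates /var/lib/docker using the newly configured driver.
sudo systemctl daemon-reload
sudo systemctl start docker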

Does anybody think this is what might have happened? If so, we may have shafted the metrics machines for the minute, as they do not reboot between runs (unlike, say, the Azure VMs for QA testing, which are one-shot).

Input here welcome... if my theory is right then this may have broken metrics for all PRs for the minute.

/cc @sboeuf @chavafg @devimc

chavafg commented 6 years ago

I think your theory could be right. When manage_ctr_mgr.sh changes the runtime from runc to cc-runtime, it sets --storage-driver to overlay, but when moving back to runc it does not provide that flag. If the default storage driver is different from overlay, what you described could happen.

https://github.com/clearcontainers/tests/blob/master/cmd/container-manager/manage_ctr_mgr.sh#L253-L262
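Roughly, the asymmetry looks like this (a simplified sketch of the dockerd invocations, not the actual script; the runtime path and driver value are assumptions):

# Switching to cc-runtime pins the storage driver explicitly...
dockerd -D --add-runtime cc-runtime=/usr/bin/cc-runtime --default-runtime=cc-runtime --storage-driver=overlay2
# ...but switching back to runc leaves dockerd to pick its own default driver,
# which may not match what the old /var/lib/docker contents were created with.
dockerd -D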

grahamwhaley commented 6 years ago

I've given the metrics jenkins bot a reboot, and docker is back up - let me nudge a rebuild of this...

grahamwhaley commented 6 years ago

That build ran. fs metric output:

03:56:55 | Pass | storage-IO-linear-read-bs-16k   | 180000.000 | 213706.670 | 300000.000 | 66.7 %  |     1 |     0.000 | 0.00 %  |
03:56:55 | Pass | storage-IO-linear-write-bs-16k  | 150000.000 | 185462.760 | 300000.000 | 100.0 % |     1 |     0.000 | 0.00 %  |
03:56:55 | Pass | storage-IO-random-read-bs-16k   | 150000.000 | 199852.490 | 300000.000 | 100.0 % |     1 |     0.000 | 0.00 %  |
03:56:55 | Pass | storage-IO-random-write-bs-16k  | 150000.000 | 168912.040 | 300000.000 | 100.0 % |     1 |     0.000 | 0.00 %  |

which is above our current baseline check, but still looks a bit 'down' on performance to me.

I need to go grab a bunch of our historical data and get some trends anyway - I'll post some numbers back here if/when I get them for reference.

grahamwhaley commented 6 years ago

I also need to track down why a rebuild seems not to (always?) update the build/pass status here on GitHub for the metrics Jenkins - afaik it is set up the same as the QA runs, but the log of the rebuild seems to miss the 'posting status' step back to GitHub. Any ideas from @sboeuf @chavafg most welcome :-)

amshinde commented 6 years ago

@sboeuf I had a look at the fio storage tests that seemed to be failing earlier. @grahamwhaley confirmed that we run our metrics on overlay2. The fio tests use a shared volume on the host for testing the storage. This volume is passed to the VM using 9pfs, so we don't really exercise any code for block storage (virtio-scsi/blk).
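As a rough illustration (hypothetical image name and sizes, not the exact CI invocation), the shape of the test is something like:

# The host directory is bind-mounted into the container, so under cc-runtime
# it reaches the VM over 9pfs rather than a virtio-scsi/blk block device.
mkdir -p /tmp/fio-data
docker run --rm --runtime=cc-runtime -v /tmp/fio-data:/data some-fio-image \
    fio --name=linear-read-bs-16k --directory=/data --rw=read --bs=16k --size=1G --direct=1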

The only code that is exercised with the SCSI changes is this: https://github.com/containers/virtcontainers/blob/4fb2dacea4ab8b15807fa982e695487c9a20dc46/qemu.go#L355, namely:

if q.config.BlockDeviceDriver == VirtioBlock {
    devices = q.arch.appendBridges(devices, q.state.Bridges)
    if err != nil {
        return err
    }
} else {
    devices = q.arch.appendSCSIController(devices)
}

I do not think this should affect the storage performance. @grahamwhaley Can you point me to what the current baseline check for storage is, and give any background on how we came up with it?

cc @egernst @mcastelino

Also note that a PR raised 2 days ago has the storage tests passing: https://github.com/clearcontainers/runtime/pull/1048

sboeuf commented 6 years ago

@amshinde Well, I agree that if overlay2 is used, then there is no reason that SCSI would be the culprit for bad pnp results.

grahamwhaley commented 6 years ago

(oops, started writing this yesterday, and have since irc'd...) @amshinde The baseline is machine dependent, and is set in the checkmetrics toml file for each machine.

If we look at a recent passing PR metrics result (all the metrics PRs are run on the same machine right now...), then this would be a typical result. Note, though, that the fio disk tests are somewhat noisy right now, so give that result about 20% leeway, and/or run the tests multiple times to check what your variance looks like:

Report Summary:
07:20:35 +------+---------------------------------+------------+------------+------------+---------+-------+-----------+---------+
07:20:35 | P/F  |              NAME               |   FLOOR    |    MEAN    |  CEILING   |   GAP   | ITERS |    SD     |   COV   |
07:20:35 +------+---------------------------------+------------+------------+------------+---------+-------+-----------+---------+
...
07:20:35 | Pass | storage-IO-linear-read-bs-16k   | 180000.000 | 238691.660 | 300000.000 | 66.7 %  |     1 |     0.000 | 0.00 %  |
07:20:35 | Pass | storage-IO-linear-write-bs-16k  | 150000.000 | 204443.200 | 300000.000 | 100.0 % |     1 |     0.000 | 0.00 %  |
07:20:35 | Pass | storage-IO-random-read-bs-16k   | 150000.000 | 223790.620 | 300000.000 | 100.0 % |     1 |     0.000 | 0.00 %  |
07:20:35 | Pass | storage-IO-random-write-bs-16k  | 150000.000 | 195599.210 | 300000.000 | 100.0 % |     1 |     0.000 | 0.00 %  |
07:20:35 +------+---------------------------------+------------+------------+------------+---------+-------+-----------+---------+

The quickest thing to do is run a couple of fio test runs with and without the patch, and use checkmetrics as a fast first pass to see if there are major differences.

The FLOOR and CEILING are the range that is being checked for pass/fail. It is set very very broad on the CI right now whilst we collect up historical data (and also refine the tests to reduce the noise).
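In other words, the pass/fail decision is just a range check on the MEAN column. A trivial sketch using the linear-read numbers from the report above (truncated to integers):

# A metric passes when FLOOR <= MEAN <= CEILING.
floor=180000
ceiling=300000
mean=238691
if [ "$mean" -ge "$floor" ] && [ "$mean" -le "$ceiling" ]; then
    echo "Pass"
else
    echo "Fail"
fi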