cloud-bulldozer / benchmark-operator

The Chuck Norris of cloud benchmarks
Apache License 2.0

fio with kind: vm no longer works #527

Closed bengland2 closed 2 years ago

bengland2 commented 3 years ago

the fio benchmark with kind: vm no longer works. This is because the fio versions of the client and server must match EXACTLY, but the client runs fio-3.19-3 while the fio version in the server container image was frozen in time at quay.io/mulbc/fed-fio, so you'd see errors like this in the client log:

fio: client/server version mismatch (84 != 82)
fio: bad server cmd version 84
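
The mismatch is fatal because fio's client/server wire protocol is versioned per release. A minimal guard, sketched in shell (the function and its name are hypothetical; how the version strings are collected from the remote side is left to the caller):

```shell
# Minimal sketch: refuse to start a distributed run unless client and
# server report byte-identical fio versions (strings as printed by
# `fio --version`).
check_fio_versions() {
  client="$1"
  server="$2"
  if [ "$client" != "$server" ]; then
    echo "fio: client/server version mismatch ($client != $server)" >&2
    return 1
  fi
}

check_fio_versions fio-3.19 fio-3.19 && echo "versions match"
```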

I fixed this by generating my own fio image using Chris Blum's handy script, modified slightly. The version I'm using right now is:

#!/bin/sh

IMAGE_URL=https://cloud.centos.org/centos/8/x86_64/images/CentOS-8-GenericCloud-8.3.2011-20201204.2.x86_64.qcow2
IMAGE=centos8.qcow2
LOCAL_IMAGE=mycentos8.qcow2
QUAY_IMAGE=quay.io/bengland2/centos8-fio:latest

set -x
set -e

sudo dnf install podman podman-docker wget -y

wget --continue $IMAGE_URL -O $IMAGE
cp -f $IMAGE $LOCAL_IMAGE

if ! virt-customize --help >/dev/null; then
  sudo yum install -y virt-customize || sudo yum install -y libguestfs-tools
fi

virt-customize -a $LOCAL_IMAGE \
  --install wget,libaio,curl,python3,python3-pip \
  --run-command "wget https://raw.githubusercontent.com/cloud-bulldozer/bohica/master/stockpile-wrapper/stockpile-wrapper.py && pip3 install elasticsearch-dsl openshift kubernetes redis" \
  --copy-in fio-3.19-centos8:/usr/local/bin \
  --selinux-relabel \
  --root-password password:yourPassword \
  --firstboot-command '/usr/local/bin/fio-3.19-centos8 --server'
virt-sysprep -a $LOCAL_IMAGE

cat <<END >Dockerfile
FROM kubevirt/container-disk-v1alpha:v0.13.7
ADD $LOCAL_IMAGE /disk/
END
docker rmi $QUAY_IMAGE || echo "image already removed"
docker build -t $QUAY_IMAGE .
docker push $QUAY_IMAGE
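
The container-disk wrapper at the end of the script is worth a note: KubeVirt containerDisk images just carry the qcow2 under /disk/ in a thin base image. A standalone sketch of that step, using a placeholder file instead of the real qcow2:

```shell
# Sketch: generate the same container-disk Dockerfile the script emits,
# against a placeholder disk file, so it can be inspected before pushing.
LOCAL_IMAGE=mycentos8.qcow2
touch "$LOCAL_IMAGE"    # placeholder standing in for the real qcow2
cat <<END > Dockerfile.fio-vm
FROM kubevirt/container-disk-v1alpha:v0.13.7
ADD $LOCAL_IMAGE /disk/
END
cat Dockerfile.fio-vm
```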

And I built the fio binary itself by logging into one of the VMs:

[kni@f20-h25-000-r640 benchmark-operator]$ oc get vmis
NAME                    AGE     PHASE     IP             NODENAME
fio-server-1-e3abc763   2m14s   Running   10.131.0.171   f21-h02-000-r640.rdu2.scalelab.redhat.com
...
[kni@f20-h25-000-r640 benchmark-operator]$ virtctl console fio-server-3-e3abc763
Successfully connected to fio-server-3-e3abc763 console. The escape sequence is ^]

CentOS Linux 8
Kernel 4.18.0-240.1.1.el8_3.x86_64 on an x86_64
...
fio-server-3-e3abc763 login: root
Password: 
[root@fio-server-3-e3abc763 ~]# 

and then doing the following to build fio and install it:

dnf install -y git gcc make libaio-devel zlib-devel
git clone https://github.com/axboe/fio
cd fio
git checkout fio-3.19
./configure --disable=librados --disable=librbd
make -j
./fio --version
fio-3.19

and saved the resulting fio binary as fio-3.19-centos8, the name the image build script above copies in.
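
A tiny helper can keep the binary's name in sync with its reported version before it is copied into the image (the helper and the -centos8 suffix are just this thread's convention, not part of fio):

```shell
# Hypothetical helper: given the string printed by `./fio --version`
# (e.g. fio-3.19), derive the version-tagged binary name that the image
# build script earlier in this thread passes to --copy-in.
tagged_fio_name() {
  ver="$1"
  echo "${ver}-centos8"
}

tagged_fio_name fio-3.19    # prints: fio-3.19-centos8
```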

Eventually someone should make a PR and incorporate all this into the fio benchmark, but for now this is a usable workaround, I think.

jtaleric commented 3 years ago

@mulbc FYI

In our documentation:

vm_image: Whether to use a pre-defined VM image with pre-installed requirements. Necessary for disconnected installs.
Note: You can use my fedora image here: quay.io/mulbc/fed-fio
Note: Only applies when kind is set to vm

However, you mention that the vm_image is different for the server and client (this seems like an issue)...

I wonder whether setting image to quay.io/mulbc/fed-fio would have fixed your issue... effectively pinning the server and client to the same container image?
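
For reference, pinning would look roughly like this in the CR; the field layout below is an assumption based on the fio_distributed VM CR in this repo, and the quay.io path is the custom image from the workaround above:

```shell
# Sketch: write out a CR fragment pinning kind and vm_image together;
# field placement is assumed from the fio_distributed VM CR.
cat <<'END' > fio-vm-snippet.yaml
spec:
  workload:
    name: fio_distributed
    args:
      kind: vm
      vm_image: quay.io/bengland2/centos8-fio:latest
END
cat fio-vm-snippet.yaml
```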

bengland2 commented 3 years ago

quay.io/mulbc/fed-fio was a VM image, while the client is a pod (container) image, so they can't be the same. quay.io/mulbc/fed-fio is exactly what I set it to in the beginning, when I got the error. It used to work, now it doesn't, and his image didn't change, so the change must have been the version of fio used in the quay.io/cloud-bulldozer/fio image.

jtaleric commented 3 years ago


quay.io/mulbc/fed-fio was a VM image, while the client is a pod (container) image, so they can't be the same. quay.io/mulbc/fed-fio is exactly what I set it to in the beginning, when I got the error. It used to work, now it doesn't, and his image didn't change, so the change must have been the version of fio used in the quay.io/cloud-bulldozer/fio image.

ack - now I see that; I didn't actually look at Chris's image. If we knew which fio version he was using, we might be able to pin that, but we need a longer-term solution...

mulbc commented 3 years ago

The fio VM image is just an example... Ben pinged me about the problem yesterday and got my image creation script. If he is able to fix it by pinning to a specific fio version, I'm happy to update quay.io/mulbc/fed-fio to work again ;)

bengland2 commented 3 years ago

@mulbc exactly, not Chris's fault, but I would suggest we include VM image creation somehow as part of producing the benchmark image, so that the two stay in sync in the future. And I did get it working; it was not hard, and it's documented here, it just needs some automation. Sorry, I can't do it right now, but that's why I wrote the issue, so this wouldn't get lost.

jtaleric commented 3 years ago

ack!

I was under the impression this was being maintained, since it was set as our default in the fio_vm CR:

https://github.com/cloud-bulldozer/benchmark-operator/blob/master/resources/crds/ripsaw_v1alpha1_fio_distributed_vm_cr.yaml#L30

jtaleric commented 3 years ago

Checking in on this @bengland2

bengland2 commented 3 years ago

I haven't gotten around to submitting a PR yet. @mulbc if you get there first, fine with me; I should be able to take a look at a fix within 2 weeks.

stale[bot] commented 3 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

bengland2 commented 3 years ago

I discussed this with Russ. I think we need some kind of CI that has both OCS and CNV in it, so that we can test things like this that are going to be used in the field with benchmark-operator (examples: Goldman, Morgan Stanley). It's non-trivial to implement from a dependency standpoint, but if we want to make benchmark-operator usable by a wider audience, that's probably what we have to do. Think of it as "productizing" benchmark-operator.

mulbc commented 3 years ago

@bengland2 how do we proceed on this? Is this a task that you track in your team's backlog?

stale[bot] commented 3 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

bengland2 commented 3 years ago

@mulbc the latest discussion I heard with the CNV P&S team (Jen's idea) was that we would attempt to have the VMs run podman to invoke the same fio benchmark-operator image that is used for pods, so as to avoid maintaining two fio images, one for pods and one for VMs. Not sure if this is feasible, but I think in theory it is possible: instead of having /mnt/pvc be configured by OpenShift, the VM script would have to bind /mnt/pvc to an RBD device that the script created. This could be done using podman -v /mnt/pvc:/mnt/rbd-container-X .

Also, Jen Abrams ( @jeniferh ) suggested --net host; host networking would allow the image to connect to redis and elasticsearch outside the VM, assuming the VM's firewall lets it through.

So we would still need some benchmark-operator magic to start the VMs and invoke the image from within them, but at that point it's not a different image, just a different use of the same image, and we don't need a VM image tailored for fio, only a RHEL VM image with podman in it. This approach might make it easier to get other benchmark-operator benchmarks working with CNV too, right?
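
The flow above can be sketched as a VM-side script. Everything here is an assumption (device path, mount point, image tag), and DRY_RUN=1 only echoes the commands instead of running them:

```shell
# Dry-run sketch of the podman-in-VM idea: bind an RBD-backed device at
# /mnt/pvc, then launch the same fio container image that pod runs use.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run mkfs.xfs -f /dev/vdb          # assumed RBD-backed device in the VM
run mount /dev/vdb /mnt/pvc
run podman run --rm --net=host -v /mnt/pvc:/mnt/pvc \
    quay.io/cloud-bulldozer/fio:latest fio --server
```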

mulbc commented 3 years ago

Also, Jen Abrams ( @jeniferh ) suggested --net host; host networking would allow the image to connect to redis and elasticsearch outside the VM, assuming the VM's firewall lets it through.

I don't understand this point - why do you think the VM could not connect to elasticsearch right now? Using the regular SDN does not prevent the VM from talking to anything in the cluster (last I checked)

If I understand your suggestion correctly, you want to have a container, running a VM, running a container? :D This might work, but is maintaining a VM image with fio really that much work? In other words, wouldn't it be the same amount of work to maintain a VM image with podman?

bengland2 commented 3 years ago

@mulbc, thank you for creating the capability to run fio inside CNV VMs in the first place. The question is how to make this more maintainable and easier to do going forward.

@ebattat what do you think? @jtaleric ?

jeniferh commented 3 years ago

Yes, this idea of using the same workload binaries provided in the container image that a 'kind: pod' run would use is something I am starting to work on. I'm not sure yet whether we can use a chroot-based solution or will actually need to run from within a carefully crafted container inside the VM, but the idea is that it will reduce the maintenance work of keeping a VM image per workload, as @bengland2 mentioned, and we would be reusing the exact same bits as pod workloads, which is beneficial for testing purposes and performance comparisons.

mulbc commented 3 years ago

I think I understand this now - yes that might indeed cut down on the maintenance!

I still think that you wouldn't need the host networking for podman to connect to the elasticsearch, but we can check.

One thing we should make sure of is that we forward the disk device itself into the inner podman container, instead of mounting it in the VM and forwarding the mount.

stale[bot] commented 2 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.