FOGProject / fogproject

An open source computer cloning & management system
https://fogproject.org
GNU General Public License v3.0

Download an image not using all bandwidth #526

Closed aweiand closed 1 year ago

aweiand commented 1 year ago

Hi!

We have an issue with image download (deploy). The host does not use all the bandwidth we have (1 Gb) but averages only 100-150 Mb. With a write (image capture), though, the server uses all the bandwidth...

Sebastian-Roth commented 1 year ago

@aweiand Which version of FOG do you use and which kernel (command: file /var/www/{,html/}fog/service/ipxe/bzImage*)?

Do you capture and deploy to the same kind of hardware or different models? Multicast or unicast deploy?

aweiand commented 1 year ago

Hi @Sebastian-Roth ! Thanks for the quick reply!

This is the output:

/var/www/html/fog/service/ipxe/bzImage:   Linux kernel x86 boot executable bzImage, version 5.15.68 (buildkite-agent@Tollana) #1 SMP Sun Oct 9 06:54:22 CDT 2022, RO-rootFS, swap_dev 0x8, Normal VGA
/var/www/html/fog/service/ipxe/bzImage32: Linux kernel x86 boot executable bzImage, version 5.15.68 (buildkite-agent@Tollana) #1 SMP Sun Oct 9 06:50:09 CDT 2022, RO-rootFS, swap_dev 0x8, Normal VGA

We are capturing and deploying on the same machine/hardware, using unicast deploy.

Sebastian-Roth commented 1 year ago

@aweiand Again, which version of FOG do you use (bottom right corner when being logged into the FOG web UI)?

Do you have other hardware you can test deploying to? In most cases this kind of issue is hardware specific.

As well you may start digging into what is causing the slowness by testing different components one after the other. Let's start with the hard drive. Please make sure you have a safe backup copy of the data on the hard drive of this client machine you use for testing! Schedule a debug deploy task for this machine, boot it up and when you get to the console run the following command:

dd if=/dev/zero of=/dev/sda bs=4G count=1 oflag=direct status=progress

HINT: Be aware this will wipe the data off the drive! Depending on the drive you might need to use nvme0n1 instead of sda. You can use lsblk before running the above command to print out a list of available hard drives (and partitions).

Run that command three times in a row and finish up with hdparm -I /dev/sda | grep Number (so we know what kind of hard drive this is - again you might need to use a different device name instead of sda). Take a picture of the output on screen and post here.
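A safe way to rehearse the dd throughput test above before pointing it at a real disk is to write to a temporary file instead of a block device (a sketch; paths and sizes here are illustrative, not the actual test):

```shell
# Safe rehearsal of the dd disk-speed test: write to a temporary file
# instead of a raw block device, so nothing is destroyed. The real test
# targets /dev/sda (or /dev/nvme0n1) and WILL wipe data.
TMPFILE=$(mktemp)
# 64 MiB in 4 MiB blocks; oflag=direct is dropped here because tmpfs and
# some filesystems reject O_DIRECT, while the real test on a block device
# needs it to bypass the page cache.
dd if=/dev/zero of="$TMPFILE" bs=4M count=16 conv=fsync status=none
SIZE=$(wc -c < "$TMPFILE")
echo "wrote $SIZE bytes"   # prints: wrote 67108864 bytes
rm -f "$TMPFILE"
```

Once the invocation looks right, swap the temp file for the real device and restore oflag=direct and status=progress as in the command above.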

aweiand commented 1 year ago

@Sebastian-Roth , we use 1.5.9

Do you have other hardware you can test deploying to? In most cases this kind of issue is hardware specific.

Yes, we tested on two different machines, capturing and deploying their respective data. Let's test it =) Thanks a lot!!

Here is the picture of the commands: [screenshot]

Sebastian-Roth commented 1 year ago

@aweiand Looks like hdparm can't handle the NVMe. Don't worry about that, probably not an important information anyway.

While speeds are not up to the full throughput I would expect from an NVMe drive, it's still good enough to rule it out as the bottleneck, I suppose.

I just stumbled upon great posts by @george1421 in the forums which I had forgotten about: https://forums.fogproject.org/topic/15945/performace-testing-slow-fog-imaging https://forums.fogproject.org/topic/10459/can-you-make-fog-imaging-go-fast

aweiand commented 1 year ago

Hi @Sebastian-Roth !

Ok, I will complete the test list specified in the links and return here with the results!

Thanks again!

aweiand commented 1 year ago

Hi @Sebastian-Roth ,

After many tests here (with the help of the links), we found that FOG has an issue when downloading the image. We ran the dd command both in a FOG debug session and in a Linux Mint installation on the same machine (mounting the FOG storage NFS share). The results show that Mint can download the image at 1 Gb, but under the FOG client it doesn't go past 100-200 Mb.

aweiand commented 1 year ago

UPDATE:

We are trying on a different machine that doesn't have UEFI. On this machine, the steps suggested in the first link you posted show a dd download at 1 Gb. But when using FOG, the limit is the same as commented above (100-200 Mb). Here is a screenshot that shows it: [screenshot]

This other screenshot shows our firewall while downloading an image to a machine with an HDD in Legacy mode (UEFI mode shows the same results): [screenshot]

This graph shows an image download to a machine with NVMe in UEFI mode: [screenshot]

Sebastian-Roth commented 1 year ago

@aweiand said:

After many tests here (with the help of the links), we found that FOG has an issue when downloading the image. We ran the dd command both in a FOG debug session and in a Linux Mint installation on the same machine (mounting the FOG storage NFS share). The results show that Mint can download the image at 1 Gb, but under the FOG client it doesn't go past 100-200 Mb.

Ok, at first sight this looks like a network (driver) issue then. But we can't be sure, because your testing still combines disk IO with network IO and therefore we can't say what is causing the bottleneck.

Please revisit the links mentioned and test using iperf. Best if you can test both on FOS (FOG Linux OS) and Mint.

As well please run the following commands (both on FOS and Mint) to show the network driver used:

lspci -vnn | grep -e "^[0-9a-f]" -e "Kernel module" | grep -i net -A1
ls -al /sys/class/net/*/device/driver
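The second command can also be written as a small loop that prints each interface with its driver name directly (a sketch reading the same sysfs links; it assumes a Linux /sys layout):

```shell
# Print each network interface with its kernel driver by reading sysfs,
# the same information as `ls -al /sys/class/net/*/device/driver`.
for dev in /sys/class/net/*; do
    iface=$(basename "$dev")
    if [ -e "$dev/device/driver" ]; then
        # The driver symlink points at e.g. .../drivers/e1000e
        drv=$(basename "$(readlink -f "$dev/device/driver")")
    else
        drv="(virtual, no driver)"   # e.g. lo has no backing PCI device
    fi
    printf '%s -> %s\n' "$iface" "$drv"
done
```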
aweiand commented 1 year ago

@Sebastian-Roth , we did run the iperf3 tests, but I forgot to write the results down here (and I don't remember them). Now I'm on vacation until February... but I will ask my colleagues to test this again and post the results here...

Thanks again!! Merry Christmas and Happy New Year!

Sebastian-Roth commented 1 year ago

@aweiand Any news on this?

aweiand commented 1 year ago

@Sebastian-Roth I'm still on vacation until February... My colleagues haven't tested it again yet...

Sebastian-Roth commented 1 year ago

@aweiand Looking forward to hearing from you in February... :-)

aweiand commented 1 year ago

Hi @Sebastian-Roth !

Sorry for taking so long to respond... Well, these are the results of iperf3 on a machine with legacy boot:

[screenshot]

Client on Linux Mint (not live) using iperf3 to the FOG server


[screenshot]

The commands you requested to show the network driver on the client, using the same Linux Mint


[screenshot]

FOS using iperf3 to the FOG server


[screenshot]

The commands to show the network driver in FOS. Note I needed to adapt the ls because the * doesn't work there (it only has lo and enp0s25).

Sebastian-Roth commented 1 year ago

@aweiand Let me try to sum this up:

So if all your tests were valid, the only bottleneck left would be the server disk read speed. Or there could be something very peculiar in your network slowing things down when the image is transferred from the server to the client via NFS (on unicast).
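The server-side read speed can be checked in isolation by dumping an image file to /dev/null with dd (a sketch; the throwaway temp file below stands in for a real image under /images, whose exact path depends on your setup):

```shell
# Isolate server disk read speed (no network involved): read a file
# straight to /dev/null with dd. On the real server you would point
# `if=` at an actual image file under /images instead.
F=$(mktemp)
dd if=/dev/zero of="$F" bs=1M count=32 status=none   # 32 MiB stand-in file
# On a real server, drop the page cache first so you measure the disk,
# not RAM:  echo 3 > /proc/sys/vm/drop_caches   (as root)
dd if="$F" of=/dev/null bs=1M status=none
BYTES=$(wc -c < "$F")
echo "read $BYTES bytes"
rm -f "$F"
```

If this reads far below wire speed on the server, the disk (or hypervisor storage layer) is the limit; if it reads fast, the suspicion shifts back to the network path.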

Not sure if I ever asked this: Have you done deployment to a single machine or a bunch of PCs at the same time (unicast)?

aweiand commented 1 year ago

@Sebastian-Roth , that sums it up, yep!

We only tested with a single machine. In production we use unicast too; in that case the machines show the same transfer rate, varying a bit when we deploy to many machines at once. The total throughput on the server increases with more machines, but the deploy speed of each individual machine doesn't increase.

I forgot to say: we changed the server to another one we have with a different architecture and nothing changed... We are using XCP-ng here (on both servers).

Another option we are considering is to create another server on different hardware, without the XCP-ng hypervisor.

Sebastian-Roth commented 1 year ago

@aweiand said:

Another option we are considering is to create another server on different hardware, without the XCP-ng hypervisor.

While I have not used FOG on XCP-ng much yet I have a fair amount of experience with that hypervisor in general. I can't imagine it to cause the issue described but can't prove it myself. Before you head into that a plain NFS test is way easier to do and could shed some light onto this.

Schedule a debug deploy task, start up the client machine and when you get to the shell start the deployment by issuing the command fog. Go through the first few steps to the point where the image share is being mounted. Then stop it (Ctrl+C) to get back to the shell:

First try partclone to see if you can replicate the slow speed on a manual run:

mkfifo /tmp/pigz1
cat /images/d1p2.img >/tmp/pigz1 &
pigz -dc </tmp/pigz1 | partclone.restore -n "Test deploy" --ignore_crc -O /dev/sda2 -N -f 1

(You should be able to stop this as well - Ctrl+C - if you don't want to wait until the end.)

Then as a next step I suggest you try dumping the file without extraction and partclone - be aware this will shred the data on your client machine:

dd if=/images/d1p2.img of=/dev/sda2 status=progress

The numbers you see can't be compared bluntly but I hope you should see if it's going faster or not.
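The FIFO + decompress pipeline above can be rehearsed end to end on a throwaway file first (a sketch: gzip stands in for pigz, a plain file for the real partition image, and a byte-compare for partclone.restore, so no disk is touched):

```shell
# Rehearse the FIFO + decompress pipeline on a throwaway file.
WORK=$(mktemp -d)
head -c 1048576 /dev/urandom > "$WORK/part.raw"   # fake 1 MiB "partition"
gzip -c "$WORK/part.raw" > "$WORK/d1p2.img"       # FOG stores images compressed
mkfifo "$WORK/pigz1"
cat "$WORK/d1p2.img" > "$WORK/pigz1" &            # feed compressed data in
gzip -dc < "$WORK/pigz1" > "$WORK/restored.raw"   # stand-in for partclone.restore
wait
# Verify the round trip restored the data byte for byte.
RESULT=$(cmp -s "$WORK/part.raw" "$WORK/restored.raw" && echo OK || echo FAIL)
echo "pipeline $RESULT"
rm -rf "$WORK"
```

This confirms the plumbing; the real run replaces gzip with pigz, the fake image with /images/d1p2.img, and the output file with the partclone.restore command from the comment above.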

aweiand commented 1 year ago

Well, the results of the commands are:

[screenshot] First command, FOG view

[screenshot] First command, router view


[screenshot] Second command, router view (the terminal shows 30.0 MB/s)


Additionally, we tested dd and cp (with the commands in the link), with these results:

[screenshot] Capture with dd

[screenshot] Read with dd

[screenshot] Read with dd and cp

Well, the results don't show us the expected download rate increase...

Sebastian-Roth commented 1 year ago

@aweiand I don't understand what you mean by "Second Command on router view (on terminal shows 30.0 MB/s)". From the router stats it seems to be clearly slower. But I can't imagine a plain dd to be slower than the partclone/extract run. This is kind of impossible.

Maybe the slowness is a time based issue. Do the same test several times over a day and compare the speeds (and router stats). Maybe a backup run is dragging down the speed?!

aweiand commented 1 year ago

It's because of the format of the output, MB/s vs Mb/s (we use pfSense as the router).

We deployed the images to a newer SFF desktop today (with NVMe and an EFI system); the slowness still persists... In this case we don't have a backup routine running (and there wasn't one during the other tests either).

But another day, on the same machine we tested (and plotted the graphs from), we deployed an image at 1 Gb of bandwidth (it is a very old computer with an HDD, with lower-spec hardware than the SFF mentioned above)... We don't know what is causing this problem now, because it is the same machine with the same FOS, image and server.

I will do more tests and return here with results....

Sebastian-Roth commented 1 year ago

But another day, on the same machine we tested (and plotted the graphs from), we deployed an image at 1 Gb of bandwidth (it is a very old computer with an HDD, with lower-spec hardware than the SFF mentioned above)... We don't know what is causing this problem now, because it is the same machine with the same FOS, image and server.

Do I get this right? Deployment is simply faster if done to a different client device/machine??

You really want to read through this whole topic in the forums: https://forums.fogproject.org/post/141675 (solution in that particular post linked)

Sebastian-Roth commented 1 year ago

@aweiand We lost track of this, I suppose. Or did you manage to solve the issue?

aweiand commented 1 year ago

@Sebastian-Roth , I missed that solution because I moved to another team... but I think we solved it.