Tyelab / bruker_pipeline

Python implementation of the data processing pipeline for data from the Bruker Ultima microscope, used in conjunction with bruker_control.

Performance is Slow with Docker + Wine; Migrate to wsl2 for converter #20

Open jmdelahanty opened 2 years ago

jmdelahanty commented 2 years ago

According to a chat I had in the Zarr developers' Gitter (I still need to find that particular conversation), things might be slowing down due to the use of glob to determine how many tiffs are present in the output directory. They suggested using os.scandir instead. Another friend I was seeking advice from mentioned the same thing when looking for sources of slowdown. It's worth a shot to try this out, and even if it doesn't fix the problem, os.scandir is probably the better tool for this use case anyway.
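For reference, a minimal sketch of the two approaches (out_dir is a hypothetical stand-in for the ripper's output directory, not a name from the pipeline):

import glob
import os

out_dir = "/path/to/ripper/output"  # hypothetical output directory

# Current approach: glob materializes a full list of matching paths.
n_tiffs = len(glob.glob(os.path.join(out_dir, "*.tif")))

# Suggested approach: os.scandir streams raw directory entries, so we can
# count matching names without building a list (and, on Linux, without a
# per-file stat when only the name is needed).
n_tiffs = sum(1 for entry in os.scandir(out_dir) if entry.name.endswith(".tif"))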

bjnielsen commented 2 years ago

This is because of stat(), which is often the cause of the I/O wait I was showing you a while back. stat() pulls in all the metadata, which you often don't need, like mtime, ctime, atime, etc. In fact, we turn off atime on the fileserver since it's always different and can quickly cause an I/O bottleneck: it never stays in cache (always new) and it's never really used.

$ man 2 stat # shows you the structure of all information in an inode; all of this is collected with any stat call
$ stat # will show you all the stat info for any given file, directory, etc.

Note that directories are just files with tables of what's in them. The tables contain the list of filename/inode pairs: an inode with a table of inodes, including a parent inode for "..". This is why a "mv" in UNIX is virtually instantaneous: you're only updating the tables in two inodes. No data actually moves.

Take-home lesson: with any code, avoid the stat unless you need it. If you only need filenames, all you're doing is reading a table in a series of inodes.

jmdelahanty commented 2 years ago

Thank you for this tip Bryan! I had no idea that's how these things work and that it was pulling all that information whenever you called stat. Also very cool to learn that's why mv is so fast! I had known it was related to just reassigning some kind of metadata, but I didn't know it had to do with the inodes.

In this case we don't even need the filenames! We literally just need how many files exist!

bjnielsen commented 2 years ago

No problem. Here's a real-world test I just did to demonstrate. Create 1000 files, and iterating over just filenames is 68.6 times faster than iterating over both filenames and stats for each of the 1000 files.

Let us know if you're seeing performance issues in general; we may have some insight into why, and we can help out.

$ mkdir tmp1; cd tmp1; time for i in $(seq -w 1 1000); do touch $i; done; cd ..
real    0m6.529s
user    0m1.416s
sys     0m1.384s
$ time find ./tmp1/. -maxdepth 1 -type f -exec ls -l {} \; > /dev/null # force a stat on every file
real    0m2.264s
user    0m1.194s
sys     0m0.658s
# rm and recreate to avoid caching
$ rm -rf tmp1
$ mkdir tmp1; cd tmp1; time for i in $(seq -w 1 1000); do touch $i; done; cd ..
real    0m6.346s
user    0m1.454s
sys     0m1.352s
$ time find ./tmp1/. -type f -print > /dev/null 2>&1 # no stat, just path/filename
real    0m0.033s
user    0m0.000s
sys     0m0.008s
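
The same comparison can be run from Python, for reference (a sketch that assumes the tmp1 directory of 1000 files created above):

import os
import time

path = "tmp1"  # the throwaway directory from the shell test above

t0 = time.perf_counter()
names = os.listdir(path)  # names only: just reads the directory table
t1 = time.perf_counter()
_ = [os.stat(os.path.join(path, n)) for n in names]  # forces one stat() per file
t2 = time.perf_counter()

print(f"names only: {t1 - t0:.4f}s, names + stat: {t2 - t1:.4f}s")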
bjnielsen commented 2 years ago

In this case we don't even need the filenames! We literally just need how many files exist!

Do you need a file count? Or just whether any file(s) exist?

jmdelahanty commented 2 years ago

We just need the file count in the directory. The converter is killed only when the specified number of images has been collected or when a very long timeout period is hit. As far as I'm aware, the converter doesn't output anything when it finishes successfully or crashes, so something like subprocess.communicate for collecting its outputs doesn't help us here.

So what happens instead is the file system is polled every 10 seconds to see if the number of found images matches the number of expected images from the recording. See here for the conditional it's looking for.
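
For illustration, a minimal sketch of that polling loop using os.scandir (the function and parameter names here are hypothetical, not the pipeline's actual code):

import os
import time

def wait_for_images(directory, expected, timeout_s=86400, poll_s=10):
    """Poll `directory` until `expected` tiffs exist or the timeout is hit."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Count names only; no per-file stat is needed for a count.
        count = sum(1 for entry in os.scandir(directory) if entry.name.endswith(".tif"))
        if count >= expected:
            return True
        time.sleep(poll_s)
    return False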

jmdelahanty commented 2 years ago

Here's some very rough, nonstandard benchmarking with different uses of the ripper. There's a summary at the bottom.

Note: All "Total to tiff time" figures include the time it takes the ripper to convert the voltage recording binary into a CSV file, plus the server-to-compute-node transfer when applicable. Directory locations are noted, as well as where things are being written. This assumes that the raw binaries/metadata files have already been written to the server for the REMOTE conversions. For LOCAL directories, the binaries are stored on the local machine's SSD.

For the Avg Tifs/sec metric, the true number is likely a fair bit higher than what's written for the LOCAL ON NATIVE WINDOWS measures because, again, the conversion to CSV happens first and is included in the total time. It would be somewhat higher for the REMOTE TO ones as well. Memory usage and disk write speeds during conversion have yet to be profiled. I'm also not sure how to actually do that without just staring at the screen (which is what I'm currently doing).

LOCAL ON NATIVE WINDOWS
Using: C:\Users\jdelahanty\Desktop\20211105_CSE020plane1-587.325_raw-013
Writing to: C:\Users\jdelahanty\Desktop\local_out
Ripper: 5.5.64.500

AMD Ryzen 9

Copying FROM server TO local machine special-k.snl.ad.salk.edu: 22 min

Transfer topped out close to 100MB/s most of the time.

Total Tiffs: 45510

Total Conversion Time: ~ 6 min (!)

Avg Tifs/sec: ~ 126/s (!)

Total to tiff time: ~ 28 min

Writing to disk this way shows a pretty sustained creation rate of 323 MB/s during processing, and its memory usage remains at just under 0.5 GB throughout the whole conversion...

If there's a way to speed up transfers to the Windows machine upstairs, that would be very cool. A better solution would be a dedicated Windows machine linked over fast ethernet for moving things to local SSDs, but as far as I'm aware, one is not available in the cluster at the moment. The other option would be to have a script invoke subprocesses on that Windows machine so conversions can happen in parallel. I don't know how to tell a specific Windows machine to do things from a script, though...
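
One hedged possibility, for illustration only: if the Windows machine ran an OpenSSH server, a Linux-side script could kick off the ripper remotely over ssh. The hostname, user, and executable paths below are all hypothetical placeholders:

import subprocess

# All names and paths here are placeholders, not real machines or files.
result = subprocess.run(
    ["ssh", "jdelahanty@windows-ripper.example.edu",
     r"C:\rippers\ripper.exe C:\data\raw-013"],
    capture_output=True,
    text=True,
    timeout=2 * 60 * 60,  # allow up to two hours for the conversion
)
print(result.returncode, result.stderr)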

REMOTE TO NATIVE WINDOWS
Using: X:\specialk\jeremy_testing\ripping_tests\20211105_CSE020plane1-587.325_raw-013
Ripper: ibid

AMD Ryzen 9

Total Tiffs: 45510

Total Conversion Time: ~ 19 min

Avg Tifs/sec: ~ 40

Total to tiff time: ~ 19 min

Reading over the 1 Gbps line slows conversions down quite a bit, but the total (~19 min) isn't that much shorter than the Windows-local total (~28 min) only because that total was dominated by the time it took to transfer the data to Austin's Windows machine upstairs.

This uses approximately the same amount of memory (just under 0.5 GB) consistently, but it's writing to disk at a consistently slow rate of about 22 MB/s! It's pulling data from the network at nearly 0.5 Gbps, so half the bandwidth of the available 1 Gbps line...

REMOTE TO COMPUTE NODE SCRATCH, DOCKER WINE
Using: X:\specialk\jeremy_testing\ripping_tests\20211105_CSE020plane1-587.325_raw-013
Ripper: ibid

Total Tiffs: 45510

Total Conversion Time: ~ 28 min

Avg Tifs/sec: ~ 27

Total to tiff time: ~ 28 min

This takes slightly longer, maybe because it's reading data into the machine over the LAN before processing it, but the slowdown is much more likely related to Docker + Wine doing things slowly, as Annie has mentioned.

Given the slow tiff-writing speeds over the network shown in the REMOTE TO NATIVE WINDOWS section, the network also seemed like a candidate for the slowness. But since using the local scratch space for reading/writing data is similarly slow overall, Docker + Wine seem to be the more likely culprits...

LOCAL COMPUTE NODE SCRATCH TO LOCAL SCRATCH, DOCKER WINE
Location: cheetos.snl.salk.edu

$ time cp /snlkt/data/specialk/jeremy_testing/ripping_tests/20211105_CSE020plane1-587.325_raw-013 /scratch/snlkt2p_format_testing/
real    1m17.496s

Using: /scratch/snlkt2p_format_testing/20211105_CSE020plane1-587.325_raw-013
Ripper: ibid

Total Tiffs: 45510

Total Conversion Time: ~ 23 min

Avg Tifs/sec: ~ 33

As expected, moving data between nodes and storage within the datacenter is very fast. However, there was only a slight speedup from having the data available on the machine's SSD (disappointingly).

Staring at the memory usage through docker stats, it shows basically the same memory profile as the native Windows machine while reading data: under 0.5 GB most of the time.

CONTAINER ID        NAME                                                 CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
848d3eb9a764        scratch-plane1-snlkt2p_format_testing-SNLKT-ripper   102.46%             377.9MiB / 10GiB    3.69%               0B / 0B             147kB / 22.8GB      61

However, the disk read/write speeds are very slow.

While doing this:

$ iostat /dev/sdd -y 3

I see this at the very beginning once tiffs have started being generated:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd              79.67         0.00    32018.33          0      64960

Over time this slowly decreases further and further until it looks like it hits a floor of about 10000 kB_wrtn/s.

I don't think we can really tell whether this is a hard-drive-specific issue or something going on in the code, since we have no idea what the converter is actually doing... The only thing that remains to be tested is the subject of this issue, really.

I feel like it's most likely a Wine thing, as Annie said, but at the very beginning of the conversion it starts out doing at least okay. Why would it slow down so badly the longer it runs? And why doesn't it ever reach the fast 300+ MB/s tiff generation seen on a Windows machine?

I'm still not sure how to properly monitor file I/O on Linux, but I captured iostat output by piping it to a txt file. Not sure how helpful it is...

local_scratch_iostat.txt

Summary:

As we've known, performing the ripping on a Windows machine with the data stored locally on the computer's SSD is much, much faster. I'm not aware of a dedicated Windows machine we could drive through shell scripting for this procedure, as Annie has suggested, which would be really cool. Sending the raw binaries over the network and then performing the conversion incurs a performance penalty, which is compounded by the slowdowns in the Docker container (probably because Wine isn't translating the converter's calls to Unix very well in this case, or because of the other reasons mentioned above, i.e., performance penalties from virtualization through Docker). Making copies of files from the server to local scratch spaces in the data center is crazy fast (just over 1 minute for 75 GB of stuff!), especially compared to copying to Austin's special-k computer (as expected, since it's on a slower connection). But the unnecessary copying of data, even temporarily, isn't ideal, and I would imagine it's something we'd like to avoid.

There are definitely better ways of doing benchmarking, and there are logs available from the Docker implementation about tif generation, but I'm not sure how to properly log things like memory usage, disk write speeds, and CPU usage to files yet. No real idea how to log them properly on Linux...
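
One possible approach, sketched here under the assumption that the third-party psutil package is available (this is not something the repo currently does): sample CPU, memory, and cumulative disk writes on an interval and append them to a CSV for later plotting.

import csv
import time

import psutil  # third-party: pip install psutil

with open("conversion_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["unix_time", "cpu_percent", "mem_used_bytes", "disk_write_bytes"])
    for _ in range(1200):  # ~1 hour of samples, 3 s apart
        disk = psutil.disk_io_counters()
        writer.writerow([
            time.time(),
            psutil.cpu_percent(interval=None),
            psutil.virtual_memory().used,
            disk.write_bytes,  # cumulative; diff successive rows for a write rate
        ])
        f.flush()
        time.sleep(3)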

bjnielsen commented 2 years ago

FYI - You can boot a Linux kernel in Windows: use WSL v2. This is pure Linux, as opposed to a compatibility layer like Wine. https://docs.microsoft.com/en-us/windows/wsl/about I've been using WSL for some time now when I have to use Windows. It adds the "UNIX" component to Windows, similar to how OS X is built around a UNIX BSD kernel, so you get that for free. In all cases I have a true bash shell, exec'ing code/apps from there.

Once I figure out what the "converter" is, I can profile a breakdown; i.e., I can test data through a raw TCP/IP socket directly between CPUs, then add in I/O to/from different forms of storage, etc. Break up the problem. You have a bottleneck somewhere; we just need to find it. We have visibility into what's happening on the storage side as well and can add visibility into the systems too.

Could you gather a bundle of code that is the "converter" and put it on the fileserver where I can access it? Or maybe it's already a git repo? :-) I'm guessing it is, but some insight will save me some digging around. I know you gave me the example dir location of a "pile of tiffs" needing processing.

The push should be towards Linux (and related tools) for pipeline processing. That's where all the power is located (99+%): compute, storage, and network.

Thanks, Bryan


jmdelahanty commented 2 years ago

Sent along an email a little bit ago that has that kind of information! Can repost here if we'd like.

You can boot a linux kernel in windows, use wsl v2. This is pure linux, as opposed to an emulator like wine.

I remember now that you mentioned this! I completely forgot about it. Do we have any machines that are running wsl v2 that we can try running the converter on?

bjnielsen commented 2 years ago

Hey,

Yeah, I wanted to elaborate a bit in this thread, thinking others might be watching it, and if not, so that the information pertaining to this code base is located here. I think I'll close out that ticket, referencing this thread, and document the progress here where it's related.

You can install wsl2 on any machine. What's the name of your win machine? I'll install it for you.

Thanks, Bryan


jmdelahanty commented 2 years ago

The machine that I'm working on at my desk is called busdriver (so busdriver.snl.ad.salk.edu). On the cluster I've been using Cheetos for the docker version.

So to be clear here: the converter (you can find the executables here) was written for Windows. If I'm understanding your suggestion correctly, the idea would be to have a Windows machine with WSL2 installed in the cluster that can run this converter. That machine could then be told to execute the converter from a different Linux machine running a shell script or something.

If we ran it this way, we would avoid the use of Docker/Wine altogether and get the speed of a native Windows machine with the flexibility of Linux.

jmdelahanty commented 2 years ago

One thing discussed at our 2P meeting yesterday was to just have a Windows machine next to the Bruker computer that does nothing but ripping, which would be nice and fast, and then runs the conversion to H5 before the data is sent to the server. Writing tiffs to H5 using this takes 2-3 minutes, but there are other implementations we can use too.
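
For context, a minimal sketch of what that tiff-to-H5 step could look like, assuming the tifffile and h5py packages (the tool linked above may work differently, and the directory/file names here are placeholders):

import glob

import h5py
import tifffile

tiff_paths = sorted(glob.glob("ripped_output/*.tif"))  # hypothetical directory
first = tifffile.imread(tiff_paths[0])

with h5py.File("recording.h5", "w") as h5:
    stack = h5.create_dataset(
        "data",
        shape=(len(tiff_paths), *first.shape),
        dtype=first.dtype,
        chunks=(1, *first.shape),  # one frame per chunk for frame-wise reads
    )
    for i, path in enumerate(tiff_paths):
        stack[i] = tifffile.imread(path)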