Closed: robertbartel closed this 2 months ago
Caching may help, but memory configuration and VM vs container vs bare metal may make a difference. Totally different products, but CEPH had issues if it had too much RAM. It had something to do with data replication. Don't remember - it was a nightmare. Reading and writing from the object store may also cause problems due to rapid io on the part of ngen. If there's just a little bit of extra overhead on the object store calls and ngen is opening and closing a bajillion files, it might be crippling. If it comes down to it, a pre and post processing step to stage the required data locally might be a solution. That'll be its own nightmare.
> Reading and writing from the object store may also cause problems due to rapid io on the part of ngen. If there's just a little bit of extra overhead on the object store calls and ngen is opening and closing a bajillion files, it might be crippling.
This is basically my initial guess for the problem: too much IO, applied to only a single MinIO node now (due to other restrictions; see #242 et al.), running on older, less robust hardware. It's not out of the question that the object store proxy or Docker storage plugin configs could be sub-optimal, or that something else is going on, but I'm not getting my hopes up.
> If it comes down to it, a pre and post processing step to stage the required data locally might be a solution. That'll be its own nightmare.
I suspect the best way to do this would actually be to just support another dataset type and add code to the data service to transform to/from such dataset, similarly to how we do other dataset derivations now. And that should also make it easy to do such things as operations unto themselves. Doing things purely within the job workers would probably be slower, would be less reusable, and might compel us to add other complicated functionality like managing and allocating disk space like we do CPUs.
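To sketch the staging idea concretely (nothing here is an existing DMOD API; names like `stage_in`/`stage_out` are hypothetical), deriving a "local" dataset would amount to a copy to node-local scratch before the job and a sync back afterward:

```python
import pathlib
import shutil


def stage_in(dataset_dir: pathlib.Path, scratch_dir: pathlib.Path) -> pathlib.Path:
    """Copy an object-store-backed dataset to node-local scratch before the job runs."""
    local_copy = scratch_dir / dataset_dir.name
    shutil.copytree(dataset_dir, local_copy)
    return local_copy


def stage_out(local_copy: pathlib.Path, dataset_dir: pathlib.Path) -> None:
    """Sync job outputs from the local copy back to the canonical dataset location."""
    shutil.copytree(local_copy, dataset_dir, dirs_exist_ok=True)
```

The data service could treat the local copy as its own dataset type, which is what would make the derivation reusable as a standalone operation rather than job-worker logic.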
After running another experiment yesterday, I am a little more concerned that the choke point is actually the s3fs Docker plugin itself, needed for the volumes. I am also considering an alternative to just the plugin part of the object store dataset setup.
The experiment was done using my local, 2-node deployment, with the MinIO and proxy services running on one node and the ngen job constrained to the other. I also left forcings out of the object store, using a local bind mount for those, but the other datasets - in particular the output dataset being written to - were s3fs plugin volumes.
When things actually got to executing timesteps, I observed that the `ngen` and `minio` processes were not being heavily taxed: varying between 1-10% CPU for `ngen` processes and 20-30% CPU for `minio`. I also watched the network, and it wasn't being saturated (I forget how much exactly, but I want to say high Kb). But the `s3fs` process on the Swarm node with the job service container (i.e., where the plugin of interest was) was stuck at 100% CPU.
I have a few ideas. The first few involve finding ways to give the volume plugin more than 1 CPU, though I haven't been able to quickly find a simple way to allocate more resources to a plugin. I've thought about using several plugins via aliasing, but I don't think there's a good way to break things up for a single dataset that way, and the writes to the output dataset seem to be what's causing the load. I've started looking at further custom modifications to the Docker plugin, though only just started.
It occurred to me though that perhaps we should consider moving away from the plugin entirely and back to FUSE mounts directly within the container. This was the original design for using object store datasets in the workers. I suspect performance would be better, and I'm very curious how much.
Of course, there was a reason why we moved away from this: it might not be feasible for some environments. The current plugin-based method is more universally compatible, so we would need to continue to support that and add a way to configure which mechanism for mounting object store datasets would be used by a deployment. And we might want to combine this idea with some of the other pieces also: e.g., dividing datasets across multiple plugins via aliasing, if the plugin method was being used.
> It occurred to me though that perhaps we should consider moving away from the plugin entirely and back to FUSE mounts directly within the container. This was the original design for using object store datasets in the workers. I suspect performance would be better, and I'm very curious how much.
I was also curious about this. Here is what I've done so far:
To start, the goal of my exploration / work was to benchmark IO performance of an `s3fs-fuse` mounted volume. Namely, I was interested in read and write throughput, as those are the primary workloads.
The aim for this initial investigation is to keep things simple, but reproducible, so that it is easy to run these benchmarks elsewhere. With that in mind, here are the hardware specs for the machine I used for testing:
```
OS: macOS Ventura 13.4.1 22F82 arm64
Host: MacBook Pro (14-inch, 2023)
Kernel: 22.5.0
CPU: Apple M2 Pro (10)
GPU: Apple M2 Pro (16) [Integrated]
Memory: 32.00 GiB (LPDDR5)
Disk (/): 460.00 GiB - apfs [Read-only]
```
All testing was done via `docker`, so in the same spirit, here is a truncated output of `docker version` from the testing machine:
```
Client:
 Cloud integration: v1.0.35+desktop.13
 Version: 26.1.1
 API version: 1.45

Server: Docker Desktop 4.30.0 (149282)
 Engine:
  Version: 26.1.1
  API version: 1.45 (minimum version 1.24)
```
For these tests, the Docker Desktop VM was allocated 64GB of virtual disk, 4 vCPUs, 12GB of memory, and 4GB of swap.
The "methodology" for the benchmark is pretty straightforward:

1. Run a `minio` server
2. Get `s3fs-fuse` installed and create a FUSE-mounted directory to a bucket in the `minio` object store
3. Run the benchmark

Glossing over the first two steps for now: after doing some searching, I found several mentions of `iozone`, a filesystem benchmark tool. The `iozone` docs specifically talk about benchmarking an NFS drive, so it seemed like at least a good place to start. `iozone` doesn't look like it can, for example, benchmark performing IO on a large quantity of files (for example, the case of a BMI init config dataset), but it does give us pretty much what we want in the throughput metrics it provides. More specifically, `iozone` tests a variety of IO operations, varying the size of the file and the size of the record, and records the throughput in Kb/second. I interpreted the record size to mean the size of an individual operation (e.g. a 2k file would be read twice with a 1k record size). The docs are pretty thorough and a great resource; here is a link for those interested.

TL;DR: use `iozone` for step 3.
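To make the file-size / record-size semantics concrete, here is a toy, local-disk version of a single cell of `iozone`'s writer report (not the real tool, just an illustration of what one throughput number measures):

```python
import os
import time


def writer_cell_kb_per_s(file_kb: int, record_kb: int, dirpath: str) -> float:
    """Rough equivalent of one cell in iozone's writer report: write a
    file_kb-KB file in record_kb-KB records and return throughput in KB/sec."""
    record = os.urandom(record_kb * 1024)
    n_records = file_kb // record_kb  # e.g. a 2 KB file with 1 KB records -> 2 writes
    path = os.path.join(dirpath, "iozone_toy.tmp")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_records):
            f.write(record)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes actually hit the backing store
    elapsed = time.perf_counter() - start
    os.remove(path)
    return file_kb / elapsed
```

Pointing `dirpath` at a FUSE mount versus local disk is essentially the comparison the tables below make, minus `iozone`'s additional operation types.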
The `s3fs-fuse` GitHub page links to a project that builds and publishes `s3fs-fuse` in a Docker image. At the time of writing, the latest version of `s3fs-fuse` is `1.94`, so I grabbed the `efrecon/s3fs` image with that tag. Unfortunately, the image uses a slightly older version of Alpine Linux, and the default `apk` repos did not have `iozone`. Luckily, a version of Alpine one patch release newer does have a distributable `iozone`. I ended up using that for simplicity.
Open two terminal sessions and follow these steps to run the benchmark:
```shell
docker network create iozone_net
mkdir data

docker run \
    --rm \
    --name=minio \
    --net iozone_net \
    --expose=9000 --expose=9001 \
    -p '9000:9000' -p '9001:9001' \
    -v $(pwd)/data:/export \
    minio/minio:RELEASE.2022-10-08T20-11-00Z \
    server /export --console-address ":9001"
```
In a second terminal, run:

```shell
mc mb local/iozone

docker run -it --rm \
    --net iozone_net \
    --device /dev/fuse \
    --cap-add SYS_ADMIN \
    --security-opt "apparmor=unconfined" \
    --env 'AWS_ACCESS_KEY_ID=minioadmin' \
    --env 'AWS_SECRET_ACCESS_KEY=minioadmin' \
    --env UID=$(id -u) \
    --env GID=$(id -g) \
    --entrypoint=/bin/sh \
    efrecon/s3fs:1.94
```

Everything below is run from within the running container:

```shell
arch=$(uname -m)
wget "http://dl-cdn.alpinelinux.org/alpine/v3.19/community/${arch}/iozone-3.506-r0.apk"
apk add --allow-untrusted iozone-3.506-r0.apk

mkdir iozone_bench
s3fs iozone /opt/s3fs/iozone_bench \
    -o url=http://minio:9000/ \
    -o use_path_request_style

cd iozone_bench
iozone -Rac
```
`iozone` can take a little while to run, but once it has finished, it outputs a copy-and-pasteable, Excel-compliant result. I copied the output to a file, replaced all instances of 2 or more spaces with a tab character, and used the following Python script to get the results into Excel format. You will need to `pip install pandas openpyxl` for this to work; replace `FILENAME` with what is appropriate for you.
```python
import io
import pathlib

import pandas as pd

FILENAME = "./excel_out.tsv"

data_file = pathlib.Path(FILENAME)
data = data_file.read_text()

with pd.ExcelWriter(f"{data_file.stem}.xlsx") as writer:
    for sheet in data.split("\n\n"):
        name_idx = sheet.find("\n")
        name = sheet[:name_idx].strip('"')
        sheet_data = sheet[name_idx + 1:]
        df = pd.read_table(io.StringIO(sheet_data))
        df.rename(columns={"Unnamed: 0": "File Size Kb"}, inplace=True)
        df.set_index("File Size Kb", inplace=True)
        df.columns.name = "Record Size Kb"
        df = df / 1024  # translate records from Kb/sec to Mb/sec
        df.to_excel(writer, sheet_name=name)
```
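For what it's worth, the "two or more spaces to a tab" preprocessing step can itself be scripted rather than done in an editor; a small helper (hypothetical name) would be:

```python
import re


def to_tsv(raw: str) -> str:
    """Replace runs of two or more spaces with a single tab, leaving single
    spaces (e.g. inside report names) untouched."""
    return re.sub(r" {2,}", "\t", raw)
```

For example, `to_tsv("64    13    25")` returns `"64\t13\t25"`.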
The output spreadsheet will have sheets for each of the io benchmarks (e.g. Writer report).
Writer report:
| File Size (KB) \ Record Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
64 | 13 | 25 | 30 | 28 | 27 | ||||||||
128 | 31 | 28 | 31 | 31 | 30 | 41 | |||||||
256 | 49 | 50 | 64 | 51 | 54 | 56 | 59 | ||||||
512 | 67 | 81 | 86 | 74 | 101 | 97 | 101 | 97 | |||||
1024 | 77 | 95 | 115 | 133 | 119 | 129 | 139 | 125 | 141 | ||||
2048 | 92 | 116 | 142 | 160 | 165 | 155 | 175 | 170 | 168 | 174 | |||
4096 | 90 | 121 | 155 | 174 | 181 | 198 | 186 | 188 | 185 | 193 | 188 | ||
8192 | 86 | 124 | 168 | 188 | 198 | 198 | 204 | 202 | 205 | 205 | 201 | 205 | |
16384 | 85 | 127 | 157 | 190 | 205 | 211 | 209 | 216 | 211 | 217 | 211 | 200 | 209 |
32768 | 0 | 0 | 0 | 0 | 205 | 243 | 243 | 310 | 340 | 257 | 306 | 378 | 268 |
65536 | 0 | 0 | 0 | 0 | 279 | 308 | 327 | 363 | 364 | 278 | 336 | 281 | 348 |
131072 | 0 | 0 | 0 | 0 | 337 | 329 | 413 | 384 | 400 | 416 | 348 | 387 | 337 |
262144 | 0 | 0 | 0 | 0 | 357 | 419 | 456 | 445 | 430 | 452 | 467 | 454 | 322 |
524288 | 0 | 0 | 0 | 0 | 399 | 445 | 463 | 434 | 409 | 424 | 407 | 461 | 453 |
Reader report:
| File Size (KB) \ Record Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
64 | 497 | 822 | 719 | 1786 | 1786 | ||||||||
128 | 874 | 919 | 702 | 1543 | 3477 | 2719 | |||||||
256 | 1373 | 1525 | 980 | 1437 | 2379 | 2746 | 3960 | ||||||
512 | 1137 | 1587 | 1630 | 179 | 2485 | 2065 | 3568 | 9809 | |||||
1024 | 2062 | 2295 | 2160 | 1698 | 2369 | 2232 | 3311 | 4220 | 5747 | ||||
2048 | 1547 | 1843 | 1695 | 1875 | 2212 | 2370 | 2280 | 3072 | 3976 | 7525 | |||
4096 | 1895 | 1847 | 1983 | 2785 | 3080 | 2497 | 2817 | 3058 | 2751 | 2541 | 8565 | ||
8192 | 2587 | 2459 | 2001 | 1978 | 2140 | 2502 | 2304 | 2166 | 2280 | 2896 | 3749 | 7888 | |
16384 | 2476 | 2506 | 2523 | 2504 | 2570 | 2504 | 2360 | 2500 | 2749 | 2808 | 3098 | 3398 | 4342 |
32768 | 0 | 0 | 0 | 0 | 2632 | 2601 | 2647 | 2601 | 2675 | 2814 | 2974 | 2942 | 3311 |
65536 | 0 | 0 | 0 | 0 | 1930 | 1856 | 1902 | 1496 | 1489 | 1894 | 1911 | 1889 | 1900 |
131072 | 0 | 0 | 0 | 0 | 1534 | 1577 | 1545 | 1548 | 1560 | 1560 | 1611 | 1472 | 1495 |
262144 | 0 | 0 | 0 | 0 | 1381 | 1342 | 1395 | 1413 | 1402 | 1407 | 1400 | 1323 | 1331 |
524288 | 0 | 0 | 0 | 0 | 1372 | 1354 | 1366 | 1353 | 1366 | 1362 | 1360 | 1286 | 1238 |
For comparison's sake, here are "baseline" `iozone` benchmark runs from an unmounted directory, `/home`, in the same Docker container:
Baseline Writer report:
| File Size (KB) \ Record Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
64 | 473 | 2084 | 1116 | 1177 | 3702 | ||||||||
128 | 1359 | 1179 | 2150 | 6256 | 6570 | 6570 | |||||||
256 | 2578 | 4391 | 4644 | 7609 | 3844 | 7143 | 7837 | ||||||
512 | 4058 | 5878 | 3122 | 7831 | 7802 | 6172 | 2793 | 3623 | |||||
1024 | 3814 | 4526 | 4425 | 3367 | 6618 | 4719 | 3558 | 4828 | 6800 | ||||
2048 | 2805 | 6454 | 6370 | 6006 | 7607 | 8099 | 3520 | 8060 | 7908 | 4794 | |||
4096 | 4198 | 1913 | 4329 | 3718 | 7662 | 5235 | 5256 | 7903 | 7998 | 7693 | 7170 | ||
8192 | 2481 | 5256 | 4287 | 6546 | 5662 | 5174 | 5989 | 5899 | 5965 | 6136 | 1703 | 2641 | |
16384 | 3776 | 5305 | 5084 | 4992 | 4232 | 4201 | 4853 | 6038 | 6382 | 6054 | 5569 | 4212 | 3887 |
32768 | 0 | 0 | 0 | 0 | 3822 | 4298 | 6078 | 5873 | 5615 | 5502 | 5615 | 5170 | 3756 |
65536 | 0 | 0 | 0 | 0 | 4651 | 5625 | 4872 | 5759 | 4922 | 5497 | 5090 | 4923 | 3860 |
131072 | 0 | 0 | 0 | 0 | 3997 | 5644 | 5634 | 5583 | 5738 | 5420 | 5266 | 2513 | 3943 |
262144 | 0 | 0 | 0 | 0 | 3486 | 5535 | 6271 | 6186 | 6377 | 6245 | 6223 | 5685 | 4587 |
524288 | 0 | 0 | 0 | 0 | 4014 | 6914 | 6834 | 7174 | 7086 | 7191 | 7013 | 6767 | 4876 |
Baseline Reader report:
| File Size (KB) \ Record Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
64 | 3458 | 6934 | 4535 | 12022 | 20471 | ||||||||
128 | 8916 | 8348 | 5472 | 15509 | 20317 | 25199 | |||||||
256 | 9637 | 14017 | 13832 | 16696 | 21053 | 14603 | 27736 | ||||||
512 | 9246 | 14685 | 14383 | 17954 | 19099 | 16156 | 13206 | 22843 | |||||
1024 | 7148 | 10760 | 13538 | 16657 | 12490 | 15153 | 13538 | 13495 | 27813 | ||||
2048 | 8518 | 13622 | 16138 | 16390 | 18526 | 19073 | 18526 | 21034 | 22995 | 28552 | |||
4096 | 9458 | 11867 | 14136 | 18102 | 18260 | 16260 | 15395 | 19134 | 20194 | 20415 | 18522 | ||
8192 | 7897 | 11680 | 11696 | 14384 | 13113 | 15041 | 12442 | 12158 | 12620 | 11976 | 11283 | 11189 | |
16384 | 6864 | 7296 | 9726 | 7099 | 8744 | 9435 | 8959 | 8252 | 8682 | 9201 | 8475 | 6457 | 6004 |
32768 | 0 | 0 | 0 | 0 | 7114 | 7972 | 8417 | 8294 | 9143 | 7294 | 7035 | 5957 | 4332 |
65536 | 0 | 0 | 0 | 0 | 7462 | 7423 | 8094 | 7011 | 7726 | 6446 | 6289 | 4453 | 3547 |
131072 | 0 | 0 | 0 | 0 | 8645 | 8837 | 8797 | 8783 | 8354 | 8029 | 6252 | 4184 | 3635 |
262144 | 0 | 0 | 0 | 0 | 10142 | 10754 | 10649 | 11083 | 9718 | 9578 | 8220 | 5096 | 3520 |
524288 | 0 | 0 | 0 | 0 | 13191 | 11950 | 13653 | 14012 | 12545 | 10790 | 10218 | 6516 | 3998 |
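As a quick spot check of the gap, taking the 16384 KB file / 16384 KB record cells from the Writer and Baseline Writer reports above:

```python
# Cells copied from the Writer and Baseline Writer reports above (values in
# MB/sec per the conversion script), for the 16384 KB file / 16384 KB record cell
fuse_writer_mb_s = 209
baseline_writer_mb_s = 3887

slowdown = baseline_writer_mb_s / fuse_writer_mb_s
print(f"local /home outpaces the s3fs-fuse mount by ~{slowdown:.1f}x for this cell")
```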
I was not overly shocked to see that there is over an order of magnitude difference between the fuse mount and local storage across the board, but I also didn't think local performance would be this poor. The fastest fuse mount combination for
Next, I took a look at the `s3fs-volume-plugin`. After installing and configuring the plugin, I created a container with a volume mounted from `minio`. Below are heatmaps of `iozone`'s writer and reader reports; values are in MB/second.

There is a substantial drop-off in both write and read performance compared to the more direct `s3fs-fuse` tests above.
As a sanity check, I ran a benchmark using `s3fs-fuse` directly with the `max_thread_count=1` option set, to see how that impacted performance. @robertbartel noted above that he noticed the plugin was using 100% CPU; my thinking was that this option would hopefully constrain `s3fs-fuse` in a similar way. However, the `iozone` benchmarks were no different than with `max_thread_count=5` (the default). This is likely a false signal: `iozone` is probably not creating enough work for `s3fs-fuse` to become bottlenecked.
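To probe the thread-count question further, a benchmark that issues many writes concurrently (closer to ngen's behavior than `iozone`'s single-file passes) might create enough parallel work to expose a bottleneck. A minimal, hypothetical sketch (point `dirpath` at the FUSE mount to test it):

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def write_file(path: str, size_kb: int) -> float:
    """Write one file of size_kb KB in 1 KB records; return elapsed seconds."""
    buf = os.urandom(1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_kb):
            f.write(buf)
    return time.perf_counter() - start


def parallel_write(dirpath: str, n_files: int = 8, size_kb: int = 256,
                   workers: int = 8) -> list[float]:
    """Issue n_files writes concurrently, mimicking ngen writing many outputs at once."""
    paths = [os.path.join(dirpath, f"out_{i}.csv") for i in range(n_files)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(write_file, paths, [size_kb] * n_files))
```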
A few more observations during experiments, with respect to job output CSV files totaling 2.3 GB in size:

- `minio-client` from bare-metal host to bucket
- `minio-client` as above
So we appear to be paying a significant penalty for the plugin, but it is nowhere near the penalty paid for dealing with multiple files. In particular, this suggests the addition of NetCDF output from ngen (see NOAA-OWP/ngen#714 and NOAA-OWP/ngen#744) will be particularly useful.
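The multiple-file penalty is easy to illustrate even without an object store in the loop; per-file open/close overhead is exactly what a single NetCDF output eliminates, and on a FUSE-backed object store each of those opens becomes a round trip. A purely local, hypothetical sketch comparing the two layouts:

```python
import os
import time


def write_many(dirpath: str, n_files: int, kb_each: int) -> float:
    """Write n_files separate files (the CSV-per-catchment layout); return seconds."""
    buf = os.urandom(kb_each * 1024)
    start = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(dirpath, f"cat-{i}.csv"), "wb") as f:
            f.write(buf)
    return time.perf_counter() - start


def write_one(dirpath: str, n_files: int, kb_each: int) -> float:
    """Write the same total bytes to a single file (the NetCDF-style layout)."""
    buf = os.urandom(kb_each * 1024)
    start = time.perf_counter()
    with open(os.path.join(dirpath, "all_outputs.nc"), "wb") as f:
        for _ in range(n_files):
            f.write(buf)
    return time.perf_counter() - start
```

On local disk the gap between the two is modest; the point is that the many-files variant performs `n_files` times as many opens and closes, which is where the object store pays.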
I got curious over lunch about how `mountpoint-s3` would perform. As stated in its docs, it's intended for large-file, read-heavy workloads. Here is an `iozone` benchmark of the read performance:

Notice the 1K file size / 1K record size cell at ~16 Gb/sec. This is nearly double the performance of `s3fs-fuse`, which maxed out reading at ~9 Gb/sec. 1K is roughly the MTU of a TCP packet, so that might make sense? The performance really drops off in adjacent file size / record size combinations, though.
I just found out that, prior to the availability of parallel file systems, models had to have all their inputs duplicated and copied to different nodes for MPI to work. As unpalatable as it might be to rely on large-scale data duplication, that might be the safest / most efficient way of running models via DMOD.
In recent testing, analogous jobs took several orders of magnitude longer when workers used object-store-backed datasets compared to if they used data supplied via local (manually synced) bind mounts. Rough numbers for one example are 5 hours compared to 5 minutes.
These tests involved a single node with 24 processors and 96 GB of memory, running both the services and workers. I.e., it's entirely possible the performance deficiency is heavily related to the compute infrastructure being insufficient for running jobs and an object store like this (initial development of these features involved 2-5 data-center-class servers). On the other hand, this is not an unrepresentative use-case for what we want DMOD to do.
Some investigation is needed as to whether there are any viable means to at least reasonably mitigate these issues (i.e., longer runtimes on lesser hardware are to be expected, but going from 5 minutes to 4 hours just isn't acceptable). Unless these performance issues are drastically reduced, we should likely accelerate plans for developing alternative dataset backends, such as cluster volumes within the Docker Swarm (see #593).