NOAA-OWP / DMOD

Distributed Model on Demand infrastructure for OWP's Model as a Service

Investigate IO performance issue with workers and object-store-backed datasets #637

Closed robertbartel closed 2 months ago

robertbartel commented 4 months ago

In recent testing, analogous jobs took orders of magnitude longer when workers used object-store-backed datasets than when they used data supplied via local (manually synced) bind mounts. Rough numbers for one example: 5 hours compared to 5 minutes.

These tests involved a single node with 24 processors and 96 GB of memory, running both the services and the workers. That is, it's entirely possible the performance deficiency is heavily related to the compute infrastructure being insufficient for running jobs plus an object store like this (initial development of these features involved 2-5 data-center-class servers). On the other hand, this is not an unrepresentative use case for what we want DMOD to do.

Some investigation is needed as to whether there are any viable means to at least reasonably mitigate these issues (i.e., longer runtimes on lesser hardware are to be expected, but going from 5 minutes to 4 hours just isn't acceptable). Unless these performance issues are drastically reduced, we should likely accelerate plans for developing alternative dataset backends, such as cluster volumes within the Docker Swarm (see #593).

christophertubbs commented 4 months ago

Caching may help, but memory configuration and VM vs container vs bare metal may make a difference. Totally different products, but CEPH had issues if it had too much RAM. It had something to do with data replication. Don't remember - it was a nightmare. Reading and writing from the object store may also cause problems due to rapid io on the part of ngen. If there's just a little bit of extra overhead on the object store calls and ngen is opening and closing a bajillion files, it might be crippling. If it comes down to it, a pre and post processing step to stage the required data locally might be a solution. That'll be its own nightmare.

robertbartel commented 4 months ago

Reading and writing from the object store may also cause problems due to rapid io on the part of ngen. If there's just a little bit of extra overhead on the object store calls and ngen is opening and closing a bajillion files, it might be crippling.

This is basically my initial guess for the problem: too much IO, applied to only a single MinIO node now (due to other restrictions; see #242 et al.), running on older, less robust hardware. It's not out of the question that the object store proxy or Docker storage plugin configs could be sub-optimal, or that something else is going on, but I'm not getting my hopes up.

If it comes down to it, a pre and post processing step to stage the required data locally might be a solution. That'll be its own nightmare.

I suspect the best way to do this would actually be to just support another dataset type and add code to the data service to transform to/from such a dataset, similar to how we handle other dataset derivations now. That should also make it easy to expose these transformations as operations in their own right. Doing things purely within the job workers would probably be slower, would be less reusable, and might compel us to add other complicated functionality, like managing and allocating disk space the way we do CPUs.
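As a very rough sketch of the idea (this is not DMOD's actual data service API, and the alias, bucket, and path names are made up), staging could amount to mirroring a dataset's backing bucket to node-local storage before the job runs and mirroring outputs back afterward, e.g. with the MinIO client:

# hypothetical staging step: copy an object-store-backed dataset to local scratch
mc mirror local/forcing-dataset /scratch/job-1234/forcing
# ... run the job against /scratch/job-1234 via plain bind mounts ...
# hypothetical un-staging step: push job outputs back into an object-store-backed dataset
mc mirror /scratch/job-1234/output local/output-dataset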

robertbartel commented 4 months ago

After running another experiment yesterday, I am a little more concerned that the choke point is actually the s3fs Docker plugin itself that is needed for the volumes. I am also considering an alternative to just the plugin part of the object store dataset setup.

The experiment was done using my local, 2-node deployment, with the MinIO and proxy services running on one node and the ngen job constrained to the other. I also left the forcing data out of the object store, using a local bind mount for it, but the other datasets - in particular, the output dataset being written to - were s3fs plugin volumes.

When things actually got to executing timesteps, I observed that the ngen and minio processes were not being heavily taxed: roughly 1-10% CPU for the ngen processes and 20-30% CPU for minio. I also watched the network, and it wasn't being saturated (I forget exactly how much traffic, but I want to say high Kb). But the s3fs process on the Swarm node with the job service container (i.e., where the plugin of interest was) was stuck at 100% CPU.
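Roughly, that observation can be reproduced with generic commands like these on the Swarm node hosting the job container (these are not the exact commands used):

docker plugin ls                     # confirm the s3fs volume plugin is installed and enabled
ps aux | grep '[s]3fs'               # find the plugin's s3fs process(es) on the node
top -p "$(pidof s3fs | tr ' ' ',')"  # watch CPU usage; here it sat pinned at ~100%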

I have a few ideas. The first few involve finding ways to get the volume plugin more than 1 CPU. So far, I haven't been able to find a simple way to allocate more resources to a plugin. I've thought about using several plugins via aliasing, but I don't think there's a good way to break things up for a single dataset that way, and the writes to the output dataset seem to be what's causing the load. I've also started looking at further custom modifications to the Docker plugin itself, though I've only just started.

It occurred to me though that perhaps we should consider moving away from the plugin entirely and back to FUSE mounts directly within the container. This was the original design for using object store datasets in the workers. I suspect performance would be better, and I'm very curious how much.

Of course, there was a reason why we moved away from this: it might not be feasible for some environments. The current plugin-based method is more universally compatible, so we would need to continue to support that and add a way to configure which mechanism for mounting object store datasets would be used by a deployment. And we might want to combine this idea with some of the other pieces also: e.g., dividing datasets across multiple plugins via aliasing, if the plugin method was being used.

aaraney commented 4 months ago

It occurred to me though that perhaps we should consider moving away from the plugin entirely and back to FUSE mounts directly within the container. This was the original design for using object store datasets in the workers. I suspect performance would be better, and I'm very curious how much.

I was also curious about this. Here is what I've done so far:

To start, the goal of my exploration / work was to benchmark the IO performance of an s3fs-fuse mounted volume. Namely, I was interested in read and write throughput, as those are the primary workloads.

The aim for this initial investigation is to keep things simple, but reproducible, so that it is easy to run these benchmarks elsewhere. With that in mind, here are the hardware specs for the machine I used for testing:

OS: macOS Ventura 13.4.1 22F82 arm64
Host: MacBook Pro (14-inch, 2023)
Kernel: 22.5.0
CPU: Apple M2 Pro (10)
GPU: Apple M2 Pro (16) [Integrated]
Memory: 32.00 GiB (LPDDR5)
Disk (/): 460.00 GiB - apfs [Read-only]

All testing was done via docker, so in the same spirit, here is a truncated output of docker version from the testing machine:

Client:
 Cloud integration: v1.0.35+desktop.13
 Version:           26.1.1
 API version:       1.45
Server: Docker Desktop 4.30.0 (149282)
 Engine:
  Version:          26.1.1
  API version:      1.45 (minimum version 1.24)

For these tests, the Docker Desktop VM was allocated 64 GB of virtual disk, 4 vCPUs, 12 GB of memory, and 4 GB of swap.

The "methodology" for the benchmark is pretty straightforward:

  1. run a docker container with minio server
  2. then run another docker container that has s3fs-fuse installed and create a FUSE-mounted directory backed by a bucket in the minio object store
  3. finally, run some IO benchmarking software in that directory to measure the performance of the FUSE-mounted filesystem.

Glossing over the first two steps for now: after doing some searching, I found several mentions of iozone, a filesystem benchmark tool. The iozone docs specifically talk about benchmarking an NFS drive, so it seemed like at least a good place to start. iozone doesn't look like it can, for example, benchmark IO over a large quantity of files (e.g., the case of a BMI init config dataset), but it does give us pretty much what we want in the throughput metrics it provides. More specifically, iozone tests a variety of IO operations while varying the size of the file and the size of the record, and it records the throughput in Kb/second. I interpreted the record size to mean the size of an individual operation (e.g., a 2k file would be read twice with a 1k record size). The docs are pretty thorough and a great resource; here is a link for those interested.

TL;DR: use iozone for step 3.

The s3fs-fuse GitHub page links to a project that builds and publishes s3fs-fuse as a Docker image. At the time of writing, the latest version of s3fs-fuse is 1.94, so I grabbed the efrecon/s3fs image with that tag. Unfortunately, the image uses a slightly older version of Alpine Linux, and its default apk repos did not have iozone. Luckily, a newer version of Alpine (one patch version newer) does have a distributable iozone package; I ended up using that for simplicity.

Open two terminal sessions and follow these steps to run the benchmark:

docker network create iozone_net

mkdir data
docker run \
    --rm \
    --name=minio \
    --net iozone_net \
    --expose=9000 --expose=9001 \
    -p 9000:9000 -p 9001:9001 \
    -v $(pwd)/data:/export minio/minio:RELEASE.2022-10-08T20-11-00Z \
    server /export --console-address ":9001"

Then, in a second terminal, run:

mc mb local/iozone
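(If the local alias isn't already configured for the MinIO client on the host, set it up first with something like the following, using the default minioadmin credentials that the s3fs container below also uses:)

mc alias set local http://localhost:9000 minioadmin minioadmin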

docker run -it --rm \
    --net iozone_net \
    --device /dev/fuse \
    --cap-add SYS_ADMIN \
    --security-opt "apparmor=unconfined" \
    --env 'AWS_ACCESS_KEY_ID=minioadmin' \
    --env 'AWS_SECRET_ACCESS_KEY=minioadmin' \
    --env UID=$(id -u) \
    --env GID=$(id -g) \
    --entrypoint=/bin/sh \
    efrecon/s3fs:1.94

# everything below is run from within the running container

arch=$(uname -m)
wget "http://dl-cdn.alpinelinux.org/alpine/v3.19/community/${arch}/iozone-3.506-r0.apk"
apk add --allow-untrusted iozone-3.506-r0.apk

mkdir iozone_bench
s3fs iozone /opt/s3fs/iozone_bench \
    -o url=http://minio:9000/ \
    -o use_path_request_style

cd iozone_bench
iozone -Rac
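
For reference, the flags in that last command break down as follows (per the iozone documentation); the automatic-mode sweep is what produces the file and record sizes seen in the reports below:

# -R  generate an Excel-compatible report
# -a  automatic mode: file sizes from 64 KB to 512 MB, record sizes from 4 KB to 16 MB
# -c  include close() in the timing calculations (relevant for remote/NFS-like filesystems)
iozone -Rac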

iozone could take a little while to run, but once it has finished, it should output a copy-and-pasteable, Excel-compliant result. I copied the output to a file, replaced all instances of 2 or more spaces with a tab character (see the one-liner below), and used the following python script to get the results into an excel format:

You will need to pip install pandas openpyxl for this to work. Replace FILENAME with what is appropriate for you.
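As an aside, the whitespace-to-tab replacement can be done with a one-liner like this before running the script (the input file name here is just an example):

perl -pe 's/ {2,}/\t/g' iozone_output.txt > excel_out.tsv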

import io
import pathlib

import pandas as pd

FILENAME  = "./excel_out.tsv"

data_file = pathlib.Path(FILENAME)
data = data_file.read_text()

with pd.ExcelWriter(f"{data_file.stem}.xlsx") as writer:
    for sheet in data.split("\n\n"):
        name_idx = sheet.find("\n")
        name = sheet[:name_idx].strip('"')
        sheet_data = sheet[name_idx+1:]

        df = pd.read_table(io.StringIO(sheet_data))
        df.rename(columns={"Unnamed: 0": "File Size Kb"}, inplace=True)
        df.set_index("File Size Kb", inplace=True)
        df.columns.name = "Record Size Kb"
        df = df / 1024 # translate records from Kb/sec to Mb/sec
        df.to_excel(writer, sheet_name=name)

The output spreadsheet will have sheets for each of the io benchmarks (e.g. Writer report).

Writer report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 13 25 30 28 27
128 31 28 31 31 30 41
256 49 50 64 51 54 56 59
512 67 81 86 74 101 97 101 97
1024 77 95 115 133 119 129 139 125 141
2048 92 116 142 160 165 155 175 170 168 174
4096 90 121 155 174 181 198 186 188 185 193 188
8192 86 124 168 188 198 198 204 202 205 205 201 205
16384 85 127 157 190 205 211 209 216 211 217 211 200 209
32768 0 0 0 0 205 243 243 310 340 257 306 378 268
65536 0 0 0 0 279 308 327 363 364 278 336 281 348
131072 0 0 0 0 337 329 413 384 400 416 348 387 337
262144 0 0 0 0 357 419 456 445 430 452 467 454 322
524288 0 0 0 0 399 445 463 434 409 424 407 461 453

Reader report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 497 822 719 1786 1786
128 874 919 702 1543 3477 2719
256 1373 1525 980 1437 2379 2746 3960
512 1137 1587 1630 179 2485 2065 3568 9809
1024 2062 2295 2160 1698 2369 2232 3311 4220 5747
2048 1547 1843 1695 1875 2212 2370 2280 3072 3976 7525
4096 1895 1847 1983 2785 3080 2497 2817 3058 2751 2541 8565
8192 2587 2459 2001 1978 2140 2502 2304 2166 2280 2896 3749 7888
16384 2476 2506 2523 2504 2570 2504 2360 2500 2749 2808 3098 3398 4342
32768 0 0 0 0 2632 2601 2647 2601 2675 2814 2974 2942 3311
65536 0 0 0 0 1930 1856 1902 1496 1489 1894 1911 1889 1900
131072 0 0 0 0 1534 1577 1545 1548 1560 1560 1611 1472 1495
262144 0 0 0 0 1381 1342 1395 1413 1402 1407 1400 1323 1331
524288 0 0 0 0 1372 1354 1366 1353 1366 1362 1360 1286 1238

For comparison's sake, here are "baseline" iozone benchmark runs from an unmounted directory, /home, in the same docker container:
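In other words, the same invocation run again from local container storage rather than through the FUSE mount:

cd /home
iozone -Rac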

Baseline Writer report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 473 2084 1116 1177 3702
128 1359 1179 2150 6256 6570 6570
256 2578 4391 4644 7609 3844 7143 7837
512 4058 5878 3122 7831 7802 6172 2793 3623
1024 3814 4526 4425 3367 6618 4719 3558 4828 6800
2048 2805 6454 6370 6006 7607 8099 3520 8060 7908 4794
4096 4198 1913 4329 3718 7662 5235 5256 7903 7998 7693 7170
8192 2481 5256 4287 6546 5662 5174 5989 5899 5965 6136 1703 2641
16384 3776 5305 5084 4992 4232 4201 4853 6038 6382 6054 5569 4212 3887
32768 0 0 0 0 3822 4298 6078 5873 5615 5502 5615 5170 3756
65536 0 0 0 0 4651 5625 4872 5759 4922 5497 5090 4923 3860
131072 0 0 0 0 3997 5644 5634 5583 5738 5420 5266 2513 3943
262144 0 0 0 0 3486 5535 6271 6186 6377 6245 6223 5685 4587
524288 0 0 0 0 4014 6914 6834 7174 7086 7191 7013 6767 4876

Baseline Reader report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 3458 6934 4535 12022 20471
128 8916 8348 5472 15509 20317 25199
256 9637 14017 13832 16696 21053 14603 27736
512 9246 14685 14383 17954 19099 16156 13206 22843
1024 7148 10760 13538 16657 12490 15153 13538 13495 27813
2048 8518 13622 16138 16390 18526 19073 18526 21034 22995 28552
4096 9458 11867 14136 18102 18260 16260 15395 19134 20194 20415 18522
8192 7897 11680 11696 14384 13113 15041 12442 12158 12620 11976 11283 11189
16384 6864 7296 9726 7099 8744 9435 8959 8252 8682 9201 8475 6457 6004
32768 0 0 0 0 7114 7972 8417 8294 9143 7294 7035 5957 4332
65536 0 0 0 0 7462 7423 8094 7011 7726 6446 6289 4453 3547
131072 0 0 0 0 8645 8837 8797 8783 8354 8029 6252 4184 3635
262144 0 0 0 0 10142 10754 10649 11083 9718 9578 8220 5096 3520
524288 0 0 0 0 13191 11950 13653 14012 12545 10790 10218 6516 3998

I was not overly shocked to see that there is over an order of magnitude difference between the fuse mount and local storage across the board, but I also didn't think local performance would be this poor. The fastest fuse mount combination for

aaraney commented 4 months ago

Next, I took a look at the s3fs-volume-plugin. After installing and configuring the plugin, I created a container with a volume mounted from minio. Below are heatmaps of iozone's writer and reader reports. Values are in Mb/second.
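For context, plugin-backed volumes are generally set up along these lines; the plugin name and its option keys here are placeholders, not the exact ones used:

# hypothetical plugin name and option keys; these vary by plugin build
docker plugin install example/s3fs-volume-plugin --grant-all-permissions
docker volume create -d example/s3fs-volume-plugin \
    -o url=http://minio:9000 -o bucket=iozone \
    iozone_vol
# run the benchmark container against the plugin-backed volume instead of an in-container mount
docker run -it --rm --entrypoint=/bin/sh \
    -v iozone_vol:/opt/s3fs/iozone_bench \
    efrecon/s3fs:1.94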

There is a substantial drop off in both writing and reading performance compared to the more direct s3fs-fuse tests above.

[image: heatmaps of the iozone writer and reader reports for the s3fs-volume-plugin-backed volume]

As a sanity check, I ran a benchmark using s3fs-fuse directly with the max_thread_count=1 option set to see how that impacted performance. @robertbartel noted above that he observed the plugin using 100% CPU, and my thinking was that this option would hopefully constrain s3fs-fuse in a similar way. However, the iozone benchmarks were no different than with max_thread_count=5 (the default). This is likely a false signal: iozone is probably not creating enough work for s3fs-fuse to become bottlenecked.
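For reference, that sanity check amounts to adding the option to the earlier mount command, roughly like this (not the verbatim command used):

s3fs iozone /opt/s3fs/iozone_bench \
    -o url=http://minio:9000/ \
    -o use_path_request_style \
    -o max_thread_count=1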

robertbartel commented 3 months ago

A few more observations from experiments with respect to job output CSV files totaling 2.3 GB in size:

So we appear to be paying a significant penalty for the plugin, but it is nowhere near the penalty paid for dealing with multiple files. In particular, this suggests the addition of NetCDF output from ngen (see NOAA-OWP/ngen#714 and NOAA-OWP/ngen#744) will be particularly useful.

aaraney commented 3 months ago

I got curious over lunch about how mountpoint-s3 would perform. As stated in their docs, it's intended for large-file, read-heavy workloads. Here is an iozone benchmark of the read performance:

[image: heatmap of the iozone reader report for a mountpoint-s3 mount]

Notice the ~16 Gb/sec at the 1K file size / 1K record size combination. This is nearly double the performance of s3fs-fuse, which maxed out at ~9 Gb/sec reading.
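For anyone wanting to repeat this against the same MinIO container, a mountpoint-s3 mount should look roughly like the following; the --endpoint-url and --force-path-style flags reflect my reading of the mountpoint-s3 options for an S3-compatible endpoint, and this is not the verbatim command used here:

export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
mount-s3 iozone ./iozone_bench \
    --endpoint-url http://minio:9000 \
    --force-path-style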

aaraney commented 3 months ago

1K is roughly the MTU of a TCP packet so that might make sense? The performance really drops off in adjacent file size / record size combinations though.

christophertubbs commented 3 months ago

I just found out that, prior to the availability of parallel file systems, models had to have all their inputs duplicated and copied to the different nodes for MPI to work. As unpalatable as it might be to rely on large-scale data duplication, that might be the safest/most efficient way of running models via DMOD.