NOAA-OWP / DMOD

Distributed Model on Demand infrastructure for OWP's Model as a Service

Investigate IO performance issue with workers and object-store-backed datasets #637

Closed robertbartel closed 2 months ago

robertbartel commented 4 months ago

In recent testing, analogous jobs took orders of magnitude longer when workers used object-store-backed datasets than when they used data supplied via local (manually synced) bind mounts. Rough numbers for one example: 5 hours compared to 5 minutes.

These tests involved a single node with 24 processors and 96 GB of memory, running both the services and the workers. That is, it's entirely possible the performance deficiency is heavily related to the compute infrastructure being insufficient for running jobs plus an object store like this (initial development of these features involved 2-5 data-center-class servers). On the other hand, this is not an unrepresentative use case for what we want DMOD to do.

Some investigation is needed as to whether there are any viable means to at least reasonably mitigate these issues (i.e., longer runtimes on lesser hardware are to be expected, but going from 5 minutes to 4 hours just isn't acceptable). Unless these performance issues are drastically reduced, we should likely accelerate plans for developing alternative dataset backends, such as cluster volumes within the Docker Swarm (see #593).

christophertubbs commented 4 months ago

Caching may help, but memory configuration and VM vs container vs bare metal may make a difference. Totally different products, but CEPH had issues if it had too much RAM. It had something to do with data replication. Don't remember - it was a nightmare. Reading and writing from the object store may also cause problems due to rapid io on the part of ngen. If there's just a little bit of extra overhead on the object store calls and ngen is opening and closing a bajillion files, it might be crippling. If it comes down to it, a pre and post processing step to stage the required data locally might be a solution. That'll be its own nightmare.

robertbartel commented 4 months ago

Reading and writing from the object store may also cause problems due to rapid io on the part of ngen. If there's just a little bit of extra overhead on the object store calls and ngen is opening and closing a bajillion files, it might be crippling.

This is basically my initial guess for the problem: too much IO, applied to only a single MinIO node now (due to other restrictions; see #242 et al.), running on older, less robust hardware. It's not out of the question that the object store proxy or Docker storage plugin configs could be sub-optimal, or that something else is going on, but I'm not getting my hopes up.

If it comes down to it, a pre and post processing step to stage the required data locally might be a solution. That'll be its own nightmare.

I suspect the best way to do this would actually be to just support another dataset type and add code to the data service to transform to/from such a dataset, similar to how we handle other dataset derivations now. That should also make it easy to expose these transformations as operations in their own right. Doing things purely within the job workers would probably be slower, would be less reusable, and might compel us to add other complicated functionality, like managing and allocating disk space the way we do CPUs.
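As a very rough sketch of the idea (this is not DMOD's actual data service API, and the alias, bucket, and path names are made up), staging could amount to mirroring a dataset's backing bucket to node-local storage before the job runs and mirroring outputs back afterward, e.g. with the MinIO client:

# hypothetical staging step: copy an object-store-backed dataset to local scratch
mc mirror local/forcing-dataset /scratch/job-1234/forcing
# ... run the job against /scratch/job-1234 via plain bind mounts ...
# hypothetical un-staging step: push job outputs back into an object-store-backed dataset
mc mirror /scratch/job-1234/output local/output-dataset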

robertbartel commented 4 months ago

After running another experiment yesterday, I am a little more concerned that the choke point is actually the s3fs Docker plugin itself that is needed for the volumes. I am also considering an alternative to just the plugin part of the object store dataset setup.

The experiment was done using my local, 2-node deployment, with the MinIO and proxy services running on one node and the ngen job constrained to the other. I also left the forcing data out of the object store, using a local bind mount for it, but the other datasets - in particular, the output dataset being written to - were s3fs plugin volumes.

When things actually got to executing timesteps, I observed that the ngen and minio processes were not being heavily taxed: roughly 1-10% CPU for the ngen processes and 20-30% CPU for minio. I also watched the network, and it wasn't being saturated (I forget exactly how much traffic, but I want to say high Kb). But the s3fs process on the Swarm node with the job service container (i.e., where the plugin of interest was) was stuck at 100% CPU.
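Roughly, that observation can be reproduced with generic commands like these on the Swarm node hosting the job container (these are not the exact commands used):

docker plugin ls                     # confirm the s3fs volume plugin is installed and enabled
ps aux | grep '[s]3fs'               # find the plugin's s3fs process(es) on the node
top -p "$(pidof s3fs | tr ' ' ',')"  # watch CPU usage; here it sat pinned at ~100%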

I have a few ideas. The first few involve finding ways to get the volume plugin more than 1 CPU. So far, I haven't been able to find a simple way to allocate more resources to a plugin. I've thought about using several plugins via aliasing, but I don't think there's a good way to break things up for a single dataset that way, and the writes to the output dataset seem to be what's causing the load. I've also started looking at further custom modifications to the Docker plugin itself, though I've only just started.

It occurred to me though that perhaps we should consider moving away from the plugin entirely and back to FUSE mounts directly within the container. This was the original design for using object store datasets in the workers. I suspect performance would be better, and I'm very curious how much.

Of course, there was a reason why we moved away from this: it might not be feasible for some environments. The current plugin-based method is more universally compatible, so we would need to continue to support that and add a way to configure which mechanism for mounting object store datasets would be used by a deployment. And we might want to combine this idea with some of the other pieces also: e.g., dividing datasets across multiple plugins via aliasing, if the plugin method was being used.

aaraney commented 4 months ago

It occurred to me though that perhaps we should consider moving away from the plugin entirely and back to FUSE mounts directly within the container. This was the original design for using object store datasets in the workers. I suspect performance would be better, and I'm very curious how much.

I was also curious about this. Here is what I've done so far:

To start, the goal of my exploration / work was to benchmark the IO performance of an s3fs-fuse mounted volume. Namely, I was interested in read and write throughput, as those are the primary workloads.

The aim for this initial investigation is to keep things simple, but reproducible, so that it is easy to run these benchmarks elsewhere. With that in mind, here are the hardware specs for the machine I used for testing:

OS: macOS Ventura 13.4.1 22F82 arm64
Host: MacBook Pro (14-inch, 2023)
Kernel: 22.5.0
CPU: Apple M2 Pro (10)
GPU: Apple M2 Pro (16) [Integrated]
Memory: 32.00 GiB (LPDDR5)
Disk (/): 460.00 GiB - apfs [Read-only]

All testing was done via docker, so in the same spirit, here is a truncated output of docker version from the testing machine:

Client:
 Cloud integration: v1.0.35+desktop.13
 Version:           26.1.1
 API version:       1.45
Server: Docker Desktop 4.30.0 (149282)
 Engine:
  Version:          26.1.1
  API version:      1.45 (minimum version 1.24)

For these tests, the Docker Desktop VM was allocated 64 GB of virtual disk, 4 vCPUs, 12 GB of memory, and 4 GB of swap.

The "methodology" for the benchmark is pretty straightforward:

  1. run a docker container with minio server
  2. then run another docker container that has s3fs-fuse installed and create a FUSE-mounted directory backed by a bucket in the minio object store
  3. finally, run some IO benchmarking software in that directory to measure the performance of the FUSE-mounted filesystem.

Glossing over the first two steps for now: after doing some searching, I found several mentions of iozone, a filesystem benchmark tool. The iozone docs specifically talk about benchmarking an NFS drive, so it seemed like at least a good place to start. iozone doesn't look like it can, for example, benchmark IO over a large quantity of files (e.g., the case of a BMI init config dataset), but it does give us pretty much what we want in the throughput metrics it provides. More specifically, iozone tests a variety of IO operations while varying the size of the file and the size of the record, and it records the throughput in Kb/second. I interpreted the record size to mean the size of an individual operation (e.g., a 2k file would be read twice with a 1k record size). The docs are pretty thorough and a great resource; here is a link for those interested.

TL;DR: use iozone for step 3.

The s3fs-fuse GitHub page links to a project that builds and publishes s3fs-fuse as a Docker image. At the time of writing, the latest version of s3fs-fuse is 1.94, so I grabbed the efrecon/s3fs image with that tag. Unfortunately, the image uses a slightly older version of Alpine Linux, and its default apk repos did not have iozone. Luckily, a newer version of Alpine (one patch version newer) does have a distributable iozone package; I ended up using that for simplicity.

Open two terminal sessions and follow these steps to run the benchmark:

docker network create iozone_net

mkdir data
docker run \
    --rm \
    --name=minio \
    --net iozone_net \
    --expose=9000 --expose=9001 \
    -p 9000:9000 -p 9001:9001 \
    -v $(pwd)/data:/export minio/minio:RELEASE.2022-10-08T20-11-00Z \
    server /export --console-address ":9001"

Then, in a second terminal, run:

mc mb local/iozone
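(If the local alias isn't already configured for the MinIO client on the host, set it up first with something like the following, using the default minioadmin credentials that the s3fs container below also uses:)

mc alias set local http://localhost:9000 minioadmin minioadmin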

docker run -it --rm \
    --net iozone_net \
    --device /dev/fuse \
    --cap-add SYS_ADMIN \
    --security-opt "apparmor=unconfined" \
    --env 'AWS_ACCESS_KEY_ID=minioadmin' \
    --env 'AWS_SECRET_ACCESS_KEY=minioadmin' \
    --env UID=$(id -u) \
    --env GID=$(id -g) \
    --entrypoint=/bin/sh \
    efrecon/s3fs:1.94

# everything below is run from within the running container

arch=$(uname -m)
wget "http://dl-cdn.alpinelinux.org/alpine/v3.19/community/${arch}/iozone-3.506-r0.apk"
apk add --allow-untrusted iozone-3.506-r0.apk

mkdir iozone_bench
s3fs iozone /opt/s3fs/iozone_bench \
    -o url=http://minio:9000/ \
    -o use_path_request_style

cd iozone_bench
iozone -Rac
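
For reference, the flags in that last command break down as follows (per the iozone documentation); the automatic-mode sweep is what produces the file and record sizes seen in the reports below:

# -R  generate an Excel-compatible report
# -a  automatic mode: file sizes from 64 KB to 512 MB, record sizes from 4 KB to 16 MB
# -c  include close() in the timing calculations (relevant for remote/NFS-like filesystems)
iozone -Rac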

iozone could take a little while to run, but once it has finished, it should output a copy-and-pasteable, Excel-compliant result. I copied the output to a file, replaced all instances of 2 or more spaces with a tab character (see the one-liner below), and used the following python script to get the results into an excel format:

You will need to pip install pandas openpyxl for this to work. Replace FILENAME with what is appropriate for you.
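As an aside, the whitespace-to-tab replacement can be done with a one-liner like this before running the script (the input file name here is just an example):

perl -pe 's/ {2,}/\t/g' iozone_output.txt > excel_out.tsv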

import io
import pathlib

import pandas as pd

FILENAME  = "./excel_out.tsv"

data_file = pathlib.Path(FILENAME)
data = data_file.read_text()

with pd.ExcelWriter(f"{data_file.stem}.xlsx") as writer:
    for sheet in data.split("\n\n"):
        name_idx = sheet.find("\n")
        name = sheet[:name_idx].strip('"')
        sheet_data = sheet[name_idx+1:]

        df = pd.read_table(io.StringIO(sheet_data))
        df.rename(columns={"Unnamed: 0": "File Size Kb"}, inplace=True)
        df.set_index("File Size Kb", inplace=True)
        df.columns.name = "Record Size Kb"
        df = df / 1024 # translate records from Kb/sec to Mb/sec
        df.to_excel(writer, sheet_name=name)

The output spreadsheet will have sheets for each of the io benchmarks (e.g. Writer report).

Writer report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 13 25 30 28 27
128 31 28 31 31 30 41
256 49 50 64 51 54 56 59
512 67 81 86 74 101 97 101 97
1024 77 95 115 133 119 129 139 125 141
2048 92 116 142 160 165 155 175 170 168 174
4096 90 121 155 174 181 198 186 188 185 193 188
8192 86 124 168 188 198 198 204 202 205 205 201 205
16384 85 127 157 190 205 211 209 216 211 217 211 200 209
32768 0 0 0 0 205 243 243 310 340 257 306 378 268
65536 0 0 0 0 279 308 327 363 364 278 336 281 348
131072 0 0 0 0 337 329 413 384 400 416 348 387 337
262144 0 0 0 0 357 419 456 445 430 452 467 454 322
524288 0 0 0 0 399 445 463 434 409 424 407 461 453

Reader report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 497 822 719 1786 1786
128 874 919 702 1543 3477 2719
256 1373 1525 980 1437 2379 2746 3960
512 1137 1587 1630 179 2485 2065 3568 9809
1024 2062 2295 2160 1698 2369 2232 3311 4220 5747
2048 1547 1843 1695 1875 2212 2370 2280 3072 3976 7525
4096 1895 1847 1983 2785 3080 2497 2817 3058 2751 2541 8565
8192 2587 2459 2001 1978 2140 2502 2304 2166 2280 2896 3749 7888
16384 2476 2506 2523 2504 2570 2504 2360 2500 2749 2808 3098 3398 4342
32768 0 0 0 0 2632 2601 2647 2601 2675 2814 2974 2942 3311
65536 0 0 0 0 1930 1856 1902 1496 1489 1894 1911 1889 1900
131072 0 0 0 0 1534 1577 1545 1548 1560 1560 1611 1472 1495
262144 0 0 0 0 1381 1342 1395 1413 1402 1407 1400 1323 1331
524288 0 0 0 0 1372 1354 1366 1353 1366 1362 1360 1286 1238

For comparison's sake, here are "baseline" iozone benchmark runs from an unmounted directory, /home, in the same docker container:
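In other words, the same invocation run again from local container storage rather than through the FUSE mount:

cd /home
iozone -Rac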

Baseline Writer report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 473 2084 1116 1177 3702
128 1359 1179 2150 6256 6570 6570
256 2578 4391 4644 7609 3844 7143 7837
512 4058 5878 3122 7831 7802 6172 2793 3623
1024 3814 4526 4425 3367 6618 4719 3558 4828 6800
2048 2805 6454 6370 6006 7607 8099 3520 8060 7908 4794
4096 4198 1913 4329 3718 7662 5235 5256 7903 7998 7693 7170
8192 2481 5256 4287 6546 5662 5174 5989 5899 5965 6136 1703 2641
16384 3776 5305 5084 4992 4232 4201 4853 6038 6382 6054 5569 4212 3887
32768 0 0 0 0 3822 4298 6078 5873 5615 5502 5615 5170 3756
65536 0 0 0 0 4651 5625 4872 5759 4922 5497 5090 4923 3860
131072 0 0 0 0 3997 5644 5634 5583 5738 5420 5266 2513 3943
262144 0 0 0 0 3486 5535 6271 6186 6377 6245 6223 5685 4587
524288 0 0 0 0 4014 6914 6834 7174 7086 7191 7013 6767 4876

Baseline Reader report (rows: file size in KB, columns: record size in KB, values in MB/s):

4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384
64 3458 6934 4535 12022 20471
128 8916 8348 5472 15509 20317 25199
256 9637 14017 13832 16696 21053 14603 27736
512 9246 14685 14383 17954 19099 16156 13206 22843
1024 7148 10760 13538 16657 12490 15153 13538 13495 27813
2048 8518 13622 16138 16390 18526 19073 18526 21034 22995 28552
4096 9458 11867 14136 18102 18260 16260 15395 19134 20194 20415 18522
8192 7897 11680 11696 14384 13113 15041 12442 12158 12620 11976 11283 11189
16384 6864 7296 9726 7099 8744 9435 8959 8252 8682 9201 8475 6457 6004
32768 0 0 0 0 7114 7972 8417 8294 9143 7294 7035 5957 4332
65536 0 0 0 0 7462 7423 8094 7011 7726 6446 6289 4453 3547
131072 0 0 0 0 8645 8837 8797 8783 8354 8029 6252 4184 3635
262144 0 0 0 0 10142 10754 10649 11083 9718 9578 8220 5096 3520
524288 0 0 0 0 13191 11950 13653 14012 12545 10790 10218 6516 3998

I was not overly shocked to see that there is over an order of magnitude difference between the fuse mount and local storage across the board, but I also didn't think local performance would be this poor. The fastest fuse mount combination for

aaraney commented 4 months ago

Next, I took a look at the s3fs-volume-plugin. After installing and configuring the plugin, I created a container with a volume mounted from minio. Below are heatmaps of iozone's writer and reader reports. Values are in Mb/second.
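For context, plugin-backed volumes are generally set up along these lines; the plugin name and its option keys here are placeholders, not the exact ones used:

# hypothetical plugin name and option keys; these vary by plugin build
docker plugin install example/s3fs-volume-plugin --grant-all-permissions
docker volume create -d example/s3fs-volume-plugin \
    -o url=http://minio:9000 -o bucket=iozone \
    iozone_vol
# run the benchmark container against the plugin-backed volume instead of an in-container mount
docker run -it --rm --entrypoint=/bin/sh \
    -v iozone_vol:/opt/s3fs/iozone_bench \
    efrecon/s3fs:1.94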

There is a substantial drop off in both writing and reading performance compared to the more direct s3fs-fuse tests above.

[image: heatmaps of the iozone writer and reader reports for the s3fs-volume-plugin-backed volume]

As a sanity check, I ran a benchmark using s3fs-fuse directly with the max_thread_count=1 option set to see how that impacted performance. @robertbartel noted above that he observed the plugin using 100% CPU, and my thinking was that this option would hopefully constrain s3fs-fuse in a similar way. However, the iozone benchmarks were no different than with max_thread_count=5 (the default). This is likely a false signal: iozone is probably not creating enough work for s3fs-fuse to become bottlenecked.
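For reference, that sanity check amounts to adding the option to the earlier mount command, roughly like this (not the verbatim command used):

s3fs iozone /opt/s3fs/iozone_bench \
    -o url=http://minio:9000/ \
    -o use_path_request_style \
    -o max_thread_count=1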

robertbartel commented 3 months ago

A few more observations from experiments with respect to job output CSV files totaling 2.3 GB in size:

So we appear to be paying a significant penalty for the plugin, but it is nowhere near the penalty paid for dealing with multiple files. In particular, this suggests the addition of NetCDF output from ngen (see NOAA-OWP/ngen#714 and NOAA-OWP/ngen#744) will be particularly useful.

aaraney commented 3 months ago

I got curious over lunch about how mountpoint-s3 would perform. As stated in their docs, it's intended for large-file, read-heavy workloads. Here is an iozone benchmark of the read performance:

[image: heatmap of the iozone reader report for a mountpoint-s3 mount]

Notice the ~16 Gb/sec at the 1K file size / 1K record size combination. This is nearly double the performance of s3fs-fuse, which maxed out at ~9 Gb/sec reading.
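For anyone wanting to repeat this against the same MinIO container, a mountpoint-s3 mount should look roughly like the following; the --endpoint-url and --force-path-style flags reflect my reading of the mountpoint-s3 options for an S3-compatible endpoint, and this is not the verbatim command used here:

export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
mount-s3 iozone ./iozone_bench \
    --endpoint-url http://minio:9000 \
    --force-path-style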

aaraney commented 3 months ago

1K is roughly the MTU of a TCP packet so that might make sense? The performance really drops off in adjacent file size / record size combinations though.

christophertubbs commented 3 months ago

I just found out that, prior to the availability of parallel file systems, models had to have all their inputs duplicated and copied to the different nodes for MPI to work. As unpalatable as it might be to rely on large-scale data duplication, that might be the safest/most efficient way of running models via DMOD.