breuner / elbencho

A distributed storage benchmark for file systems, object stores & block devices with support for GPUs
GNU General Public License v3.0

Questions on single file read and write with multiple GPUs #11

Closed YHRen closed 3 years ago

YHRen commented 3 years ago

Dear developers,

Thank you for the tool; I have been enjoying using it. I have the following questions regarding the arguments related to single-file read and write.

  1. Is affinity between CPUs and GPUs guaranteed when using multiple CPU workers and multiple GPUs? For example, --direct -t 96 --gpuids 0,1,2,3,4,5,6,7? In --help-all, it says "GPU IDs will be assigned round robin to different threads".
  2. Does reading a file into multiple GPUs duplicate the whole file on all GPUs? For example, elbencho -r -b 1m -s 1g -t 32 --gpuids 0,1,2,3 --gdsbufreg --direct --cufile /some/gdsremotefile. Will all 4 GPUs hold the same replica of the 1GiB file? I wonder how this is done under the hood. Is all communication done via PCIe, or does each GPU get a different shard (via PCIe) and share the missing shards via NVLink?
  3. If we are writing a 1GiB file using multiple GPUs, will the program first create buffers of "1GiB/num-gpus" or "1GiB" on each GPU? For example, elbencho -w -b 1m -s 1g -t 32 --gpuids 0,1,2,3 --direct --cufile /some/gdsremotefile. How does the communication happen under the hood? Does each GPU contribute a piece of the entire file?
  4. Related to the previous item, if I do the following: elbencho -w -b 1m -t 32 --gpuids 0,1,2,3,4,5,6,7 --direct --size 1g /some/gdsremotefile, I receive the following output:
    OPERATION RESULT TYPE        FIRST DONE  LAST DONE
    ========= ================   ==========  =========
    WRITE     Elapsed ms       :        397        533
              IOPS             :       1896       1920
              Throughput MiB/s :       1896       1920
              Total MiB        :        754       1024

    I wonder why the FIRST DONE column shows only 754 MiB of data transferred?

Thank you so much! I'm looking forward to your reply.

YHRen commented 3 years ago

Here is some more info on the system.

gdscheck -p
 GDS release version (beta): 0.9.1.5
 nvidia_fs version:  2.4 libcufile version: 2.3
 cuFile CONFIGURATION:
 NVMe           : Supported
 NVMeOF         : Unsupported
 SCSI           : Unsupported
 SCALEFLUX CSD  : Unsupported
 LUSTRE         : Supported
 NFS            : Unsupported
 WEKAFS         : Unsupported
 USERSPACE RDMA : Unsupported
 --MOFED peer direct  : enabled
 --rdma library       : Not Loaded (libcufile_rdma.so)
 --rdma devices       : Not configured
 --rdma_device_status : Up: 0 Down: 0
 properties.use_compat_mode : 0
 properties.use_poll_mode : 0
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4096 1048576 16777216
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 fs.generic.posix_unaligned_writes : 0
 fs.lustre.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: 0
 profile.nvtx : 0
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : 0
 GPU INFO:
 GPU index 0 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 1 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 2 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 3 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 4 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 5 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 6 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 GPU index 7 A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS
 IOMMU : disabled
 Platform verification succeeded
breuner commented 3 years ago

Thanks for using elbencho and for the kind words, @YHRen!

1) Affinity: Affinity is guaranteed between threads and GPUs in the sense that the 1st thread only uses the 1st GPU in the list, the 2nd thread only uses the 2nd GPU in the list, the 3rd thread only uses the 3rd GPU, and so on. The network file system in use then has to take care of using the correct network card for each of the GPUs (e.g. by sending the data for the 1st GPU only over the 1st NIC, which is closest to the corresponding GPU, data for the 2nd GPU only over the 2nd NIC, and so on), because this is outside of elbencho's control.
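To make the round-robin mapping concrete, here is a sketch based on the example from your first question (I've added -w, a block size, a file size and the file path from your later examples just to make it a complete command):

# 96 threads, 8 GPUs: thread 0 -> GPU 0, thread 1 -> GPU 1, ..., thread 7 -> GPU 7,
# thread 8 -> GPU 0 again, and so on (thread i always stays on GPU i modulo 8)
elbencho -w --direct --cufile --gdsbufreg -b 1m -s 1g -t 96 --gpuids 0,1,2,3,4,5,6,7 /some/gdsremotefile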

2) Duplication: Data will not be duplicated into multiple GPUs. As described above, each thread uses only a single GPU. So all data that the 1st thread reads will be placed in a buffer in the 1st GPU, all data that the 2nd thread reads will be placed in a buffer in the 2nd GPU and so on. (And vice versa for writes: All data that the 1st thread writes will come from a buffer in the 1st GPU and so on.)

3) Memory allocation size: Each thread will allocate a buffer of the size given by "-b" in GPU memory. In your example it's "-b 1m", so each thread will allocate 1MiB of memory on its assigned GPU. So with -b 1m -s 1g -t 32 --gpuids 0,1,2,3 there will be a total of 32x1MiB buffers on the GPUs. We have 4 GPUs in total, so each GPU will have 32/4=8MiB of buffers allocated by elbencho. Each write operation will reuse the same buffer, so when the 1st thread writes data, it will always pass the same 1MiB buffer pointer on the first GPU to the cuFileWrite function.
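Spelled out for that example (the command from your question 3 with the corrected block size argument; the comments are just the per-GPU view of the numbers above):

# 32 threads / 4 GPUs = 8 threads per GPU -> 8 x 1MiB = 8MiB of GPU buffers allocated by elbencho per GPU;
# each thread keeps reusing its single 1MiB buffer for every cuFileWrite call
elbencho -w -b 1m -s 1g -t 32 --gpuids 0,1,2,3 --direct --cufile /some/gdsremotefile

# optional check while the test runs: per-GPU memory use as seen by the driver
# (this includes CUDA context overhead, so it shows more than just the 8MiB of benchmark buffers)
nvidia-smi --query-gpu=index,memory.used --format=csv -l 1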

4) The "first done" and "last done" columns show the aggregate result at the point in time when the fastest thread finished and when the slowest thread finished its work. Ideally, everything should be fair and thus the difference between fastest and slowest should be very small. However, some systems don't do a good job at being fair across multiple threads, which is why elbencho shows this difference. In your example, we have a total of 32 threads and a single file of 1GiB, so each thread writes exactly 1GiB/32=32MiB of data in 1MiB chunks to the file. The fastest thread finished writing its 32MiB after 397 milliseconds, and at that point in time the other threads had also written much of their data, so it adds up to 754MiB of total written data at that moment. (As a sanity check: 754MiB / 0.397s is roughly 1900MiB/s, which matches the reported "first done" throughput of 1896MiB/s.) However, the other threads had not yet completed their 32MiB of writes at this point. So the "last done" column shows that the final thread finished its 32MiB after 533 milliseconds. The difference of 136 milliseconds between first finisher and last finisher seems negligible here.

I'm leaving this issue open in case you have more questions. If all your questions are answered, please let me know, so that we can close this issue.

YHRen commented 3 years ago

Hi @breuner,

Thank you so much for taking the time to craft such a detailed explanation! I really appreciate it!

My apologies that it took me a while to digest your comments and prepare some materials to further our discussion.

To make sure I fully understand your comments, please allow me to summarize:

In particular, I'm interested in: 1) what buffer size to use; 2) how many workers are sufficient; 3) how the DDN system compares with local NVMe; 4) whether cuFile can outperform regular read and write.

I'm using --gdsbufreg --cufile --direct for the remote DDN and local NVMe (4-way RAID 0) storage settings, and --direct without --cufile for regular local NVMe. Are these settings correct?
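Spelled out, the variants look roughly like this (block size, thread count and paths are placeholders, not my exact test parameters):

# GDS: read straight into registered GPU buffers (remote DDN or local NVMe RAID 0)
elbencho -r --direct --cufile --gdsbufreg -b 1m -t 32 --gpuids 0,1,2,3 /mnt/ddn/testfile
# regular path: direct reads into host memory, no GDS
elbencho -r --direct -b 1m -t 32 /mnt/nvme/testfile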

(The following is likely beyond the scope of elbencho, so please feel free not to answer. I would be deeply grateful if you could shed some light. Thank you in advance.)

I prepared some benchmark results on a DGX A100 system paired with DDN AX400 here.

By looking at the results, I noticed the following:

If you have any comments or suggestions, I would be thrilled to hear them. Thank you so much.

breuner commented 3 years ago

Hi @YHRen, you're welcome!

All is correct as you wrote it in the summary bullet points, including the assumption about no communication between worker threads, because each worker calculates on its own which files or which segments of a file it has to read/write. Just a small remark on the first bullet (but I think you also got this one right): The "communication library" here is logic inside the network file system. So when the file system sees that it should read or write a buffer for GPU1, then it will ideally select the corresponding network interface closest to GPU1 to read/write the data from/to the remote storage servers. This of course only works if the network file system is configured to use all 8 network cards that are close to the GPUs.

Regarding your new questions:

1) What buffer size to use: It depends on the application. E.g. if the application is reading lots of 100KB images (like in the typical ImageNet examples), then that would be the buffer size. If you want to get a general understanding with no specific application in mind, then I would suggest running a test with different buffer sizes. But I see from the link to your results that you already did this.

2) How many workers are sufficient: Same as for (1), and I see you already did this.

3) Difference between the DDN system and local NVMe: The reasonable base assumption would be that the local drives provide lower latency and thus higher IOPS for a smaller number of threads, but that the remote storage has more NVMe drives, so it would be able to read faster at larger scale (at least if there is enough network bandwidth available). But interestingly, you discovered some very interesting effects in your results.

4) cuFile advantage: To have more of an apples to apples comparison, I think you don't want to compare GDS to reads into host memory, because the data needs to be in GPU memory in the end. So you want to compare GDS to reads into GPU memory without GDS, e.g. elbencho --gpuids 0,1 --cufile --gdsbufreg --direct /some/file ... vs elbencho --gpuids 0,1 --cuhostbufreg --direct /some/file ...
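Written out as complete commands, that comparison could look like this (block size, thread count and the file path are placeholders):

# GDS: DMA directly into the registered GPU buffers
elbencho -r --direct --cufile --gdsbufreg -b 1m -t 16 --gpuids 0,1 /some/file
# no GDS, but the data still ends up in GPU memory: read via a pinned (CUDA-registered) host staging buffer
elbencho -r --direct --cuhostbufreg -b 1m -t 16 --gpuids 0,1 /some/file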

And regarding the 4 bullet points at the end...

1) Diff: I have not tested local storage yet, because the people I worked with so far were only interested in external storage for the shared access and the higher capacity. But I can say I'm also surprised by the extreme difference that you're seeing here. On the one hand, I have seen that GDS can be a problem if the data device (in my case so far always the network card, but I assume the same goes for an NVMe device) is not close on the PCI bus to the corresponding GPU. That's why the appropriate selection of the network card by the file system is so relevant. But with local drives, the file system cannot make such a dynamic selection, because it needs to read the data from the NVMe drive on which it happens to be. With 8 GPUs and 4 NVMe drives this automatically means that some drives are closer on the PCI topology to the GPUs than others, so this might explain an imbalance.

2) Last done: Yes, I agree. This is the number that is normally most relevant, because it shows you when your last application thread would finish - assuming each real application thread and each GPU has the same amount of work to do. Some applications might be smarter and do more dynamic load balancing, but that would then also introduce other side effects due to communication and synchronization between the worker threads.

3) & 4) Drop for 96 threads: This is indeed not easy to explain. It's possible that it's just a point where the Linux kernel decides to bounce threads between NUMA domains (because with GDS, all the file system logic is still running on the CPUs). You could try to see if the effect still exists when you make elbencho bind threads to NUMA zones by adding "--zones 0,1,2,3,4,5,6,7,8". With this, elbencho will bind the worker threads round-robin to the given NUMA zones, so that the kernel cannot bounce them around.
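As a sketch for the 96-thread read case (the zone list below is an assumption - it has to match the NUMA nodes that actually exist on the machine, e.g. as shown by numactl --hardware):

# bind the workers round-robin to the listed NUMA zones so the kernel cannot migrate them
elbencho -r --direct --cufile --gdsbufreg -b 1m -t 96 --gpuids 0,1,2,3,4,5,6,7 --zones 0,1,2,3,4,5,6,7 /some/gdsremotefile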

YHRen commented 3 years ago

Dear @breuner

Thank you so much for your reply!

cuFile advantage: To have more of an apples to apples comparison, I think you don't want to compare GDS to reads into host memory, because the data needs to be in GPU memory in the end. So you want to compare GDS to reads into GPU memory without GDS, e.g. elbencho --gpuids 0,1 --cufile --gdsbufreg --direct /some/file ... vs elbencho --gpuids 0,1 --cuhostbufreg --direct /some/file ...

It is correct that I'm only interested in reading into GPU memory and writing from GPU memory to storage. I assume both command line examples are for this purpose. Namely, if I add -r, then elbencho -r --gpuids 0,1 --cufile --gdsbufreg --direct /some/file will read the file into GPUs 0 and 1 using cuFile. Similarly, if I use -w, the command will stage some data on GPUs 0 and 1 and write it to the storage. The second command line example, elbencho --gpuids 0,1 --cuhostbufreg --direct /some/file, if I understand correctly, uses pinned memory (a registered buffer) on the host. --cuhostbufreg is the flag I missed in my test setting. Thank you for pointing it out.

I had been puzzled greatly. My colleague pointed out that the AX400 has a theoretical uni-directional bandwidth of 50GB/sec, but my results only show ~5GiB/sec... Later we found out that the Lustre file system has to be striped. Using gdsio, my colleague is able to reach 25GB/sec (but the files have to be duplicated on different mounts... I cannot reproduce his results; both gdsio and elbencho give me ~10GB/sec). Are you able to reach the theoretical bandwidth on your system?
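In case it helps others, this is roughly how we check and widen the striping with the standard Lustre tools (paths are placeholders; stripe count and size have to be chosen for the actual number of OSTs):

lfs getstripe /mnt/lustre/testfile          # show the current stripe count and stripe size
lfs setstripe -c -1 -S 1M /mnt/lustre/dir   # new files created in dir are striped across all OSTs with a 1MiB stripe size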

Putting more thought into it, dividing a large file into chunks and reading each chunk into a different GPU might be less interesting for our usage. It would be nice to read a set of files in a directory, with each GPU reading a different file (as in neural network training or molecular dynamics output file processing). I wonder if elbencho supports such a scenario. Thank you.

breuner commented 3 years ago

Hi @YHRen,

yes, all correct as you wrote it in the first paragraph. So the 2nd example without "--cufile" would be the traditional way to get data into (or out of) a GPU via a staging buffer in host memory.

I can't run tests right now, but from my records, I can say that I was able to reach the full 40GB/s that the storage system should be able to deliver by reading from a single large file via GDS. One thing that was relevant in the comparison between elbencho and gdsio for this single-file test case (depending on which gdsio parameters you use) is that gdsio opens the single file multiple times. This increased the read throughput in my tests.

This was my elbencho GDS read command as I would normally use it with a single file on a DGX A100:

elbencho -r --direct -b2m --cufile --gdsbufreg  --rand --gpuids "$(echo {0..7})" -t $((32*8)) /mnt/sven/mylargefile

It's using all 8 GPUs and 32 threads per GPU and specifies the filename once, so elbencho opens the file once. This gave me about 17GB/s read throughput.

Now the same command again, but giving the filename 8 times, so that elbencho will open the same file 8 times (and assign the resulting 8 file handles round-robin to the different threads):

elbencho -r --direct -b2m --cufile --gdsbufreg  --rand --gpuids "$(echo {0..7})" -t $((32*8)) $(for i in {1..8}; do echo /mnt/sven/mylargefile; done)

This gave me the full 40GB/s read throughput that the attached storage system was expected to deliver. (Note that I could have just typed out the same file path 8 times, but to keep the command shorter, I used this little for-loop.)

If you want to transfer the same case e.g. to 128 different large files, then that's also possible by just specifying different file names:

elbencho -r --direct -b2m --cufile --gdsbufreg  --rand --gpuids "$(echo {0..7})" -t $((32*8)) /mnt/sven/mylargefile{1..128}

Is this what you meant in your last paragraph? Otherwise you might want to look at elbencho --help-multi.
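As a rough sketch of that multi-file case (directory path, file size and counts are placeholders; see --help-multi for the details of -d, -n and -N):

# create 8x16=128 files of 128MiB each (one subdir per thread), then read them back via GDS,
# so each of the 8 threads (and thus each GPU) reads its own set of files
elbencho -w -d -b 1m -s 128m -t 8 -n 1 -N 16 /mnt/sven/testdir
elbencho -r --direct --cufile --gdsbufreg -b 1m -s 128m -t 8 -n 1 -N 16 --gpuids 0,1,2,3,4,5,6,7 /mnt/sven/testdir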

breuner commented 3 years ago

Hi @YHRen, do you have any more questions at the moment or can I close this topic?

YHRen commented 3 years ago

Hi @breuner

So sorry for my late reply. Please feel free to close the ticket. I will keep you posted on our progress. Thank you so much for your help!

shinytang6 commented 1 year ago

Hi @breuner @YHRen, I really appreciate the discussion here, but I still want to ask a question.

I have seen your discussion here:

In Fig Diff, the gds-nvme has significant difference between first and last done. I'm not sure why. Is this commonly seen?

But with local drives, the file system cannot make such a dynamic selection, because it needs to read the data from the NVMe drive on which it happens to be. With 8 GPUs and 4 NVMe drives this automatically means that some drives are closer on the PCI topology to the GPUs than others, so this might explain an imbalance.

Recently, I have been using elbencho to test GDS with only one GPU and one local NVMe SSD. I ensured that the GPU and the NVMe SSD are on the same NUMA node, but I found that the first done/last done results are significantly different and the throughput is very unstable; the discussion above doesn't seem to explain this phenomenon.

I wanted to ask: have you guys tested GDS with only one GPU, or can you shed some light on this? Thanks in advance.

My command looks like this:

$ elbencho -r -b 128K -t 8 --gpuids 0  --gds --cufiledriveropen --cuhostbufreg /mnt/somefile