NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

FusionExecutorCache not synchronizing with multiple devices #3042

Open cowanmeg opened 3 weeks ago

cowanmeg commented 3 weeks ago

When we call cudaDeviceSynchronize() after FusionExecutorCache::runWithInputs(), it does not block until GPU execution has finished on any device other than GPU 0.

Reproduction (requires at least 2 GPUs):

$ git clone https://github.com/cowanmeg/Fuser --recursive
$ cd Fuser
$ git checkout pytorch-tp

Comment out line 841: https://github.com/cowanmeg/Fuser/blob/pytorch-tp/benchmarks/cpp/transformer.cpp#L841

$ python setup.py develop
$ mpirun -np 2 bin/nvfuser_multidevice_bench

Every GPU except GPU 0 reports a much shorter average iteration time than GPU 0, and that reported time is less than the actual GPU execution time.
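To make the effect on the numbers concrete, here is a minimal host-only sketch of why a synchronize call that returns early produces a too-short iteration time. It is a model under stated assumptions, not the benchmark's code: `timeIterationMs` is a hypothetical name, a `std::future` stands in for GPU work queued by an asynchronous launch, and no CUDA or nvFuser API is involved.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical host-only model: a std::future stands in for GPU work queued
// by an asynchronous launch. No CUDA or nvFuser API is used here.
double timeIterationMs(int work_ms, bool wait_for_work) {
  assert(work_ms >= 0);
  using Clock = std::chrono::steady_clock;

  const auto start = Clock::now();

  // "Launch": returns immediately while the work runs in the background.
  std::future<void> work = std::async(std::launch::async, [work_ms] {
    std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
  });

  if (wait_for_work) {
    work.wait();  // models a synchronize call that really blocks
  }

  // If the wait was skipped, the timer stops while the work is still running.
  const auto stop = Clock::now();

  // The future's destructor joins the background task on return, but only
  // after `stop` was taken, so the measured interval is unaffected.
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

With `wait_for_work` set to `true` the measured time covers the full work duration; with `false` it covers only the launch overhead, which matches the pattern seen on the non-zero GPUs.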

The test prints an element of the output to force synchronization (via cudaMemcpyAsync), since cudaDeviceSynchronize returns before the GPU finishes (not visible in the trace): Image

On GPU0, cudaDeviceSynchronize properly waits: Image

cowanmeg commented 5 days ago

Repro: https://github.com/NVIDIA/Fuser/pull/3200

$ mpirun -np 2 bin/nvfuser_multidevice_bench

In benchmarks/cpp/transformer.cpp, remove the prints after the TODO to reproduce the issue.