When we call cudaDeviceSynchronize() after FusionExecutorCache::runWithInputs(), it does not block until GPU execution has finished on devices other than GPU 0.
All GPUs except GPU 0 report a relatively short average iteration time compared to GPU 0, shorter than the actual GPU execution time.
The test prints out an element of the output to force synchronization (via cudaMemcpyAsync), since cudaDeviceSynchronize returns before the GPU finishes (not visible in the trace):
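The readback workaround described above can be sketched as follows. This is an illustrative sketch, not the test's actual code; `out_ptr` is a hypothetical device pointer to the first element of the output tensor. It relies on the fact that a device-to-host copy of the result cannot complete until the kernels that produced it have finished:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hedged sketch of the workaround: copying one element of the output back
// to the host blocks the caller until the producing kernels are done on
// that device, regardless of whether cudaDeviceSynchronize() returned early.
void force_sync_by_readback(const float* out_ptr) {
  float host_val = 0.0f;
  // A blocking D2H cudaMemcpy implicitly waits for prior work on the
  // current device that wrote *out_ptr.
  cudaMemcpy(&host_val, out_ptr, sizeof(float), cudaMemcpyDeviceToHost);
  printf("out[0] = %f\n", host_val);
}
```

The test achieves the same effect by printing an element of the output tensor, which triggers an equivalent device-to-host transfer.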
Reproduction (requires at least 2 GPUs):

$ git clone --recursive https://github.com/cowanmeg/Fuser
$ cd Fuser
$ git checkout pytorch-tp

Comment out line 841: https://github.com/cowanmeg/Fuser/blob/pytorch-tp/benchmarks/cpp/transformer.cpp#L841

$ python setup.py develop
$ mpirun -np 2 bin/nvfuser_multidevice_bench
On GPU0, cudaDeviceSynchronize properly waits:
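A likely explanation, offered here as an assumption rather than a confirmed diagnosis: cudaDeviceSynchronize() only waits on the device that is *current* for the calling host thread. If a rank never makes its own GPU current before synchronizing, the call waits on GPU 0 (the default device) and returns without waiting for the rank's actual kernels. A minimal sketch of the corresponding fix, assuming one MPI rank per GPU and a hypothetical `local_rank` variable:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Sketch under the assumption above: make this rank's GPU current before
// synchronizing. Without cudaSetDevice, cudaDeviceSynchronize waits on the
// thread's current device (GPU 0 by default), not the device that ran the
// rank's kernels -- which would match the symptom of only GPU 0 blocking.
void sync_local_device(int local_rank) {
  cudaSetDevice(local_rank);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "sync failed on device %d: %s\n",
            local_rank, cudaGetErrorString(err));
  }
}
```

This would also explain why the explicit readback works: the copy is ordered after the kernels on the stream that produced the output, independent of which device is current.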