atomicapple0 opened 3 weeks ago
Hello!
I have not tried this example on an H100 GPU, only on a V100.
The problem with the code snippet, I think, is that there are a lot of torch.cuda.synchronize calls, which add kernel-launch overhead, and my guess is that because of them the kernels never end up executing in parallel. The way I recommend getting timings is to remove the per-iteration torch.cuda.synchronize calls and use the NVIDIA Nsight Systems tool to see what is actually happening on the GPU.
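If you still want wall-clock numbers without a host sync inside the loop, CUDA events time the GPU timeline directly with a single synchronize at the very end. A minimal sketch (the helper name time_with_events is mine, not from the snippet):

```python
import torch

def time_with_events(fn, iters=100, warmup=10):
    """Average GPU time per call of fn, in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn()              # warm up (cuDNN autotuning, allocator caches, ...)
    start.record()
    for _ in range(iters):
        fn()              # no host synchronization inside the timed loop
    end.record()
    end.synchronize()     # single sync, after all work has been queued
    return start.elapsed_time(end) / iters
```

elapsed_time measures between the two recorded events on the GPU timeline, so the launch queue is never drained mid-loop.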
For example, I adapted your snippet to:
import time

import torch

class Conv(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

    def forward(self, x):
        return self.conv(x)

class Bnorm(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bn = torch.nn.BatchNorm2d(64)

    def forward(self, x):
        return self.bn(x)

conv_model = Conv().cuda()
bnorm_model = Bnorm().cuda()
conv_data = torch.rand([64, 3, 224, 224]).cuda().contiguous()
bnorm_data = torch.rand([32, 64, 112, 112]).cuda().contiguous()

streamA = torch.cuda.Stream()
streamB = torch.cuda.Stream()

def my_test(s1, s2):
    torch.cuda.synchronize()
    for i in range(100):
        if i == 10:
            # skip the first 10 warmup iterations, then start timing
            torch.cuda.synchronize()
            start = time.time()
        with torch.cuda.stream(s1):
            output_b1 = bnorm_model(bnorm_data)  # bn on stream s1
        with torch.cuda.stream(s2):
            output_c1 = conv_model(conv_data)    # conv on stream s2
    # single sync at the end, after all 100 iterations have been queued
    torch.cuda.synchronize()
    end = time.time()
    print(f"It took {(end - start) * 1000} ms")

torch.cuda.profiler.cudart().cudaProfilerStart()
my_test(streamA, streamB)
torch.cuda.profiler.cudart().cudaProfilerStop()
and ran it on a V100 GPU (with CUDA 11.6), getting a 1.2-1.3x overall speedup compared to using the same stream for the two kernels. To verify that the kernels do indeed run together, I profiled with the NVIDIA Nsight Systems tool, running:
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o output_nsys --cudabacktrace=true --capture-range=cudaProfilerApi --stop-on-range-end=true -f true -x true python3 test1.py
and then checked the trace, which shows the bnorm and conv kernels overlapping, meaning they actually are scheduled together.
Now, if I try to schedule the two convolution kernels together (using the same script, but with both streams running conv), the trace shows the kernels serialized, likely because a single conv kernel already occupies enough SMs that there is no spare capacity left for the second one to overlap.
So I would recommend using the nsys tool to see what happens. Unfortunately, I do not have access to an H100 GPU to check exactly what happens there.
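One more thing that makes the nsys timeline easier to read: wrapping each model call in an NVTX range labels the corresponding span in the trace, so you can line the kernels up with the Python code that launched them. A sketch (the helper name annotated is hypothetical):

```python
import torch

def annotated(name, model, data, stream):
    # The NVTX range appears as a named span in the Nsight Systems timeline,
    # aligned with the kernels launched inside it
    torch.cuda.nvtx.range_push(name)
    with torch.cuda.stream(stream):
        out = model(data)
    torch.cuda.nvtx.range_pop()
    return out
```

The nvtx track is already enabled by the `-t cuda,nvtx,...` flags in the nsys command above.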
I hope this helped, and please let me know if anything else is needed!
I am trying to reproduce the numbers from the conv/bnorm toy benchmark from the Orion paper. I saw some code provided here but did not see a script to run bnorm and conv in parallel on different streams. I rewrote this benchmark in the following script and reran it on an H100. I got the following results and didn't see any significant speedup from running in parallel. Any advice?