Weigaa opened this issue 2 years ago
Hi @Weigaa,
The error says that the types of the Torch tensor and the DALI output passed to feed_ndarray don't match: DALI returns int64, while torch.zeros creates float32 by default. Your code should do something like:
to_torch_type = {
    types.DALIDataType.FLOAT: torch.float32,
    types.DALIDataType.FLOAT64: torch.float64,
    types.DALIDataType.FLOAT16: torch.float16,
    types.DALIDataType.UINT8: torch.uint8,
    types.DALIDataType.INT8: torch.int8,
    types.DALIDataType.INT16: torch.int16,
    types.DALIDataType.INT32: torch.int32,
    types.DALIDataType.INT64: torch.int64
}

dali_tensor = pipe_out[0][0]
torch_type = to_torch_type[dali_tensor.dtype]
Inputimages = torch.zeros(file[1], dtype=torch_type).to(device)
nvidia.dali.plugin.pytorch.feed_ndarray(dali_tensor, Inputimages)
@JanuszL
Thank you very much.
After adding import nvidia.dali.types as types and your suggested code, my code runs fine.
I noticed that the tensors I stored earlier are torch.cuda.FloatTensor and torch.cuda.LongTensor. In nvidia.dali.plugin.pytorch.feed_ndarray(), can I directly map DALIDataType.INT64 to torch.cuda.LongTensor and DALIDataType.FLOAT to torch.cuda.FloatTensor?
I think the best would be to use:
Inputimages = torch.empty(file[1], dtype=torch_type, device=device)
and keep the type mapping as proposed. According to https://pytorch.org/docs/stable/tensors.html, torch.cuda.LongTensor is a tensor type, while torch.int64 is the dtype expected when allocating an empty tensor.
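For illustration, a minimal sketch of the difference (assuming a CUDA device is available):

import torch

# torch.int64 is a dtype, so it can be passed to the dtype= argument:
a = torch.empty((2, 3), dtype=torch.int64, device='cuda:0')

# torch.cuda.LongTensor is a (legacy) tensor type, i.e. a class you
# construct directly; it is not a valid value for dtype=:
b = torch.cuda.LongTensor(2, 3)

print(a.dtype, a.device)  # torch.int64 cuda:0
print(b.dtype, b.device)  # torch.int64 cuda:0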
@JanuszL I used Inputimages = torch.empty(file[1], dtype=torch_type, device=device) to replace Inputimages = torch.zeros(file[1], dtype=torch_type).to(device).
Unfortunately, when I then used DALI and torch.load() to read the feature map from the SSD to the GPU and timed both with time.time(), I found that DALI with GDS was much slower than torch.load(), which seems abnormal. What could be the reason for this?
The torch.save && torch.load code is:
import os
import time
import uuid
import inspect
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torch import cuda
from torch.utils.data import DataLoader
# from gpu_mem_track import MemTracker

device = torch.device('cuda:0')
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=torchvision.transforms.ToTensor(), download=True)
train_data_size = len(train_data)
# test_data_size = len(test_data)
print("train_data_size:{}".format(train_data_size))
train_dataloader = DataLoader(train_data, batch_size=2560)

'''hook'''
class SelfDeletingTempFile():
    def __init__(self):
        self.name = os.path.join("./", str(uuid.uuid4()))

    def __del__(self):
        os.remove(self.name)

def pack_hook(tensor):
    # Offload the saved tensor to disk during the forward pass.
    temp_file = SelfDeletingTempFile()
    begin = time.time()
    torch.save(tensor, temp_file.name)
    end = time.time()
    print(tensor.shape, "offload time is", end - begin)
    return temp_file

def unpack_hook(temp_file):
    # Load the tensor back from disk during the backward pass.
    begin = time.time()
    tensor = torch.load(temp_file.name)
    end = time.time()
    print(tensor.shape, "load time is", end - begin)
    return tensor

""" Network architecture. """
class mymodel(nn.Module):
    def __init__(self):
        super(mymodel, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Flatten(),  # flatten
            nn.Linear(64 * 4 * 4, 64),
            nn.Linear(64, 10),
        )

    def forward(self, x):  # input: 32*32*3
        with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
            x = self.model1(x)
        return x

net1 = mymodel()
net1 = net1.to(device)
loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)
optimizer = torch.optim.SGD(net1.parameters(), lr=0.1)
total_train_step = 0
total_test_step = 0
epoch = 1
print(next(net1.parameters()).device)
total = sum([param.nelement() for param in net1.parameters()])
print("Number of parameter: %.2f M" % (total / 1024 / 1024))
print("Memory of parameter: %.2f M " % (cuda.memory_allocated() / 1024 / 1024))

'''start training'''
totalbegin = time.time()
# gpu_tracker.track()
for i in range(epoch):
    print("--------------The {} training begins----------".format(i + 1))
    running_loss = 0
    running_correct = 0
    begin = time.time()
    j = 0
    for data in train_dataloader:
        # Only run a single batch for timing purposes.
        if (j > 0):
            break
        images, targets = data
        # print(images.device)
        images = images.to(device)
        targets = targets.to(device)
        outputs = net1(images)
        loss = loss_fn(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        running_correct += (outputs.argmax(1) == targets).sum()
        total_train_step += 1
        j += 1
        # if total_train_step % 100 == 0:
        #     print("number of training:{},loss:{}".format(total_train_step, loss))
    end = time.time()
    print("spend time: ", (end - begin) / 60)
    print("epoch:{}, loss:{}, accuracy:{}".format(i + 1, running_loss / train_data_size, running_correct / train_data_size))

totalend = time.time()
print("total real runtime: ", (totalend - totalbegin) / 60)
The DALI code is:
import os
import time
import uuid
import inspect
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torchvision
from torch import cuda
from nvidia.dali import pipeline_def, fn
import nvidia.dali.plugin.pytorch
from torch.utils.data import DataLoader
import nvidia.dali.types as types

device = torch.device('cuda:0')
# pynvml.nvmlInit()
# frame = inspect.currentframe()
# gpu_tracker = MemTracker(frame)
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, transform=torchvision.transforms.ToTensor(), download=True)
# test_data = torchvision.datasets.CIFAR10(root='./data', train=False, transform=torchvision.transforms.ToTensor(), download=True)
train_data_size = len(train_data)
# test_data_size = len(test_data)
print("train_data_size:{}".format(train_data_size))
# print("test_data_size:{}".format(test_data_size))
train_dataloader = DataLoader(train_data, batch_size=2560)
# test_dataloader = DataLoader(test_data, batch_size=128)

'''hook'''
class SelfDeletingTempFile():
    def __init__(self):
        self.name = os.path.join("", str(uuid.uuid4()))

    def __del__(self):
        freefilename = self.name + '.npy'
        os.remove(freefilename)
        # print("successfully free " + freefilename)

@pipeline_def(batch_size=1, num_threads=8, device_id=0)
def pipe_gds(filename):
    # GPU numpy reader: reads the .npy file straight into GPU memory (GDS).
    data = fn.readers.numpy(device='gpu', file_root='.', files=filename)
    return data

def pack_hook(tensor):
    # Offload the tensor to the SSD as a .npy file during the forward pass.
    temp_file = SelfDeletingTempFile()
    begin = time.time()
    tensorshape = tensor.shape
    # print("save tensor type is", tensor.type(), "shape is", tensorshape)
    Inputnumpy = tensor.cpu().numpy()
    np.save(temp_file.name, Inputnumpy)
    file = [temp_file, tensorshape]
    end = time.time()
    print(tensor.shape, "offload time is", end - begin)
    return file

def unpack_hook(file):
    # Fetch the tensor back to the GPU with DALI during the backward pass.
    begin = time.time()
    begin2 = time.time()
    p = pipe_gds(filename=(file[0].name + '.npy'))
    p.build()
    pipe_out = p.run()
    end2 = time.time()
    to_torch_type = {
        types.DALIDataType.FLOAT: torch.float32,
        types.DALIDataType.FLOAT64: torch.float64,
        types.DALIDataType.FLOAT16: torch.float16,
        types.DALIDataType.UINT8: torch.uint8,
        types.DALIDataType.INT8: torch.int8,
        types.DALIDataType.INT16: torch.int16,
        types.DALIDataType.INT32: torch.int32,
        types.DALIDataType.INT64: torch.int64
    }
    dali_tensor = pipe_out[0][0]
    # print(dali_tensor.dtype)
    torch_type = to_torch_type[dali_tensor.dtype]
    Inputimages = torch.empty(file[1], dtype=torch_type, device=device)
    nvidia.dali.plugin.pytorch.feed_ndarray(dali_tensor, Inputimages)
    end = time.time()
    print(Inputimages.shape, "pure load time is", end2 - begin2)
    print(Inputimages.shape, "load time is", end - begin)
    # print("inputimages type is", Inputimages.shape)
    return Inputimages

""" Network architecture. """
class mymodel(nn.Module):
    def __init__(self):
        super(mymodel, self).__init__()
        self.model1 = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Flatten(),  # flatten
            nn.Linear(64 * 4 * 4, 64),
            nn.Linear(64, 10),
        )

    def forward(self, x):  # input: 32*32*3
        with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
            x = self.model1(x)
        return x

net1 = mymodel()
net1 = net1.to(device)
loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)
optimizer = torch.optim.SGD(net1.parameters(), lr=0.1)
total_train_step = 0
total_test_step = 0
epoch = 1
print(next(net1.parameters()).device)
total = sum([param.nelement() for param in net1.parameters()])
print("Number of parameter: %.2f M" % (total / 1024 / 1024))
print("Memory of parameter: %.2f M " % (cuda.memory_allocated() / 1024 / 1024))

'''start training'''
totalbegin = time.time()
# gpu_tracker.track()
for i in range(epoch):
    print("--------------The {} training begins----------".format(i + 1))
    running_loss = 0
    running_correct = 0
    begin = time.time()
    j = 0
    for data in train_dataloader:
        # Only run a single batch for timing purposes.
        if (j > 0):
            break
        images, targets = data
        # print(images.device)
        images = images.to(device)
        targets = targets.to(device)
        outputs = net1(images)
        loss = loss_fn(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        running_correct += (outputs.argmax(1) == targets).sum()
        total_train_step += 1
        j += 1
        # if total_train_step % 100 == 0:
        #     print("number of training:{},loss:{}".format(total_train_step, loss))
    end = time.time()
    print("spend time: ", (end - begin) / 60)
    print("epoch:{}, loss:{}, accuracy:{}".format(i + 1, running_loss / train_data_size, running_correct / train_data_size))

totalend = time.time()
print("total real runtime: ", (totalend - totalbegin) / 60)
# gpu_tracker.track()
print("gpu memory allocated: %.2f M " % (cuda.memory_allocated() / 1024 / 1024))
We noticed that the p.build() call accounts for almost 97% of the DALI load time. Does this mean that DALI has a serious pipeline-construction (start-up) overhead, and is there a way to avoid it?
Hi @Weigaa,
DALI, and the GDS machinery that DALI uses inside the GPU numpy reader, have a noticeable construction-time overhead. DALI is optimized for fast execution once the pipeline is defined, not for frequent pipeline recreation. If your goal is to load files from the drive into GPU memory without any processing, DALI may not be the best choice. You may consider trying out https://github.com/rapidsai/kvikio just for that.
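For reference, a minimal kvikio read sketch (the file name, shape, and dtype here are placeholders, and the sketch assumes the data was written to the file beforehand):

import cupy
import kvikio

# Placeholder shape/dtype; in practice use the shape saved alongside the file.
buf = cupy.empty((64, 32, 16, 16), dtype=cupy.float32)

# CuFile reads from storage directly into GPU memory (using GDS when available).
with kvikio.CuFile("feature_map.bin", "r") as f:
    f.read(buf)  # fills buf in place and returns the number of bytes read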
Hi @JanuszL, we tried using kvikio for data transfer from SSD to GPU, and it seems that kvikio.CuFile() also has a large build overhead in write mode ("w"). In read mode ("r") there is no build overhead, but it seems to have much lower read bandwidth than DALI (even with the DALI pipeline batch size set to 1). After removing the build overhead, my GDS read results from SSD to GPU (3080 Ti) are shown in the table below. Are the results of this experiment accurate?
Block size (MB) | Kvikio_CuFile Bandwidth (MB/s) | DALI Bandwidth (MB/s), pipeline=1 | DALI Bandwidth (MB/s), pipeline=best
-- | -- | -- | --
1568 | 1590.909091 | 7720.33481 | 8371.596369
784 | 1549.407115 | 7619.047619 | 8276.153278
392 | 1584.478577 | 6555.183946 | 8202.552835
196 | 1542.09284 | 4708.143166 | 8112.582781
98 | 1517.027864 | 3110.12377 | 8536.585366
49 | 1585.760518 | 1749.375223 | 8044.65605
24.5 | 1531.25 | 989.4991922 | 8288.227334
12.25 | 1346.153846 | 535.1681957 | 8448.275862
6.125 | 1177.884615 | 510.8423686 | 7862.644416
3.0625 | 995.2876178 | 279.9360146 | 8019.114952
Hi, guys. I tried to combine DALI with the torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook) API to speed up offloading and prefetching of intermediate feature maps to SSD. I convert the PyTorch tensor to numpy for storage on the SSD during forward propagation, use pipe_gds() to fetch it back to the GPU during backward propagation, and then complete the DALI-tensor-to-PyTorch-tensor conversion via nvidia.dali.plugin.pytorch.feed_ndarray(). When processing the feature maps generated by the convolution layers, some errors occur; the error output is as follows.
I'm not sure if this is due to DALI or the PyTorch API; when I use torch.load() directly to read a PyTorch tensor file, no errors occur. Could you give me some suggestions for adjustments?
The reproducible code is as follows:
My GPU is an NVIDIA P100, my PyTorch version is 1.11.0+cu113, and my DALI version is 1.14.