Siris-Li opened this issue 11 months ago
I have just read your arXiv'19 paper, "Mixed Precision Training With 8-bit Floating Point".
In Section 4 (Experiments and Results), there is a sentence saying:
"For these convolution networks, the first convolution and the last fully-connected (FC) layers are maintained at a higher precision (16-bit) to maintain the model accuracy."
Is that the reason for the NoneType error?
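For reference, the paper's recipe of keeping the first convolution and the last FC layer at higher precision maps naturally onto the `list_exempt_layers` option mentioned below. Here is a minimal sketch of how one could collect those layer names automatically; `first_conv_and_last_fc` is a hypothetical helper (not part of the emulator's API) and it assumes modules are registered in forward order:

```python
import torch.nn as nn

def first_conv_and_last_fc(model: nn.Module):
    """Return the names of the first Conv2d and the last Linear module,
    e.g. to keep them at higher precision via list_exempt_layers.
    (Hypothetical helper; the emulator's exact API may differ.)"""
    conv_names = [name for name, m in model.named_modules() if isinstance(m, nn.Conv2d)]
    fc_names = [name for name, m in model.named_modules() if isinstance(m, nn.Linear)]
    exempt = []
    if conv_names:
        exempt.append(conv_names[0])   # first convolution layer
    if fc_names:
        exempt.append(fc_names[-1])    # last fully-connected (FC) layer
    return exempt

# For a LeNet-style model with modules named conv1, conv2, fc1, fc2, fc3,
# this would return ["conv1", "fc3"].
```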
I ran into the same issue, and I found that the cause is that the first layer's grad_input is always None:
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple DNN
class SimpleDNN(nn.Module):
    def __init__(self):
        super(SimpleDNN, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Function to check whether the gradient input is None
def check_gradients(module, grad_input, grad_output):
    grad_input_none = [gi is None for gi in grad_input]
    print(f"{module.__class__.__name__} grad_input contains None: {grad_input_none}")

# Initialize the model, loss function, and optimizer
model = SimpleDNN()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Register backward hooks for each layer
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_full_backward_hook(check_gradients)

# Dummy input and target
inputs = torch.randn(5, 10)  # Batch of 5, input size 10
targets = torch.randn(5, 1)  # Batch of 5, target size 1

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)

# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
And the output is:

```
Linear grad_input contains None: [False]
Linear grad_input contains None: [False]
Linear grad_input contains None: [True]
```
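Note that backward hooks fire in reverse order of the forward pass, so the last printed line (the one with `[True]`) corresponds to `fc1`. Its grad_input is None simply because the network input does not require gradients, so autograd never produces a gradient with respect to it. If the emulator's backward hook unconditionally calls something like `torch.zeros_like(grad_input[0])`, that would explain the TypeError reported in this issue. Two possible workarounds, as a sketch only (`check_gradients_safe` is a hypothetical name, and this is not necessarily how the emulator itself should be patched):

```python
# (1) Guard against None entries before using them, e.g. before torch.zeros_like:
def check_gradients_safe(module, grad_input, grad_output):
    for gi in grad_input:
        if gi is None:
            continue                  # no gradient w.r.t. this input
        _ = torch.zeros_like(gi)      # gi is guaranteed to be a Tensor here

# (2) Make the network input require gradients, so autograd produces a
#     gradient w.r.t. it and fc1's grad_input[0] becomes a real Tensor.
#     (Reuses model, criterion, targets from the script above.)
inputs = torch.randn(5, 10, requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
```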
Hello, I integrated your fp8 emulator with my LeNet (2 conv layers, 3 FC layers) training process.

When I set `list_exempt_layers = ["conv1"]`, everything works well. However, when I set `list_exempt_layers = ["fc1"]`, i.e. exempt no conv layers, the code reports the error `TypeError: zeros_like(): argument 'input' (position 1) must be Tensor, not NoneType`. It seems I must include at least one conv layer in `list_exempt_layers` to run correctly.

My environment is Python 3.9, torch=2.1, cuda=12.3.
Code is here: