LightToYang opened this issue 5 years ago
My other question is: could I do the fp16 operations in the main network, and the fp32 operations in the loss function?
The reason is that in the loss class, some other tensors and float-typed tensors are created. Will optimizer.backward(loss) apply .half() to these tensors, which might impact the final results?
# inside the loss forward: these tensors are created fresh on every call
loc_t = torch.Tensor(num, num_priors, 4)
conf_t = torch.LongTensor(num, num_priors)
for idx in range(num):
    truths = targets[idx][:, :-1].data
    labels = targets[idx][:, -1].data
    defaults = priors.data
    match(self.threshold, truths, defaults, self.variance, labels, loc_t, conf_t, idx)
if GPU:
    loc_t = loc_t.cuda()
    conf_t = conf_t.cuda()
# wrap targets
loc_t = Variable(loc_t, requires_grad=False)
conf_t = Variable(conf_t, requires_grad=False)
Does TensorOp mean mixed-precision training?
Yes, you can do your loss function in fp32. I encountered a similar issue recently with a custom loss function I was using. I found that you still need to keep the loss function's input and output as fp16, but the loss computation itself can run in fp32:
orig_type = input.dtype                    # remember the incoming dtype (fp16)
input = input.type(torch.float32)          # upcast inputs to fp32
target = target.type(torch.float32)
loss = <your loss function here>           # compute the loss in fp32
loss = loss.type(orig_type)                # cast the result back to fp16
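For what it's worth, that pattern can be packaged as a small wrapper so any loss runs internally in fp32; fp32_loss_wrapper below is just an illustrative name, not an apex utility, and it assumes a regression-style loss where both input and target are floating-point tensors:
import torch

def fp32_loss_wrapper(loss_fn):
    # Compute loss_fn in fp32 while keeping an fp16-in, fp16-out interface.
    def wrapped(input, target):
        orig_type = input.dtype                        # remember incoming dtype (e.g. torch.float16)
        loss = loss_fn(input.float(), target.float())  # do the actual math in fp32
        return loss.type(orig_type)                    # hand back the original dtype
    return wrapped

# Usage sketch:
# criterion = fp32_loss_wrapper(torch.nn.MSELoss())
# loss = criterion(output_half, target_half)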
One other thing: have you tried loss scaling? I found that I needed this for good fp16 performance.
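To make the loss-scaling suggestion concrete, here is a minimal sketch of static loss scaling done by hand (a CUDA device is assumed, the scale of 128.0 is illustrative rather than tuned, and a real setup would also keep fp32 master weights, which is what FP16_Optimizer does for you):
import torch

device = "cuda"
model = torch.nn.Linear(8, 1).to(device).half()       # toy fp16 model
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_scale = 128.0

x = torch.randn(4, 8, device=device).half()
y = torch.randn(4, 1, device=device)

output = model(x)                                     # forward pass in fp16
loss = criterion(output.float(), y)                   # loss computed in fp32
(loss * loss_scale).backward()                        # scale up so tiny gradients survive fp16
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.div_(loss_scale)                  # unscale before the weight update
optimizer.step()
optimizer.zero_grad()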
I guess TensorOp means Tensor Core math, and I think that is a kind of FP16 arithmetic.
Loss functions often involve operations like log, softmax, and reductions that can be a danger in FP16. It's good practice to take the output of your mixed-precision model and convert it to float before sending it to the loss function. For example:
output = model(input) # this part is run in either half, float, or mixed precision, as desired
output_float = output.float()
loss = loss_fn(output_float)
optimizer.backward(loss) # if using FP16_Optimizer
The output_float = output.float() call (and any other half->float conversions) will be recorded as part of the graph. During loss.backward() or optimizer.backward(loss), these .float() conversions will be reversed, and the backward pass for your mixed-precision model itself will still be run in mixed precision.
Also, when using FP16_Optimizer, you do need to explicitly specify which parts of the network will run in half precision. This means you do need to call model.half() or model = network_to_half(model). After this, your model will expect half-precision input, so you need to call input.half() as well, which in our case is handled by the prefetcher.
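A bare-bones sketch of that explicit setup (assuming a CUDA device; network_to_half from apex.fp16_utils plays a similar role to .half() but keeps BatchNorm layers in fp32):
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 4),
).cuda().half()                                      # parameters are now fp16

input = torch.randn(8, 16, device="cuda").half()     # inputs must be fp16 as well
output = model(input)                                # forward runs in fp16
output_float = output.float()                        # upcast before the loss, as above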
Thank you @mcarilli for the answer. When using Amp rather than fp16_optimizer, should I still use output_float = output.float()? Will Amp automatically transform the output to FP32 when calculating the loss in loss_fn()?
#53
output = model(input) # this part is run in either half, float, or mixed precision, as desired
output_float = output.float()
loss = loss_fn(output_float)
optimizer.backward(loss) # if using FP16_Optimizer
And in main_fp16_optimizer.py, why is output.float() not called?
criterion = nn.CrossEntropyLoss().cuda()
........
# compute output
output = model(input)
loss = criterion(output, target)
When using Amp rather than fp16_optimizer, should I still use output_float = output.float()?
You shouldn't need to. Any potentially unsafe operations should convert the inputs to float for you, so if your loss function is something that's considered unsafe, it's fine for output to be half on its way into that function, because it will be upcast under the hood. The point of FP16_Optimizer is to allow explicit control over what's run in half or float in your model itself. The point of Amp is to cast things automatically.
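For comparison, a minimal Amp sketch (this assumes a CUDA device and the amp.initialize / amp.scale_loss API of more recent apex versions; the amp_handle-style API from the time of this thread is organized slightly differently, but the idea is the same):
import torch
from apex import amp

model = torch.nn.Linear(16, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

criterion = torch.nn.CrossEntropyLoss()
input = torch.randn(8, 16, device="cuda")
target = torch.randint(0, 10, (8,), device="cuda")

output = model(input)                        # Amp decides what runs in fp16
loss = criterion(output, target)             # unsafe ops are upcast automatically
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()                   # loss scaling handled by Amp
optimizer.step()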
And in main_fp16_optimizer.py, why is output.float() not called?
Because I overlooked it :P
I probably should call output.float() there before computing the loss, although it trains fine as-is, so in this case it appears not to be necessary.
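As a quick sanity check (assuming a CUDA device), nn.CrossEntropyLoss can be fed fp16 logits directly on the GPU, consistent with the example training fine as-is; the explicit upcast remains the more conservative option:
import torch

criterion = torch.nn.CrossEntropyLoss()
logits = torch.randn(8, 10, device="cuda", dtype=torch.float16)
target = torch.randint(0, 10, (8,), device="cuda")

loss_half = criterion(logits, target)            # fp16 logits work on CUDA
loss_float = criterion(logits.float(), target)   # explicit upcast, as recommended above
print(loss_half.item(), loss_float.item())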
I have two questions about how to train the network correctly with fp16.
First, in main_fp16_optimizer.py, the input is converted with .half() in data_prefetcher(), and the model is wrapped with model = network_to_half(model). Is the input.half() call still necessary? #58
Second, should we be concerned about the operations in the criterion (loss function), which may be more complicated, such as the loss functions in object detection and segmentation?