NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

FP16 about input and loss? #87

Open LightToYang opened 5 years ago

LightToYang commented 5 years ago

I have two questions about how to train a network correctly with fp16.

First, in main_fp16_optimizer.py, the input is converted with .half() in data_prefetcher(), and the model is converted with model = network_to_half(model). Is the input.half() call actually necessary? #58

train_dataset = datasets.ImageFolder(
        traindir,
        transforms.Compose([
            transforms.RandomResizedCrop(crop_size),
            transforms.RandomHorizontalFlip(),
            # transforms.ToTensor(), Too slow
            # normalize,
        ]))
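
(For context, a simplified sketch of the prefetcher-style conversion the first question refers to; this is an assumed approximation, not the actual data_prefetcher from main_fp16_optimizer.py. ToTensor/normalize are skipped in the Dataset, and the batch is cast, normalized, and optionally halved on the GPU instead.)

import torch

# Assumed helper, not copied from the example: ImageNet mean/std scaled to the
# 0-255 range because normalization happens after the uint8 batch is moved to
# the GPU rather than in the Dataset transforms.
mean = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1) * 255
std = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1) * 255

def prepare_batch(batch, fp16=True):
    batch = batch.cuda(non_blocking=True)
    if fp16:
        batch = batch.half()                            # the input.half() in question
        return batch.sub_(mean.half()).div_(std.half()) # normalization done on the GPU
    batch = batch.float()
    return batch.sub_(mean).div_(std)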

Second, should we be concerned about the operations inside the criterion (loss function), which may be more complicated, for example the loss functions used in object detection and segmentation?

if args.fp16:
    optimizer.backward(loss)
LightToYang commented 5 years ago

The other question is: could I do the fp16 operations in the main network, and do the fp32 operations in the loss function?

The reason is that the loss class creates some additional tensors (including FloatTensor data). Will optimizer.backward(loss) apply .half() to these tensors, which may impact the final results?

        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :-1].data
            labels = targets[idx][:, -1].data
            defaults = priors.data
            match(self.threshold, truths, defaults, self.variance, labels, loc_t, conf_t, idx)
        if GPU:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)
LightToYang commented 5 years ago

[screenshot] Does TensorOp mean mixed-precision training?

yaysummeriscoming commented 5 years ago

Yes, you can do your loss function in fp32. I encountered a similar issue recently with a custom loss function I was using. I found that you still need to keep the loss function's input & output in fp16, but the loss computation itself can be done in fp32:

orig_type = input.dtype
input = input.type(torch.float32)
target = target.type(torch.float32)

loss = <your loss function here>

loss = loss.type(orig_type)
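
To make that pattern self-contained, here is one way to wrap it up (the mse_loss choice is just an illustration, not from the original post): the loss math runs in fp32 and the result is handed back in the caller's original dtype.

import torch
import torch.nn.functional as F

def fp32_loss(input, target):
    # Upcast both sides, compute the loss in fp32, then return it in the
    # original (possibly fp16) dtype so the rest of the pipeline is unchanged.
    orig_type = input.dtype
    loss = F.mse_loss(input.float(), target.float())
    return loss.type(orig_type)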
yaysummeriscoming commented 5 years ago

One other thing: have you tried loss scaling? I found that I needed it to get good fp16 results.
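
For reference, loss scaling can be requested through the FP16_Optimizer constructor. A minimal sketch, assuming the legacy apex.fp16_utils API and a CUDA device (the tiny linear model, data, and scale value are placeholders):

import torch
import torch.nn as nn
from apex.fp16_utils import FP16_Optimizer

model = nn.Linear(16, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Static scale shown here; dynamic_loss_scale=True adjusts the scale automatically.
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)

input = torch.randn(4, 16, device="cuda", dtype=torch.float16)
target = torch.randint(0, 10, (4,), device="cuda")
loss = nn.functional.cross_entropy(model(input).float(), target)

optimizer.backward(loss)   # multiplies the loss by the scale before backward
optimizer.step()           # gradients are unscaled before the weight update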

matthew-z commented 5 years ago

I guess TensorOp means Tensor Core math, and I think that is a kind of FP16 math.

mcarilli commented 5 years ago

Loss functions often involve operations like log, softmax, and reductions that can be a danger in FP16. It's good practice to take the output of your mixed-precision model and convert it to float before sending it to the loss function. For example:

output = model(input) # this part is run in either half, float, or mixed precision, as desired
output_float = output.float()
loss = loss_fn(output_float)
optimizer.backward(loss) # if using FP16_Optimizer

The output_float = output.float() call (and any other half->float conversions) will be recorded as part of the graph. During loss.backward() or optimizer.backward(loss), these .float() conversions will be reversed, and the backward pass for your mixed-precision model itself will still be run in mixed precision.
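
A minimal, apex-free sketch of that point (the toy linear layer and squared-error loss are only for illustration): the half-to-float cast is just another node in the autograd graph, so the backward pass flows through it and the half-precision parameters still receive fp16 gradients.

import torch
import torch.nn as nn

model = nn.Linear(8, 4).cuda().half()
x = torch.randn(2, 8, device="cuda", dtype=torch.float16)

output = model(x)                  # forward runs in half precision
output_float = output.float()      # the cast is recorded in the autograd graph
loss = output_float.pow(2).mean()  # loss math happens in float32
loss.backward()                    # backward reverses the cast automatically

print(model.weight.grad.dtype)     # torch.float16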

mcarilli commented 5 years ago

Also, when using FP16_Optimizer, you do need to explicitly specify which parts of the network will run in half precision. This means you do need to call model.half() or model = network_to_half(model). After this, your model will expect half-precision input, so you need to call input.half() as well, which in our case is handled by the prefetcher.
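
A small sketch of that requirement, using plain model.half() for brevity (network_to_half in apex.fp16_utils does the same conversion but keeps batch norm layers in fp32):

import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3).cuda().half()   # parameters are now fp16
# A float32 input would hit a dtype mismatch error, so convert it too;
# this is the conversion the prefetcher's input.half() performs.
input = torch.randn(2, 3, 32, 32, device="cuda").half()
output = model(input)
print(output.dtype)                        # torch.float16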

LightToYang commented 5 years ago

Thank you @mcarilli for the answer. When using Amp rather than FP16_Optimizer, should I still use output_float = output.float()? Will Amp automatically transform the output to FP32 when calculating the loss in loss_fn()? #53

output = model(input) # this part is run in either half, float, or mixed precision, as desired
output_float = output.float()
loss = loss_fn(output_float)
optimizer.backward(loss) # if using FP16_Optimizer

And in main_fp16_optimizer.py, why is output.float() not called?

criterion = nn.CrossEntropyLoss().cuda()
........
# compute output
output = model(input)
loss = criterion(output, target)
mcarilli commented 5 years ago

When using Amp rather than fp16_optimizer, should I still use output_float = output.float()

You shouldn't need to. Any potentially unsafe operations should convert the inputs to float for you, so if your loss function is something that's considered unsafe, it's fine for output to be half on its way into that function, because it will be upcast under the hood. The point of FP16_Optimizer is to allow explicit control over what's run in half or float in your model itself. The point of Amp is to cast things automatically.
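
A rough sketch of the Amp workflow described above, assuming the amp.initialize entry point at opt level O1 (the tiny model, data, and hyperparameters are placeholders): the loss function takes the model output directly, and Amp inserts the casts and handles loss scaling.

import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(16, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss().cuda()

# O1 patches torch functions so Tensor Core friendly ops run in fp16
# and numerically sensitive ops (softmax, log, ...) are upcast to fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

input = torch.randn(4, 16, device="cuda")
target = torch.randint(0, 10, (4,), device="cuda")

output = model(input)               # no manual .half() or .float() needed
loss = criterion(output, target)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()          # loss scaling handled by Amp
optimizer.step()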

And in main_fp16_optimizer.py, why is output.float() not called?

Because I overlooked it :P I probably should call output.float() there before computing the loss, although it trains fine as-is, so in this case it appears not to be necessary.