[x] support fp16 training, which reduces GPU memory usage by 20-30%.
[x] fp16 training baseline for dino-r50-4scale-12ep: 49.1 AP (with amp) vs. 49.2 AP (w/o amp)
Note
For `MultiScaleDeformableAttention`, we simply cast the input value to `torch.float32` and cast the output back from `torch.float32` to `torch.float16`, which means we skip fp16 and run the `MultiScaleDeformableAttention` operator in fp32.
Did you observe instabilities when using the deformable attention layer with fp16? Is there another reason why the deformable attention layer cannot be used with fp16?
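Below is a minimal sketch of this fp32 fallback, written as a wrapper around an existing deformable-attention module; the class and argument names are illustrative assumptions, not the exact detrex implementation:

```python
import torch
import torch.nn as nn

class FP32DeformableAttention(nn.Module):
    """Illustrative wrapper: run a deformable-attention module in fp32 under amp."""

    def __init__(self, attn: nn.Module):
        super().__init__()
        self.attn = attn  # the underlying MultiScaleDeformableAttention module

    def forward(self, query, value, *args, **kwargs):
        out_dtype = value.dtype  # typically torch.float16 when amp is enabled
        # cast inputs up to fp32 so the deformable-attention kernel computes in full precision
        out = self.attn(query.float(), value.float(), *args, **kwargs)
        # cast the output back to fp16 for the rest of the amp-managed network
        return out.to(out_dtype)
```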
Usage
Start fp16 training with `train.amp.enabled`:
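A minimal example launch command is sketched below; it assumes a detrex-style `tools/train_net.py` launcher and the DINO-R50-4scale-12ep config path, so adjust both to your setup:

```bash
# Sketch: enable mixed-precision training by overriding the lazy-config key
# train.amp.enabled on the command line (paths are illustrative).
python tools/train_net.py \
    --config-file projects/dino/configs/dino_r50_4scale_12ep.py \
    --num-gpus 8 \
    train.amp.enabled=True
```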