Jeffkang-94 opened this issue 2 years ago
In particular, I got NaN values in a batch norm layer. Have you ever faced the kind of error described below?
Hi @Jeffkang-94,
AMP might create some instabilities, especially in norm layers, so this is not entirely surprising. Disabling AMP will usually fix those, but there are several other ways to deal with this (which don't hinder performance as much), depending on the observed symptoms.
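For reference, in a VISSL-style config disabling AMP comes down to the MODEL.AMP_PARAMS flags (the same keys appear in the full config later in this thread), e.g.:

MODEL:
  AMP_PARAMS:
    USE_AMP: False   # False disables mixed precision entirely
    AMP_TYPE: pytorch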
Could you describe the symptoms you observe and share the configuration you used (the train_config.yaml in the folder where you trained Barlow Twins)? With those additional pieces of information, we can narrow down the issue.
Thank you, Quentin
# @package _global_
config:
  VERBOSE: False
  LOG_FREQUENCY: 10
  TEST_ONLY: False
  TEST_MODEL: False
  SEED_VALUE: 0
  MULTI_PROCESSING_METHOD: fork
  HOOKS:
    PERF_STATS:
      MONITOR_PERF_STATS: True
      ROLLING_BTIME_FREQ: 313
    CHECK_NAN: True
  DATA:
    NUM_DATALOADER_WORKERS: 8
    TRAIN:
      DATA_SOURCES: [hdf5]
      DATASET_NAMES: [hdf5-slide]
      BATCHSIZE_PER_REPLICA: 64
      LABEL_TYPE: sample_index  # just an implementation detail. Label isn't used
      TRANSFORMS:
        - name: ImgReplicatePil
          num_times: 2
        - name: RandomResizedCrop
          size: 512
        - name: RandomHorizontalFlip
          p: 0.5
        - name: ImgPilColorDistortion
          strength: 0.5
        - name: ImgPilMultiCropRandomApply
          transforms:
            - name: ImgPilGaussianBlur
              p: 1.0
              radius_min: 0.1
              radius_max: 2.0
          prob: [1.0, 0.1]
        - name: ImgPilMultiCropRandomApply
          transforms:
            - name: ImgPilRandomSolarize
              p: 1.0
          prob: [0.0, 0.2]
        - name: ToTensor
      COLLATE_FUNCTION: simclr_collator
      USE_STATEFUL_DISTRIBUTED_SAMPLER: True
      MMAP_MODE: True
      DROP_LAST: True
      PATCH_SIZES: [1024, 4096, 16384]
      INDEX_BY: "imagenet"
  TRAINER:
    TRAIN_STEP_NAME: standard_train_step
  METERS:
    name: ""
  MODEL:
    TRUNK:
      NAME: resnet
      RESNETS:
        DEPTH: 34
    HEAD:
      PARAMS: [
        ["mlp", {"dims": [512, 2048], "use_relu": True, "use_bn": True, "use_bias": False, "skip_last_layer_relu_bn": False}],
        ["mlp", {"dims": [2048, 2048], "use_relu": True, "use_bn": True, "use_bias": False, "skip_last_layer_relu_bn": False}],
        ["mlp", {"dims": [2048, 2048], "use_bias": False}],
      ]
    SYNC_BN_CONFIG:
      CONVERT_BN_TO_SYNC_BN: True
      SYNC_BN_TYPE: pytorch
      GROUP_SIZE: 0  # global sync
    AMP_PARAMS:
      USE_AMP: True
      AMP_TYPE: pytorch
  LOSS:
    name: barlow_twins_loss
    barlow_twins_loss:
      lambda_: 0.0051
      scale_loss: 0.024
      embedding_dim: 2048
  OPTIMIZER:
    name: lars
    weight_decay: 0.000001
    momentum: 0.9
    num_epochs: 1000
    regularize_bn: False
    regularize_bias: False
    param_schedulers:
      lr:
        auto_lr_scaling:
          auto_scale: true
          base_value: 0.5
          base_lr_batch_size: 256
          scaling_type: sqrt
        name: composite
        schedulers:
          - name: linear
            start_value: 0.0
            end_value: 0.5  # Automatically rescaled if needed
          - name: cosine
            start_value: 0.5  # Automatically rescaled if needed
            end_value: 0.002  # Automatically rescaled if needed
        update_interval: step
        interval_scaling: [rescaled, fixed]
        lengths: [0.01, 0.99]  # 1000ep
        # lengths: [0.1, 0.9]
  DISTRIBUTED:
    BACKEND: nccl
    NUM_NODES: 16
    NUM_PROC_PER_NODE: 4
    INIT_METHOD: env
    NCCL_DEBUG: False
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: "."
    AUTO_RESUME: True
    USE_LAST: True
    CHECKPOINT_FREQUENCY: 1
    USE_SYMLINK_CHECKPOINT_FOR_RESUME: False
    CHECKPOINT_ITER_FREQUENCY: -1  # set this variable to checkpoint every few iterations
Thank you for the reply!
The train_config.yaml we used is posted above. Some configuration names (e.g., PATCH_SIZES, INDEX_BY, or USE_LAST) may be new to you, but please ignore them; we just tweaked a few things to be compatible with our codebase, and they don't affect the training procedure.
We also tried to run the experiment with the default settings of the barlow-twins.yaml file that you provided, and the NaN problem still happened. In addition, after recognizing that the issue happens in a batch norm layer, I found that apex amp provides a keep_batchnorm_fp32 option. Do you think this option could be a solution?
Reference link: https://nvidia.github.io/apex/amp.html#properties
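For reference, here is a standalone sketch of how that apex option is used when calling apex directly (the model and optimizer below are stand-ins; the posted config currently uses AMP_TYPE: pytorch, so using this flag would mean switching to apex AMP):

import torch
import torchvision
from apex import amp

model = torchvision.models.resnet34().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9)

# keep_batchnorm_fp32=True keeps batch-norm layers in fp32 under mixed
# precision; it is only accepted for the "O2"/"O3" opt levels.
model, optimizer = amp.initialize(
    model, optimizer, opt_level="O2", keep_batchnorm_fp32=True
)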
Hi @Jeffkang-94,
Yes, the loss indeed looks pretty good... I was thinking about clipping some gradients, but that will not solve anything here.
One thing you can try is to check whether a particular image throws the model off, for instance with a quick scan like the sketch below.
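A minimal sketch of such a scan (the dataset access pattern here is an assumption; adapt it to how your hdf5 dataset returns samples):

import torch

def find_bad_samples(dataset):
    # Assumes dataset[idx] returns (image_tensor, ...); flags NaN/Inf inputs.
    bad = []
    for idx in range(len(dataset)):
        x = dataset[idx][0]
        if not torch.isfinite(x).all():
            bad.append(idx)
    return bad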
Otherwise, I think enabling the option to keep BN in fp32 is definitely a road to take. We actually used tricks like this for LayerNorm in some other places:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fp32LayerNorm(nn.LayerNorm):
    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # Run the normalization in fp32 for numerical stability under AMP,
        # then cast the result back to the input dtype (e.g. fp16).
        output = F.layer_norm(
            input.float(),
            self.normalized_shape,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.eps,
        )
        return output.type_as(input)
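Since the instability here is in batch norm rather than layer norm, the same trick can be sketched for BatchNorm2d (this wrapper is not from VISSL, just an illustration of the same idea; it reuses the torch/nn imports above):

class Fp32BatchNorm2d(nn.BatchNorm2d):
    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # Force the batch-norm computation to run in fp32, then cast back.
        output = super().forward(input.float())
        return output.type_as(input)

You would then need to swap this class in for the BN layers of the trunk/head (or convert the modules after model construction).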
Could you try the AMP option or this kind of trick and tell me if that works better?
A last option some people use is the following: whenever you see a NaN, just skip the backward pass and the model update, and move on to the next batch. But I think this is a last resort (better to invest in the previously mentioned options); a sketch is below.
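A minimal sketch of that fallback, assuming a standard torch.cuda.amp training loop (model, criterion, loader, optimizer, and scaler are placeholders, not the VISSL train step):

import torch

def train_one_epoch(model, criterion, loader, optimizer, scaler):
    for x1, x2 in loader:
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = criterion(model(x1), model(x2))
        if not torch.isfinite(loss):
            # NaN/Inf loss: skip the backward pass and optimizer step, move on.
            continue
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()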
Thank you, Quentin
The input images generally seem to be okay. Moreover, as you already mentioned, gradient clipping cannot iron out the issue.
I will try to apply the fp32 norm layer to make sure the values do not vanish. Thank you for sharing your hack; I will get back to you.
Thank you, Jeff Kang
FYI, throughout our studies we found that setting exclude_bias_and_norm: True in the LARS optimizer prevents the loss from becoming NaN.
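For context, the idea behind that flag is to exclude 1-D parameters (biases and the affine parameters of norm layers) from weight decay and from the LARS trust-ratio adaptation. A minimal sketch of such a filter (the function name and its usage are illustrative, not VISSL's API):

import torch

def exclude_bias_and_norm(p: torch.Tensor) -> bool:
    # Biases and norm-layer weights are 1-D tensors; return True to skip
    # weight decay and LARS adaptation for them.
    return p.ndim == 1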
Instructions To Reproduce the 🐛 Bug:
1. What changes you made (git diff) or what code you wrote: nothing has been changed in barlow_twins_loss.py.
2. What exact command you run: FYI, we implemented a custom dataset called hdf5.
3. What you observed (including full logs): while computing the barlow_twins loss, we faced NaN at a certain iteration (48025).
I used the implemented barlow_twins loss, but we occasionally encountered a NaN issue, making the training procedure collapse. When I deactivate torch.cuda.amp.autocast, the error seems to be solved. Would you mind providing some suggestions to iron out this issue?
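As a middle ground between fully disabling AMP and the fixes above, the loss can be computed in fp32 outside the autocast region while the backbone forward pass stays in mixed precision. A minimal sketch (the names below are placeholders, not the actual VISSL training step):

import torch

def forward_loss_fp32(model, criterion, x1, x2):
    # Backbone/head forward pass under autocast (mixed precision).
    with torch.cuda.amp.autocast():
        z1, z2 = model(x1), model(x2)
    # Barlow Twins cross-correlation loss in fp32 to avoid fp16 overflow/underflow.
    with torch.cuda.amp.autocast(enabled=False):
        return criterion(z1.float(), z2.float())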