facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Loss NaN during training #543

Open Jeffkang-94 opened 2 years ago

Jeffkang-94 commented 2 years ago

Instructions To Reproduce the 🐛 Bug:

  1. What changes you made (git diff) or what code you wrote: nothing has been changed in barlow_twins_loss.py.

  2. What exact command you run (FYI, we implemented a custom dataset source called hdf5):

    python tools/run_distributed_engines.py \
    hydra.verbose=True \
    config=study/barlow_twins_cell.yaml \
    config.DATA.TRAIN.DATA_PATHS=["$DATAPATH"] \
    config.DATA.TRAIN.DATA_SOURCES=[hdf5]
  3. What you observed (including full logs):

While computing the Barlow Twins loss, we got NaN at a certain iteration (48025):

[screenshot of the training log showing the NaN loss]


I used the implemented barlow_twins_loss, but we occasionally encountered a NaN, which makes the training procedure collapse. When I deactivate torch.cuda.amp.autocast, the error seems to go away.

Would you mind providing some suggestions to iron out this issue?
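For reference, here is a minimal sketch of that workaround, keeping only the loss computation out of autocast while the rest of the forward pass still runs in mixed precision (this is not VISSL's code; model, view_a, view_b, and criterion are placeholders):

import torch

# Sketch: run the backbone under AMP, but compute the loss in full precision.
# `model`, `view_a`, `view_b`, and `criterion` are placeholders for whatever
# produces the two embeddings and the Barlow Twins loss.
with torch.cuda.amp.autocast():
    z_a = model(view_a)
    z_b = model(view_b)
with torch.cuda.amp.autocast(enabled=False):
    loss = criterion(z_a.float(), z_b.float())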

Jeffkang-94 commented 2 years ago

In particular, I got the NaN in a batch norm layer. Have you ever faced this kind of error? [screenshot of the error]

QuentinDuval commented 2 years ago

Hi @Jeffkang-94,

AMP can create some instabilities, especially in norm layers, so this is not entirely surprising. Disabling AMP will usually fix them, but there are several other ways to deal with this that don't hurt performance as much; the right one depends on the observed symptoms.
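As a minimal illustration of why fp16 norm layers can blow up (an assumption added for this writeup, not taken from the thread): with a variance computed as E[x^2] - (E[x])^2, which is one common formulation, both terms overflow past the fp16 maximum of about 65504 for moderately large activations, and inf - inf is NaN.

import torch

# Moderately large activations already overflow when squared in fp16.
x = torch.full((8,), 300.0, dtype=torch.float16)
mean_of_sq = (x * x).mean()    # 300^2 = 90000 > 65504  ->  inf
sq_of_mean = x.mean() ** 2     # also overflows to inf
var = mean_of_sq - sq_of_mean  # inf - inf  ->  nan
print(mean_of_sq, sq_of_mean, var)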

Could you share a bit more information, for instance your full training configuration and what the loss curve looks like before the NaN appears? With those additional pieces of information, we can narrow down the issue.

Thank you, Quentin

Jeffkang-94 commented 2 years ago
# @package _global_
config:
  VERBOSE: False
  LOG_FREQUENCY: 10
  TEST_ONLY: False
  TEST_MODEL: False
  SEED_VALUE: 0
  MULTI_PROCESSING_METHOD: fork
  HOOKS:
    PERF_STATS:
      MONITOR_PERF_STATS: True
      ROLLING_BTIME_FREQ: 313
    CHECK_NAN: True
  DATA:
    NUM_DATALOADER_WORKERS: 8
    TRAIN:
      DATA_SOURCES: [hdf5]
      DATASET_NAMES: [hdf5-slide]
      BATCHSIZE_PER_REPLICA: 64
      LABEL_TYPE: sample_index    # just an implementation detail. Label isn't used
      TRANSFORMS:
        - name: ImgReplicatePil
          num_times: 2
        - name: RandomResizedCrop
          size: 512
        - name: RandomHorizontalFlip
          p: 0.5
        - name: ImgPilColorDistortion
          strength: 0.5
        - name: ImgPilMultiCropRandomApply
          transforms:
            - name: ImgPilGaussianBlur
              p: 1.0
              radius_min: 0.1
              radius_max: 2.0
          prob: [ 1.0, 0.1 ]
        - name: ImgPilMultiCropRandomApply
          transforms:
            - name: ImgPilRandomSolarize
              p: 1.0
          prob: [ 0.0, 0.2 ]
        - name: ToTensor
      COLLATE_FUNCTION: simclr_collator
      USE_STATEFUL_DISTRIBUTED_SAMPLER: True
      MMAP_MODE: True
      DROP_LAST: True
      PATCH_SIZES: [1024, 4096, 16384]
      INDEX_BY: "imagenet"
  TRAINER:
    TRAIN_STEP_NAME: standard_train_step
  METERS:
    name: ""
  MODEL:
    TRUNK:
      NAME: resnet
      RESNETS:
        DEPTH: 34
    HEAD:
      PARAMS: [
        ["mlp", {"dims": [512, 2048], "use_relu": True, "use_bn": True, "use_bias": False, "skip_last_layer_relu_bn": False}],
        ["mlp", {"dims": [2048, 2048], "use_relu": True, "use_bn": True, "use_bias": False, "skip_last_layer_relu_bn": False}],
        ["mlp", {"dims": [2048, 2048], "use_bias": False}],
      ]
    SYNC_BN_CONFIG:
      CONVERT_BN_TO_SYNC_BN: True
      SYNC_BN_TYPE: pytorch
      GROUP_SIZE: 0 # global sync
    AMP_PARAMS:
      USE_AMP: True
      AMP_TYPE: pytorch
  LOSS:
      name: barlow_twins_loss
      barlow_twins_loss:
        lambda_: 0.0051
        scale_loss: 0.024
        embedding_dim: 2048
  OPTIMIZER:
      name: lars
      weight_decay: 0.000001
      momentum: 0.9
      num_epochs: 1000
      regularize_bn: False
      regularize_bias: False
      param_schedulers:
        lr:
          auto_lr_scaling:
            auto_scale: true
            base_value: 0.5
            base_lr_batch_size: 256
            scaling_type: sqrt
          name: composite
          schedulers:
            - name: linear
              start_value: 0.0
              end_value: 0.5 # Automatically rescaled if needed
            - name: cosine
              start_value: 0.5 # Automatically rescaled if needed
              end_value: 0.002 # Automatically rescaled if needed
          update_interval: step
          interval_scaling: [rescaled, fixed]
          lengths: [0.01, 0.99]             # 1000ep
          # lengths: [0.1, 0.9]
  DISTRIBUTED:
    BACKEND: nccl
    NUM_NODES: 16
    NUM_PROC_PER_NODE: 4
    INIT_METHOD: env
    NCCL_DEBUG: False
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: "."
    AUTO_RESUME: True
    USE_LAST: True
    CHECKPOINT_FREQUENCY: 1
    USE_SYMLINK_CHECKPOINT_FOR_RESUME: False
    CHECKPOINT_ITER_FREQUENCY: -1  # set this variable to checkpoint every few iterations

Thank you for the reply!

  1. Training configuration: providing the train_config.yaml pasted above.

Some configuration names (e.g., PATCH_SIZES, INDEX_BY, or USE_LAST) may be new to you; please ignore them. We just tweaked a few things to be compatible with our codebase, and it doesn't affect the training procedure. We tried to run the experiment with the default settings of the barlow-twins.yaml file that you provide.

  2. Loss graph: the problem (loss becoming NaN) suddenly pops up for some reason. The loss curve was pretty healthy, but after a number of epochs, specifically around epoch 120, the loss turned into NaN.

[screenshot of the loss curve]

Also, after recognizing that the issue happens in a batch norm layer, I found that Apex AMP provides a keep_batchnorm_fp32 option. Do you think this option could be a solution? Reference: https://nvidia.github.io/apex/amp.html#properties [screenshot of the Apex documentation]
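For context, this is roughly how that flag is passed when calling Apex AMP directly (a sketch outside VISSL; the thread does not show how VISSL forwards Apex options):

import torch
import torch.nn as nn
from apex import amp  # requires NVIDIA Apex

# Tiny stand-in model and optimizer, just to make the call concrete.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# With opt_level "O2", keep_batchnorm_fp32=True keeps batch-norm weights and
# statistics in fp32 while the rest of the model is cast to fp16.
model, optimizer = amp.initialize(
    model, optimizer, opt_level="O2", keep_batchnorm_fp32=True
)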

QuentinDuval commented 2 years ago

Hi @Jeffkang-94,

Yes, the loss curve looks pretty good indeed... I was thinking about clipping some gradients, but that will not solve anything here.

One thing you can try is to identify whether a particular image throws the model off. Also: does the NaN always appear at the same iteration, and is it reproducible with the same seed?

Otherwise, I think enabling the option to keep BN in fp32 is definitely a road to take. We actually used tricks like this for LayerNorm in some other places:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Fp32LayerNorm(nn.LayerNorm):
    """LayerNorm whose statistics are always computed in fp32."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # Upcast the input and the affine parameters to fp32 for the normalization.
        output = F.layer_norm(
            input.float(),
            self.normalized_shape,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.eps,
        )
        # Cast back so the rest of the (possibly fp16) network is unaffected.
        return output.type_as(input)
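For batch norm specifically, an analogous wrapper could look like the sketch below (again an illustration in the same spirit, not code from VISSL or this thread):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Fp32BatchNorm2d(nn.BatchNorm2d):
    """Compute batch-norm statistics in fp32 even when the input is fp16."""

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        output = F.batch_norm(
            input.float(),
            self.running_mean,
            self.running_var,
            self.weight.float() if self.weight is not None else None,
            self.bias.float() if self.bias is not None else None,
            self.training or not self.track_running_stats,
            self.momentum if self.momentum is not None else 0.1,
            self.eps,
        )
        # Cast back so downstream fp16 layers are unaffected.
        return output.type_as(input)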

Could you try the AMP option or this kind of trick and tell me if that works better?

A last option some people use is the following: whenever you see a NaN, just skip the backward pass and the model update, and move on to the next batch. But I think this is a last resort (better to invest in the previously mentioned options).
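A minimal sketch of that fallback in a generic AMP training loop (not VISSL's train step; loader, model, criterion, and optimizer are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler()
for view_a, view_b in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(view_a), model(view_b))
    if not torch.isfinite(loss):
        continue  # skip backward and the optimizer step for this batch
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()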

Thank you, Quentin

Jeffkang-94 commented 2 years ago
  1. Yes, it happened at the same iteration number.
  2. It seems so, yes, since we use the same seed value.

The input images generally seem to be okay. Moreover, as you already mentioned, gradient clipping cannot iron out the issue.

I will try applying the fp32 norm layer to make sure the values do not vanish. Thank you for sharing your hack; I will get back to you.

Thank you, Jeff kang

Jeffkang-94 commented 2 years ago

FYI, throughout our studies we found that setting exclude_bias_and_norm: True for the LARS optimizer prevents the loss from becoming NaN.
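For reference, a sketch of where that flag would sit in the optimizer block of the config above (assuming the LARS wrapper accepts it as reported; the surrounding values are unchanged):

  OPTIMIZER:
      name: lars
      exclude_bias_and_norm: True   # per the finding above: leave bias/norm params out of the LARS adaptation
      weight_decay: 0.000001
      momentum: 0.9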