Training loss becomes NAN during pretraining

Hello, I’m trying to do pretraning on my own data. But after some number of steps (1k or more), the loss becomes NAN. I tried to reduce the learning rate from 6e-4 to 5e-4, 1e-4, 6e-5, but it only seems to delay the problem. My questions are :

Did you encounter the same problem during your work ?
Is gradient clipping implemented ? In my configuration, it’s supposed to be set to 1
Is there a part of the code where I can track the gradient norm to see if it’s vanishing/exploding ?
Can it be a float precision issue ? The precision is 16 in my case

Thank you.

Here is my configuration file :

I try to pre-process my own data. But after a certain number epochs, the loss becomes NAN. I've tried reducing the learning rate from from 6e-4 to 5e-4, 1e-4, 6e-5, but this only seems to delay the problem. problem. My questions are as follows:

CONFIG ├── train │ └── seed: 2222
│ interval: step
│ monitor: test/loss
│ mode: min
│ ema: 0.0
│ test: false
│ debug: false
│ ignore_warnings: false
│ state:
│ mode: null
│ n_context: 0
│ n_context_eval: 0
│ ckpt: null
│ disable_dataset: false
│ validate_at_start: false
│ pretrained_model_path: null
│ pretrained_model_strict_load: true
│ pretrained_model_state_hook:
│ name: null
│ post_init_hook:
│ name: null
│ layer_decay:
│ name: null
│ decay: 0.7
│ gpu_mem: 49
│ global_batch_size: 256
│
├── tolerance │ └── logdir: ./resume
│ id: null
│
├── wandb │ └── project: dna
│ group: ''
│ job_type: training
│ mode: online
│ name: my_data_32k_1
│ save_dir: .
│ id: my_data_32k_1
│
├── trainer │ └── target: pytorch_lightning.Trainer
│ devices: 4
│ accelerator: gpu
│ accumulate_grad_batches: 4
│ max_epochs: 400
│ gradient_clip_val: 1.0
│ log_every_n_steps: 10
│ limit_train_batches: 1.0
│ limit_val_batches: 1.0
│ num_nodes: 1
│ precision: 16
│
├── loader │ └── batch_size: 50
│ num_workers: 4
│ pin_memory: true
│ drop_last: true
│
├── dataset │ └── name: my_data
│ fasta_list_file: /ml/hyena-dna/data/my_data/paths.csv
│ dataset_name: my_data
│ tokenizer_name: char
│ cache_dir: null
│ max_length: 32768
│ add_eos: true
│ batch_size: 16
│ batch_size_eval: 32
│ num_workers: 12
│ shuffle: true
│ pin_memory: true
│ max_length_val: 32768
│ max_length_test: 32768
│ pad_max_length: null
│ rc_aug: false
│ use_fixed_len_val: false
│ replace_N_token: false
│ pad_interval: false
│
├── optimizer │ └── name: adamw
│ lr: 0.0006
│ weight_decay: 0.1
│ betas:
│ - 0.9
│ - 0.999
│
├── scheduler │ └── name: cosine_warmup_timm
│ t_in_epochs: false
│ t_initial: 48000
│ lr_min: 5.9999999999999995e-05
│ warmup_lr_init: 1.0e-06
│ warmup_t: 480.0
│
├── callbacks │ └── learning_rate_monitor:
│ logging_interval: step
│ timer:
│ step: true
│ inter_step: false
│ epoch: true
│ val: true
│ params:
│ total: true
│ trainable: true
│ fixed: true
│ model_checkpoint:
│ monitor: test/loss
│ mode: min
│ save_top_k: 1
│ save_last: true
│ dirpath: checkpoints/
│ filename: test/loss
│ auto_insert_metric_name: false
│ verbose: true
│
├── task │ └── name: lm
│ loss: cross_entropy
│ torchmetrics:
│ - perplexity
│ - num_tokens
│
├── encoder │ └── None
├── decoder │ └── None
└── model └── name: lm
d_model: 256
n_layer: 4
d_inner: 1024
vocab_size: 12
resid_dropout: 0.0
embed_dropout: 0.1
fused_mlp: false
fused_dropout_add_ln: false
checkpoint_mixer: true
checkpoint_mlp: true
residual_in_fp32: true
pad_vocab_size_multiple: 8
layer:
name: hyena
emb_dim: 5
filter_order: 64
short_filter_order: 3
l_max: 32770
modulate: true
w: 10
lr: 0.0006
wd: 0.0
lr_pos_emb: 0.0

HazyResearch / hyena-dna

Training loss becomes NAN during pretraining #68