danfenghong / IEEE_TPAMI_SpectralGPT

Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., Plaza, A., Gamba, P., Benediktsson, J., & Chanussot, J. (2024). SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2024.3362475.

loss is nan :( #16

Open xxxxyliu opened 3 months ago

xxxxyliu commented 3 months ago

Hi, I switched the dataset to fMoW-full with eight channels and wanted to run pre-training again, but I ran into a "loss is nan" issue at the 27th epoch. I also tried initializing training from the pre-trained weights in spectral.pth, but the "loss is nan" problem persists. Did this issue come up during your training as well? If so, how did you resolve it? I would be extremely grateful for an answer. This is my command:

```bash
torchrun --nproc_per_node=2 --master_port=29501 main_pretrain.py \
--wandb spectralgpt_pretrain_stage-full8 \
--batch_size 64 --accum_iter 32 --blr 0.0002 \
--epochs 200 --warmup_epochs 20 --num_workers 16 \
--input_size 96 --patch_size 8 \
--mask_ratio 0.90 \
--model_type tensor \
--model mae_vit_base_patch8_96 \
--dataset_type sentinel \
--train_path /data/DATASET/fMoW-full/train/8_channels_img.csv \
--output_dir /data/LXY/pretrain_fmow_full_spth+ \
--log_dir /data/LXY/pretrain_fmow_full_spth+
```
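For context on why this setup can diverge: in the original MAE codebase (which, judging by the flags, main_pretrain.py appears to follow; this is an assumption, not confirmed by the repo docs), `--blr` is a base learning rate that is scaled by the effective batch size as `lr = blr * eff_batch_size / 256`. A minimal sketch of that rule with the values from the command above:

```python
# Sketch of the MAE-style learning-rate scaling rule (lr = blr * eff_batch_size / 256),
# assuming main_pretrain.py follows the original MAE convention. Values are taken
# from the command above.
batch_size = 64     # --batch_size (per GPU)
accum_iter = 32     # --accum_iter
num_gpus = 2        # --nproc_per_node

eff_batch_size = batch_size * accum_iter * num_gpus  # 64 * 32 * 2 = 4096
blr = 0.0002                                         # --blr
lr = blr * eff_batch_size / 256                      # 0.0002 * 16 = 0.0032

print(f"effective batch size: {eff_batch_size}, actual lr: {lr:.4f}")
```

Under that assumption, this run trains at an actual learning rate of about 3.2e-3, which is on the high side for ViT-Base pretraining and consistent with the fix suggested below.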
moonboy12138 commented 3 months ago

We've encountered a similar issue during pretraining. One simple solution is to decrease the learning rate. If that doesn't resolve the issue, we recommend using patch normalization, which we've found to be effective in recent experiments.
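For readers hitting the same problem: "patch normalization" here most likely refers to the MAE-style normalized-pixel target (the `--norm_pix_loss` option in the original MAE code), where each patch's target pixels are standardized by their own mean and variance before the reconstruction loss is computed, which bounds the regression targets. A minimal sketch under that assumption (function and argument names are illustrative, not the repo's API):

```python
import torch

def forward_loss(target_patches, pred, mask, norm_pix_loss=True, eps=1e-6):
    """MAE-style reconstruction loss with optional per-patch target normalization.

    target_patches: (N, L, D) patchified target pixels
    pred:           (N, L, D) decoder predictions
    mask:           (N, L) with 1 on masked (reconstructed) patches, 0 on visible ones
    """
    if norm_pix_loss:
        # Standardize each target patch by its own mean/variance; this keeps
        # the regression targets in a bounded range and stabilizes training.
        mean = target_patches.mean(dim=-1, keepdim=True)
        var = target_patches.var(dim=-1, keepdim=True)
        target_patches = (target_patches - mean) / (var + eps) ** 0.5

    loss = (pred - target_patches) ** 2
    loss = loss.mean(dim=-1)                 # per-patch MSE, shape (N, L)
    return (loss * mask).sum() / mask.sum()  # average over masked patches only
```

Normalizing per patch also keeps targets in a comparable range across bands, which can matter for multi-spectral inputs whose channels have very different dynamic ranges.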

xxxxyliu commented 3 months ago

> We've encountered a similar issue during pretraining. One simple solution is to decrease the learning rate. If that doesn't resolve the issue, we recommend using patch normalization, which we've found to be effective in recent experiments.

Thank you for your reply. I reduced my learning rate, and that resolved the issue.