I get the error "tried to get lr value before scheduler/optimizer started stepping, returning lr=0" #831

daegonYu commented 3 months ago

Any help would be greatly appreciated. This error appears when running unified_finetune. Why do I get the error "tried to get lr value before scheduler/optimizer started stepping, returning lr=0"?

CUTLASS can be used starting from CUDA 11.4. Could this error be happening because I am using CUDA 11.2?

Below is the ds_config.json

    {
        "fp16": {
            "enabled": "auto",
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "initial_scale_power": 12,
            "hysteresis": 2,
            "min_loss_scale": 1
        },

        "bf16": {
            "enabled": "auto"
        },

        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
                "weight_decay": "auto"
            }
        },

        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto",
                "total_num_steps": "auto"
            }
        },

        "zero_optimization": {
            "stage": 0
        },

        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 100,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
    }

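For reference, a minimal sketch of how the `fp16` block above can be read, assuming the meanings given in the DeepSpeed configuration documentation: `loss_scale: 0` selects dynamic loss scaling, and the starting scale is `2 ** initial_scale_power`. The helper name `describe_loss_scaling` is hypothetical and is not part of DeepSpeed.

```python
import json

# fp16 block copied from the ds_config.json posted above.
FP16_BLOCK = json.loads("""
{
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
}
""")


def describe_loss_scaling(fp16: dict) -> dict:
    """Summarize the loss-scaling settings (hypothetical helper, not a DeepSpeed API)."""
    # Assumption from the DeepSpeed docs: loss_scale == 0 means dynamic scaling.
    dynamic = fp16.get("loss_scale", 0) == 0
    if dynamic:
        # Dynamic scaling starts at 2 ** initial_scale_power.
        initial_scale = 2 ** fp16["initial_scale_power"]
    else:
        # A non-zero loss_scale is used as a fixed (static) scale.
        initial_scale = fp16["loss_scale"]
    return {
        "dynamic": dynamic,
        "initial_scale": initial_scale,
        "min_scale": fp16.get("min_loss_scale"),
    }


summary = describe_loss_scaling(FP16_BLOCK)
print(summary)
```

With the values posted here, the run would start with dynamic loss scaling at a scale of 4096 and a floor of 1.
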
Below is the log

2024-05-29 17:18:26.323510: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2024-05-29 17:18:28,187] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (3.0.0+45fff310c8), only 1.0.0 is known to be compatible
[2024-05-29 17:18:28,706] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-29 17:18:28,721] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
05/29/2024 17:18:28 - INFO - main - Training/evaluation parameters RetrieverTrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'gradient_accumulation_kwargs': None}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, colbert_dim=-1, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=/home/NLP/sentence_similarity/FlagEmbedding/examples/finetune/ds_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, enable_sub_batch=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_steps=None, evaluation_strategy=no, fix_encoder=False, fix_position_embedding=False, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=/home/NLP/sentence_similarity/saved_models/unified_finetune/runs/May29_17-18-28_Brian3090, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=5, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, negatives_cross_device=True, no_cuda=False, normlized=True, num_train_epochs=1.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=/home/NLP/sentence_similarity/saved_models/unified_finetune, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=128, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=/home/NLP/sentence_similarity/saved_models/unified_finetune, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=5000, save_strategy=steps, save_total_limit=None, seed=42, self_distill_start_step=-1, sentence_pooling_method=cls, skip_memory_metrics=True, split_batches=None, temperature=0.05, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, unified_finetuning=True, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, use_self_distill=True, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.01, )
05/29/2024 17:18:28 - INFO - main - Model parameters ModelArguments(model_name_or_path='monologg/kobigbird-bert-base', config_name=None, tokenizer_name=None, cache_dir=None)
05/29/2024 17:18:28 - INFO - main - Data parameters DataArguments(knowledge_distillation=False, train_data=['/home/NLP/sentence_similarity/FlagEmbedding/data'], cache_path='/home/.cache', train_group_size=1, query_max_len=50, passage_max_len=512, max_example_num_per_dataset=None, query_instruction_for_retrieval=None, passage_instruction_for_retrieval=None, same_task_within_batch=True, shuffle_ratio=0.002, small_threshold=0, drop_threshold=0)
05/29/2024 17:18:28 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
05/29/2024 17:18:29 - INFO - main - Config: BigBirdConfig { "_name_or_path": "monologg/kobigbird-bert-base", "architectures": [ "BigBirdForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "attention_type": "block_sparse", "block_size": 64, "bos_token_id": 5, "classifier_dropout": null, "eos_token_id": 6, "gradient_checkpointing": false, "hidden_act": "gelu_new", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "intermediate_size": 3072, "label2id": { "LABEL_0": 0 }, "layer_norm_eps": 1e-12, "max_position_embeddings": 4096, "model_type": "big_bird", "num_attention_heads": 12, "num_hidden_layers": 12, "num_random_blocks": 3, "pad_token_id": 0, "position_embedding_type": "absolute", "rescale_embeddings": false, "sep_token_id": 3, "tokenizer_class": "BertTokenizer", "torch_dtype": "float32", "transformers_version": "4.40.0", "type_vocab_size": 2, "use_bias": true, "use_cache": true, "vocab_size": 32500 }

Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 120989.54it/s]
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 101475.10it/s]
05/29/2024 17:18:29 - INFO - FlagEmbedding.BGE_M3.modeling - The parameters of colbert_linear and sparse linear is new initialize. Make sure the model is loaded for training, not inferencing

Batch Size Dict: ['0-500: 700', '500-1000: 570', '1000-2000: 388', '2000-3000: 288', '3000-4000: 224', '4000-5000: 180', '5000-6000: 157', '6000-7000: 128', '7000-inf: 104']

loading data from /home/brianjang7/home1/NLP/sentence_similarity/FlagEmbedding/data/kowiki_contrastive_learning_data_adjacententailment_neg.jsonl ...
---------------------------Rank 1: refresh data---------------------------
---------------------------Rank 0: refresh data---------------------------
Using /home/brianjang7/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/brianjang7/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/brianjang7/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.08048725128173828 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10111331939697266 seconds
0%| | 0/2950 [00:00<?, ?it/s]Attention type 'block_sparse' is not possible if sequence_length: 50 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
Attention type 'block_sparse' is not possible if sequence_length: 50 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
/home/brianjang7/home1/anaconda3/envs/flag/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/brianjang7/home1/anaconda3/envs/flag/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
0%|▎ | 5/2950 [00:11<1:49:13, 2.23s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 3.0039, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.0}
0%|▌ | 10/2950 [00:22<1:46:08, 2.17s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 3.014, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|▊ | 15/2950 [00:33<1:45:44, 2.16s/it]
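
One possible reading of the log above, not confirmed anywhere in this thread: with fp16 dynamic loss scaling, a step whose gradients overflow is skipped, and a scheduler that has never stepped has no learning rate to report, so the logger falls back to 0. The toy below illustrates that mechanism only. `ToyScheduler`, `report_lr`, and `run` are hypothetical names, no DeepSpeed or Trainer code is called, and hysteresis is ignored.

```python
class ToyScheduler:
    """Stand-in scheduler that has no learning rate until its first step."""

    def __init__(self):
        self.last_lr = None

    def step(self, lr):
        self.last_lr = lr

    def get_last_lr(self):
        assert self.last_lr is not None, "need to call step() first"
        return [self.last_lr]


def report_lr(scheduler):
    """Return the current lr, or 0 if the scheduler has not stepped yet."""
    try:
        return scheduler.get_last_lr()[0]
    except AssertionError:
        return 0


def run(overflow_flags, scale=2.0 ** 12, base_lr=2e-5, min_scale=1.0):
    """Simulate training steps; True in overflow_flags marks an fp16 overflow."""
    scheduler = ToyScheduler()
    reported = []
    for overflow in overflow_flags:
        if overflow:
            # Overflow: skip the optimizer/scheduler step and halve the scale.
            scale = max(scale / 2, min_scale)
        else:
            # Clean step: the scheduler advances and now has an lr to report.
            scheduler.step(base_lr)
        reported.append(report_lr(scheduler))
    return reported, scale


reported, final_scale = run([True, True, False])
print(reported, final_scale)
```

In this toy run the first two steps overflow, so the reported learning rate stays at 0 until the third step succeeds.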

Below is the pip list

staoxiao commented 3 months ago

We haven't encountered this error. You can refer to a related discussion in another repo: https://github.com/LianjiaTech/BELLE/issues/134

daegonYu commented 3 months ago

Thank you. The page you linked suggested removing the fp16=True setting, so I tried that and training now runs normally.
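
A sketch of why dropping `fp16=True` can change the DeepSpeed side as well, assuming the Hugging Face Trainer behaviour of filling `"auto"` entries in ds_config.json from the training arguments. `resolve_fp16_enabled` is a hypothetical helper, not a Trainer or DeepSpeed function.

```python
import copy


def resolve_fp16_enabled(ds_config: dict, fp16_flag: bool) -> dict:
    """Return a copy of ds_config with fp16.enabled="auto" replaced by fp16_flag.

    Hypothetical helper: it only mimics, under stated assumptions, how an
    "auto" value could be filled from the fp16 training argument.
    """
    resolved = copy.deepcopy(ds_config)
    fp16_block = resolved.get("fp16", {})
    if fp16_block.get("enabled") == "auto":
        fp16_block["enabled"] = fp16_flag
    return resolved


# Only the relevant part of the config from this thread.
ds_config = {"fp16": {"enabled": "auto", "loss_scale": 0, "initial_scale_power": 12}}

with_flag = resolve_fp16_enabled(ds_config, fp16_flag=True)
without_flag = resolve_fp16_enabled(ds_config, fp16_flag=False)

print(with_flag["fp16"]["enabled"])     # fp16 block active, loss scaling applies
print(without_flag["fp16"]["enabled"])  # fp16 block inactive, no loss scaling
```

Under that assumption, running without the flag leaves the fp16 block disabled, so no loss scaling is applied and training runs in full precision unless bf16 is enabled.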