FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

I get the error "tried to get lr value before scheduler/optimizer started stepping, returning lr=0" #831

Closed daegonYu closed 3 months ago

daegonYu commented 3 months ago

Any help would be greatly appreciated. This error appears when running unified_finetune. Why do I get the error "tried to get lr value before scheduler/optimizer started stepping, returning lr=0"?

Also, about the CUTLASS warning in the log: CUTLASS is supported starting from CUDA 11.4, so is that warning appearing because I am using CUDA 11.2?

Below is the ds_config.json

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 12,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 0
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
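
(For reference, every "auto" in this file is filled in by the Hugging Face Trainer's DeepSpeed integration from the corresponding TrainingArguments before DeepSpeed is initialized; in particular, fp16.enabled follows the Trainer's fp16 flag. As a rough sketch only, using values inferred from the log below (learning_rate=2e-05, warmup_ratio=0.1, 2950 optimizer steps), the scheduler section would resolve to approximately:

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-05,
            "warmup_num_steps": 295,
            "total_num_steps": 2950
        }
    }

The exact numbers depend on the dataset size and batch size actually used for the run.)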

Below is the log

2024-05-29 17:18:26.323510: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-29 17:18:26.340125: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2024-05-29 17:18:28,187] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-29 17:18:28,204] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (3.0.0+45fff310c8), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (3.0.0+45fff310c8), only 1.0.0 is known to be compatible
[2024-05-29 17:18:28,706] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-29 17:18:28,721] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-29 17:18:28,721] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/29/2024 17:18:28 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
05/29/2024 17:18:28 - INFO - main - Training/evaluation parameters RetrieverTrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'gradient_accumulation_kwargs': None}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, colbert_dim=-1, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=/home/NLP/sentence_similarity/FlagEmbedding/examples/finetune/ds_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, enable_sub_batch=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_steps=None, evaluation_strategy=no, fix_encoder=False, fix_position_embedding=False, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=/home/NLP/sentence_similarity/saved_models/unified_finetune/runs/May29_17-18-28_Brian3090, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=5, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, negatives_cross_device=True, no_cuda=False, normlized=True, num_train_epochs=1.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=/home/NLP/sentence_similarity/saved_models/unified_finetune, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=128, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=/home/NLP/sentence_similarity/saved_models/unified_finetune, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=5000, save_strategy=steps, save_total_limit=None, seed=42, self_distill_start_step=-1, sentence_pooling_method=cls, skip_memory_metrics=True, split_batches=None, temperature=0.05, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, unified_finetuning=True, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, use_self_distill=True, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.01, )
05/29/2024 17:18:28 - INFO - main - Model parameters ModelArguments(model_name_or_path='monologg/kobigbird-bert-base', config_name=None, tokenizer_name=None, cache_dir=None)
05/29/2024 17:18:28 - INFO - main - Data parameters DataArguments(knowledge_distillation=False, train_data=['/home/NLP/sentence_similarity/FlagEmbedding/data'], cache_path='/home/.cache', train_group_size=1, query_max_len=50, passage_max_len=512, max_example_num_per_dataset=None, query_instruction_for_retrieval=None, passage_instruction_for_retrieval=None, same_task_within_batch=True, shuffle_ratio=0.002, small_threshold=0, drop_threshold=0)
05/29/2024 17:18:28 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
05/29/2024 17:18:29 - INFO - main - Config: BigBirdConfig { "_name_or_path": "monologg/kobigbird-bert-base", "architectures": [ "BigBirdForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "attention_type": "block_sparse", "block_size": 64, "bos_token_id": 5, "classifier_dropout": null, "eos_token_id": 6, "gradient_checkpointing": false, "hidden_act": "gelu_new", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "intermediate_size": 3072, "label2id": { "LABEL_0": 0 }, "layer_norm_eps": 1e-12, "max_position_embeddings": 4096, "model_type": "big_bird", "num_attention_heads": 12, "num_hidden_layers": 12, "num_random_blocks": 3, "pad_token_id": 0, "position_embedding_type": "absolute", "rescale_embeddings": false, "sep_token_id": 3, "tokenizer_class": "BertTokenizer", "torch_dtype": "float32", "transformers_version": "4.40.0", "type_vocab_size": 2, "use_bias": true, "use_cache": true, "vocab_size": 32500 }

Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 120989.54it/s] Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 101475.10it/s] 05/29/2024 17:18:29 - INFO - FlagEmbedding.BGE_M3.modeling - The parameters of colbert_linear and sparse linear is new initialize. Make sure the model is loaded for training, not inferencing

Batch Size Dict: ['0-500: 700', '500-1000: 570', '1000-2000: 388', '2000-3000: 288', '3000-4000: 224', '4000-5000: 180', '5000-6000: 157', '6000-7000: 128', '7000-inf: 104']

loading data from /home/brianjang7/home1/NLP/sentence_similarity/FlagEmbedding/data/kowiki_contrastive_learning_data_adjacententailment_neg.jsonl ...
---------------------------Rank 1: refresh data---------------------------
---------------------------Rank 0: refresh data---------------------------
Using /home/brianjang7/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/brianjang7/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/brianjang7/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.08048725128173828 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10111331939697266 seconds
0%| | 0/2950 [00:00<?, ?it/s]
Attention type 'block_sparse' is not possible if sequence_length: 50 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
Attention type 'block_sparse' is not possible if sequence_length: 50 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
/home/brianjang7/home1/anaconda3/envs/flag/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/brianjang7/home1/anaconda3/envs/flag/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
0%|▎ | 5/2950 [00:11<1:49:13, 2.23s/it]
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 3.0039, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.0}
0%|▌ | 10/2950 [00:22<1:46:08, 2.17s/it]
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 3.014, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|▊ | 15/2950 [00:33<1:45:44, 2.16s/it]
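
(A note on reading this log: 'learning_rate': 0 together with 'grad_norm': 0.0 at every logging step is commonly reported as the signature of DeepSpeed's fp16 dynamic loss scaler detecting inf/NaN gradients and skipping the optimizer step; since the optimizer never steps, the WarmupDecayLR scheduler never steps either, and DeepSpeed falls back to returning lr=0 with exactly this message. Under that assumption, one thing sometimes tried before abandoning fp16 is lowering the initial loss scale so fewer early steps are skipped, e.g. a hypothetical variant of the fp16 section, not taken from this thread:

    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 8,
        "hysteresis": 2,
        "min_loss_scale": 1
    }

If the loss scale keeps collapsing regardless, the model is genuinely overflowing in fp16, and disabling fp16, as the reporter eventually did below, is the more robust route.)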

Below is the pip list


absl-py 2.1.0 accelerate 0.29.3 aiohttp 3.9.5 aiosignal 1.3.1 annotated-types 0.6.0 anyio 4.3.0 asttokens 2.4.1 astunparse 1.6.3 async-timeout 4.0.3 attrs 23.2.0 beautifulsoup4 4.12.3 beir 2.0.0 bleach 6.1.0 blis 0.7.11 bokeh 3.4.1 bs4 0.0.2 C-MTEB 1.1.1 cachetools 5.3.3 catalogue 2.0.10 certifi 2024.2.2 charset-normalizer 3.3.2 click 8.1.7 cloudpathlib 0.16.0 cloudpickle 3.0.0 colorcet 3.1.0 coloredlogs 15.0.1 comm 0.2.2 confection 0.1.4 contourpy 1.2.1 cycler 0.12.1 cymem 2.0.8 Cython 3.0.10 dask 2024.5.1 dask-expr 1.1.1 datasets 2.19.0 datashader 0.16.1 datasketch 1.6.4 debugpy 1.8.1 decorator 5.1.1 deepspeed 0.14.2 defusedxml 0.7.1 dill 0.3.8 distro 1.9.0 elasticsearch 7.9.1 eval_type_backport 0.2.0 exceptiongroup 1.2.0 executing 2.0.1 faiss 1.8.0 fastjsonschema 2.19.1 filelock 3.13.4 FlagEmbedding 1.2.10 /home/NLP/sentence_similarity/FlagEmbedding flatbuffers 24.3.25 fonttools 4.51.0 frozenlist 1.4.1 fsspec 2024.3.1 gast 0.4.0 google-auth 2.29.0 google-auth-oauthlib 0.4.6 google-pasta 0.2.0 grpcio 1.62.2 h11 0.14.0 h5py 3.10.0 hjson 3.1.0 holoviews 1.18.3 httpcore 1.0.5 httpx 0.27.0 huggingface-hub 0.22.2 humanfriendly 10.0 idna 3.7 imageio 2.34.1 importlib_metadata 7.1.0 ipykernel 6.29.3 ipython 8.22.2 jedi 0.19.1 Jinja2 3.1.3 joblib 1.4.0 JPype1 1.5.0 jsonlines 4.0.0 jsonschema 4.22.0 jsonschema-specifications 2023.12.1 jupyter_client 8.6.1 jupyter_core 5.7.2 jupyterlab_pygments 0.3.0 keras 2.11.0 kiwisolver 1.4.5 konlpy 0.6.0 kss 2.6.0 langcodes 3.3.0 lazy_loader 0.4 libclang 18.1.1 lightgbm 4.3.0 linkify-it-py 2.0.3 llvmlite 0.42.0 locket 1.0.0 lxml 5.2.1 Markdown 3.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.0 matplotlib-inline 0.1.7 mdit-py-plugins 0.4.1 mdurl 0.1.2 mecab-python 1.0.0 mecab-python3 1.0.9 mistune 3.0.2 mpmath 1.3.0 mteb 1.1.1 multidict 6.0.5 multipledispatch 1.0.0 multiprocess 0.70.16 murmurhash 1.0.10 nbclient 0.10.0 nbconvert 7.16.4 nbformat 5.10.4 nest_asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.8.1 nmslib 2.1.1 numba 0.59.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.19.3 nvidia-nvjitlink-cu12 12.4.127 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 onnxruntime 1.17.3 openai 1.23.2 opt-einsum 3.3.0 packaging 24.0 pandas 2.2.2 pandocfilters 1.5.1 panel 1.4.2 param 2.1.0 parso 0.8.4 partd 1.4.2 pexpect 4.9.0 pickleshare 0.7.5 pillow 10.3.0 pip 24.0 platformdirs 4.2.1 plotly 5.22.0 preshed 3.0.9 prompt-toolkit 3.0.42 protobuf 3.19.6 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 16.0.0 pyarrow-hotfix 0.6 pyasn1 0.6.0 pyasn1_modules 0.4.0 pybind11 2.6.1 pyct 0.5.0 pydantic 2.7.0 pydantic_core 2.18.1 Pygments 2.17.2 pyjnius 1.6.1 pykospacing 0.5 pynndescent 0.5.12 pynvml 11.5.0 pyodbc 5.1.0 pyparsing 3.1.2 pyserini 0.35.0 python-dateutil 2.9.0 pytorch-triton 3.0.0+45fff310c8 pytrec-eval 0.5 pytrec-eval-terrier 0.5.6 pytz 2024.1 pyviz_comms 3.0.2 PyYAML 6.0.1 pyzmq 26.0.2 rank-bm25 0.2.2 referencing 0.35.1 regex 2024.4.16 requests 2.31.0 requests-oauthlib 2.0.0 rich 13.7.1 rpds-py 0.18.1 rsa 4.9 safetensors 0.4.3 scikit-image 0.23.2 scikit-learn 1.4.2 scipy 1.13.0 seaborn 0.13.2 sentence-transformers 2.7.0 sentencepiece 0.2.0 setuptools 68.2.2 six 1.16.0 smart-open 6.4.0 sniffio 1.3.1 soupsieve 2.5 spacy 3.7.4 spacy-legacy 3.0.12 spacy-loggers 1.0.5 srsly 2.4.8 stack-data 
0.6.2 sympy 1.12 tenacity 8.3.0 tensorboard 2.11.2 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorflow 2.11.1 tensorflow-estimator 2.11.0 tensorflow-io-gcs-filesystem 0.36.0 tensorrt 10.0.1 tensorrt-cu12 10.0.1 tensorrt-cu12-bindings 10.0.1 tensorrt-cu12-libs 10.0.1 termcolor 2.4.0 thinc 8.2.3 threadpoolctl 3.4.0 tifffile 2024.5.22 tiktoken 0.6.0 tinycss2 1.3.0 tokenizers 0.19.1 toolz 0.12.1 torch 2.2.2 tornado 6.4 tqdm 4.66.2 traitlets 5.14.3 transformers 4.40.0 typer 0.9.4 typing_extensions 4.11.0 tzdata 2024.1 uc-micro-py 1.0.3 umap-learn 0.5.6 urllib3 2.2.1 wasabi 1.1.2 wcwidth 0.2.13 weasel 0.3.4 webencodings 0.5.1 Werkzeug 3.0.2 wheel 0.41.2 wrapt 1.16.0 xarray 2024.5.0 xxhash 3.4.1 xyzservices 2024.4.0 yarl 1.9.4 zipp 3.17.0

staoxiao commented 3 months ago

We haven't encountered this error. You can refer to the discussion in other repos, e.g. https://github.com/LianjiaTech/BELLE/issues/134

daegonYu commented 3 months ago

Thank you. The page you linked suggested deleting the fp16=True setting, so I tried that and training now works normally.
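
(For anyone landing here later: because both precision blocks in the posted ds_config.json are set to "auto", they simply mirror the Trainer flags, so removing fp16=True makes the run fall back to full fp32. A minimal sketch of how the precision sections effectively resolve after that change, assuming no bf16 flag is passed either:

    "fp16": {
        "enabled": false
    },
    "bf16": {
        "enabled": false
    }

On GPUs with bfloat16 support, passing bf16=True instead is a common alternative: it keeps mixed precision but does not use dynamic loss scaling, so it avoids this class of skipped-step / lr=0 problems.)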