AetherCortex / Llama-X

Open Academic Research on Improving LLaMA to SOTA LLM
Apache License 2.0

Learning rate fixed at 0 during training via DeepSpeed #30

Open anmolagarwal999 opened 10 months ago

anmolagarwal999 commented 10 months ago

I followed all the setup instructions given in the README. The command I am using is:

deepspeed train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path Llama-X/data/alpaca_data.json \
    --output_dir ./model_weights_finetuned \
    --num_train_epochs 3 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True

Initially, I got the following error:

ValueError: Found `optimizer` configured in the DeepSpeed config, but no `scheduler`. Please configure a scheduler in the DeepSpeed config.

I downgraded to transformers version 4.29.2, as suggested here.
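
(As an aside: instead of downgrading, the error can apparently also be avoided by declaring a scheduler next to the optimizer in configs/deepspeed_config.json. A minimal sketch, assuming DeepSpeed's WarmupDecayLR is acceptable as a stand-in for the cosine schedule; the values just mirror my command-line arguments, and "auto" lets the HF Trainer integration fill in the total step count:

{
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-5,
      "warmup_num_steps": 2,
      "total_num_steps": "auto"
    }
  }
}

This block would sit alongside the existing "optimizer" entry in the config.)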

Now training runs, but the learning rate is fixed at zero right from the start. Below are the logs:

[2023-08-28 04:36:42,566] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:43,585] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-28 04:36:43,585] [INFO] [runner.py:555:main] cmd = /home/anmol/anaconda3/envs/llamax/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --model_name_or_path meta-llama/Llama-2-7b-hf --data_path /home/anmol/TieredModels/code/08_llamax_approach/Llama-X/data/alpaca_data.json --output_dir ./model_weights_finetuned --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 64 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 2 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed configs/deepspeed_config.json --fp16 True
[2023-08-28 04:36:44,193] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,187] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-08-28 04:36:45,187] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-08-28 04:36:45,187] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-08-28 04:36:45,187] [INFO] [launch.py:163:main] dist_world_size=4
[2023-08-28 04:36:45,187] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-08-28 04:36:45,955] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,977] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,980] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,996] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:47,765] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,765] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,765] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-28 04:36:47,772] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,772] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,773] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,773] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,801] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,801] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:54,794] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.97s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Running tokenizer on train dataset (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52002/52002 [00:02<00:00, 24943.76 examples/s]
52002
Sample 12208 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 4391, 385, 4544, 1813, 411, 263, 28435, 322, 263, 1014, 2813, 292, 13, 13, 2277, 29937, 13291, 29901, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2]}.
Sample 46872 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 29871, 29941, 29899, 29946, 10541, 5828, 1048, 263, 285, 9102, 1058, 27401, 278, 2462, 29889, 13, 13, 2277, 29937, 13291, 29901, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2]}.
Sample 4920 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29892, 3300, 2859, 411, 385, 1881, 393, 8128, 4340, 3030, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 740, 297, 5132, 304, 7252, 1023, 6031, 29889, 13, 13, 2277, 29937, 10567, 29901, 13, 1576, 1023, 6031, 526, 525, 11548, 29915, 322, 525, 272, 927, 4286, 13, 13, 2277, 29937, 13291, 29901, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2]}.
[2023-08-28 04:37:06,368] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,370] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,376] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,377] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/TH -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/THC -isystem /home/anmol/anaconda3/envs/llamax/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -march=native -fopenmp -D__AVX256__ -D__DISABLE_CUDA__ -c /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.43460202217102 seconds
Time to load cpu_adam op: 16.424538373947144 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.438047647476196 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.528525590896606 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.06}

Does anyone have any idea what I might be doing wrong?

apt-team-018 commented 9 months ago

I am getting the same issue.

azhx commented 8 months ago

I recently discovered that LLaMA 1 was pretrained using fp16, but the Llama 2 family of models was pretrained with bf16. The README in this repo has fp16 set as the default. Switching to bf16 fixed this for me.

Ref. https://github.com/microsoft/DeepSpeed/issues/4017
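
For anyone applying this fix: pass --bf16 True instead of --fp16 True on the launch command (bf16 requires an Ampere-or-newer GPU), and if configs/deepspeed_config.json enables fp16 explicitly, flip that section as well. A minimal sketch of the relevant part of the config, assuming it currently carries an fp16 block:

{
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": false
  }
}

The likely mechanism (see the DeepSpeed issue above) is that Llama 2 overflows in fp16, dynamic loss scaling then skips the optimizer steps, and because the scheduler never advances the logged learning rate stays at 0.0 while the loss is reported as 0.0. bf16 has the same exponent range as fp32, so no loss scaling or skipped steps are needed.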