AetherCortex / Llama-X

Open Academic Research on Improving LLaMA to SOTA LLM
Apache License 2.0

Learning rate fixed at 0 during training via DeepSpeed #30

Open anmolagarwal999 opened 10 months ago

anmolagarwal999 commented 10 months ago

I followed all the setup instructions given in the README. The command I am using is:

deepspeed train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path Llama-X/data/alpaca_data.json \
    --output_dir ./model_weights_finetuned \
    --num_train_epochs 3 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True

Initially, I got the following error:

ValueError: Found `optimizer` configured in the DeepSpeed config, but no `scheduler`. Please configure a scheduler in the DeepSpeed config.

I downgraded to transformers version 4.29.2, as suggested here.
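
(As an aside: instead of downgrading, the error can apparently also be avoided by declaring a scheduler next to the optimizer in configs/deepspeed_config.json. A minimal sketch, assuming DeepSpeed's WarmupDecayLR is acceptable as a stand-in for the cosine schedule; the values just mirror my command-line arguments, and "auto" lets the HF Trainer integration fill in the total step count:

{
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-5,
      "warmup_num_steps": 2,
      "total_num_steps": "auto"
    }
  }
}

This block would sit alongside the existing "optimizer" entry in the config.)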

Now training runs, but the learning rate is fixed at zero right from the start. Below are the logs:

[2023-08-28 04:36:42,566] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:43,585] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-28 04:36:43,585] [INFO] [runner.py:555:main] cmd = /home/anmol/anaconda3/envs/llamax/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --model_name_or_path meta-llama/Llama-2-7b-hf --data_path /home/anmol/TieredModels/code/08_llamax_approach/Llama-X/data/alpaca_data.json --output_dir ./model_weights_finetuned --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 64 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 2 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed configs/deepspeed_config.json --fp16 True
[2023-08-28 04:36:44,193] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,187] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-08-28 04:36:45,187] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-08-28 04:36:45,187] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-08-28 04:36:45,187] [INFO] [launch.py:163:main] dist_world_size=4
[2023-08-28 04:36:45,187] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-08-28 04:36:45,955] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,977] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,980] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,996] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:47,765] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,765] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,765] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-28 04:36:47,772] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,772] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,773] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,773] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,801] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,801] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:54,794] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.97s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Running tokenizer on train dataset (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52002/52002 [00:02<00:00, 24943.76 examples/s]
52002
Sample 12208 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 4391, 385, 4544, 1813, 411, 263, 28435, 322, 263, 1014, 2813, 292, 13, 13, 2277, 29937, 13291, 29901, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2]}.
Sample 46872 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 29871, 29941, 29899, 29946, 10541, 5828, 1048, 263, 285, 9102, 1058, 27401, 278, 2462, 29889, 13, 13, 2277, 29937, 13291, 29901, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2]}.
Sample 4920 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29892, 3300, 2859, 411, 385, 1881, 393, 8128, 4340, 3030, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 740, 297, 5132, 304, 7252, 1023, 6031, 29889, 13, 13, 2277, 29937, 10567, 29901, 13, 1576, 1023, 6031, 526, 525, 11548, 29915, 322, 525, 272, 927, 4286, 13, 13, 2277, 29937, 13291, 29901, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2]}.
[2023-08-28 04:37:06,368] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,370] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,376] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,377] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/TH -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/THC -isystem /home/anmol/anaconda3/envs/llamax/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -march=native -fopenmp -D__AVX256__ -D__DISABLE_CUDA__ -c /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.43460202217102 seconds
Time to load cpu_adam op: 16.424538373947144 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.438047647476196 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.528525590896606 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.06}

Does anyone have any idea what I might be doing wrong?

apt-team-018 commented 9 months ago

I am getting the same issue.

azhx commented 8 months ago

I recently discovered that LLaMA 1 was pretrained using fp16, but the Llama 2 family of models was pretrained with bf16. The README in this repo has fp16 set as the default. Switching to bf16 fixed this for me.

Ref. https://github.com/microsoft/DeepSpeed/issues/4017
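
For anyone applying this fix: pass --bf16 True instead of --fp16 True on the launch command (bf16 requires an Ampere-or-newer GPU), and if configs/deepspeed_config.json enables fp16 explicitly, flip that section as well. A minimal sketch of the relevant part of the config, assuming it currently carries an fp16 block:

{
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": false
  }
}

The likely mechanism (see the DeepSpeed issue above) is that Llama 2 overflows in fp16, dynamic loss scaling then skips the optimizer steps, and because the scheduler never advances the logged learning rate stays at 0.0 while the loss is reported as 0.0. bf16 has the same exponent range as fp32, so no loss scaling or skipped steps are needed.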