TOT_CUDA="0"
CUDAs=(${TOT_CUDA//,/ })
CUDA_NUM=${#CUDAs[@]}
PORT="1234"
DATA_PATH="sample/merge.json"
#DATA_PATH="./sample/merge_sample.json"
OUTPUT_PATH="lora-Vicuna"
MODEL_PATH="/mnt/e/zllama-models/llama-7b-hf"
CUDA_VISIBLE_DEVICES=${TOT_CUDA} torchrun --nproc_per_node=$CUDA_NUM --master_port=$PORT finetune.py \
--data_path $DATA_PATH \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH \
--eval_steps 200 \
--save_steps 200 \
--test_size 10000
@ZenXir Try running with the single-GPU configuration and see whether it still errors; with a single GPU there is no need to use DDP. Alternatively, prepend TORCH_DISTRIBUTED_DEBUG=DETAIL to the python command to get more detailed error output, or try changing ddp_find_unused_parameters=False if ddp else None in TrainingArguments to ddp_find_unused_parameters=True and see whether it runs.
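For reference, a hedged sketch of what that flag change might look like (the ddp detection and the surrounding TrainingArguments here are assumptions about how finetune.py is organized, not a verbatim excerpt):

import os
import transformers

# Assumption: the script detects multi-GPU runs roughly like this.
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1  # on a single GPU there is no DDP, so the flag is irrelevant

training_args = transformers.TrainingArguments(
    output_dir="lora-Vicuna",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,
    # Original: ddp_find_unused_parameters=False if ddp else None.
    # Setting it to True lets DDP tolerate parameters that receive no
    # gradient in a given step, at some performance cost.
    ddp_find_unused_parameters=True if ddp else None,
)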
I added a line to the finetune.sh script: TORCH_DISTRIBUTED_DEBUG=DETAIL
TORCH_DISTRIBUTED_DEBUG=DETAIL
TOT_CUDA="0"
CUDAs=(${TOT_CUDA//,/ })
CUDA_NUM=${#CUDAs[@]}
PORT="1234"
DATA_PATH="sample/merge.json"
#DATA_PATH="./sample/merge_sample.json"
OUTPUT_PATH="lora-Vicuna"
MODEL_PATH="/mnt/e/zllama-models/llama-7b-hf"
CUDA_VISIBLE_DEVICES=${TOT_CUDA} torchrun --nproc_per_node=$CUDA_NUM --master_port=$PORT finetune.py \
--data_path $DATA_PATH \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH \
--eval_steps 200 \
--save_steps 200 \
--test_size 10000
ddp_find_unused_parameters=True,
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/mnt/e/zllama-models/llama-7b-hf
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:23<00:00, 1.43it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-355c1d1e45d5609a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 75.73it/s]
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-355c1d1e45d5609a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-bac0d5c93ada8bc8.arrow and /root/.cache/huggingface/datasets/json/default-355c1d1e45d5609a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-ede35f34d12551ce.arrow
/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
0%| | 0/16029 [00:00<?, ?it/s]Traceback (most recent call last):
File "/mnt/e/Chinese-Vicuna/finetune.py", line 216, in <module>
trainer.train()
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/trainer.py", line 1636, in train
return inner_training_loop(
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/trainer.py", line 1903, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/trainer.py", line 2659, in training_step
self.scaler.scale(loss).backward()
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 127 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
0%| | 0/16029 [00:24<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27375) of binary: /root/anaconda3/envs/Chinese-alpaca-lora/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/Chinese-alpaca-lora/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/distributed/run.py", line 794, in main
run(args)
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/torch-2.0.0-py3.9-linux-x86_64.egg/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-24_15:47:13
host : DESKTOP-6KDJTBC.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 27375)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
target_link_libraries(quantize PRIVATE ggml pthread)
target_link_libraries(chat PRIVATE ggml pthread)
This fixes the pthread_xxx library functions not being found at link time during make. Is this modification OK?
Thank you very much for the reminder; we have updated CMakeLists.txt.
When you have time, could you take a look at finetuning on a single RTX 4090 24G for me? I've tried several times and keep hitting the error above. The corpus is the 663M one downloaded from Baidu Netdisk.
You can run python finetune.py --data_path $DATA_PATH --output_path $OUTPUT_PATH --model_path $MODEL_PATH directly and see whether it reports any errors.
Could you also take a look at the parameters below and explain what they mean? I'm worried my understanding is off. Many thanks.
parser.add_argument("--eval_steps", type=int, default=200)
parser.add_argument("--save_steps", type=int, default=200)
parser.add_argument("--test_size", type=int, default=200)
MICRO_BATCH_SIZE = 4 # this could actually be 5 but i like powers of 2
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 3 # we don't always need 3 tbh
LEARNING_RATE = 3e-4 # the Karpathy constant
CUTOFF_LEN = 256 # 256 accounts for about 96% of the data
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
VAL_SET_SIZE = args.test_size #2000
TARGET_MODULES = [
"q_proj",
"v_proj",
]
def evaluate(
input,
temperature=0.1,
top_p=0.75,
top_k=40,
num_beams=4,
max_new_tokens=128,
repetition_penalty=1.0,
**kwargs,
):
The 8000 in checkpoint-8000 is the number of training steps completed so far. The finetune parameters
MICRO_BATCH_SIZE = 4 # this could actually be 5 but i like powers of 2
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 3 # we don't always need 3 tbh
LEARNING_RATE = 3e-4 # the Karpathy constant
CUTOFF_LEN = 256 # 256 accounts for about 96% of the data
set the batch size, gradient accumulation, number of epochs, learning rate, and the text length used for training.
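Spelled out with the default values above, the three batch settings relate like this:

MICRO_BATCH_SIZE = 4                                          # samples per forward/backward pass
BATCH_SIZE = 128                                              # effective batch size per optimizer step
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 128 // 4 = 32

# Gradients from 32 micro-batches of 4 samples are accumulated before each
# optimizer update, so one update still sees 4 * 32 = 128 samples.
assert MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS == BATCH_SIZE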
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
VAL_SET_SIZE = args.test_size #2000
TARGET_MODULES = [
"q_proj",
"v_proj",
]
are the LoRA-related settings; see here for details.
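As a rough illustration (a sketch, not the exact code in finetune.py), those values typically feed into peft like this:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                   # LORA_R: rank of the low-rank update matrices
    lora_alpha=16,                         # LORA_ALPHA: scaling applied to the LoRA update
    target_modules=["q_proj", "v_proj"],   # TARGET_MODULES: attention projections that get adapters
    lora_dropout=0.05,                     # LORA_DROPOUT: dropout inside the LoRA layers
    bias="none",
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)  # wraps the base LLaMA model; only the adapters are trained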
The generate parameters come from Hugging Face; the documentation is here.
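For example, the evaluate() arguments above map onto Hugging Face's model.generate() roughly as follows (a sketch; the actual generate code may differ):

from transformers import GenerationConfig

generation_config = GenerationConfig(
    temperature=0.1,         # lower values make sampling less random
    top_p=0.75,              # nucleus sampling: keep tokens covering 75% of probability mass
    top_k=40,                # keep only the 40 most likely tokens
    num_beams=4,             # beam search width
    repetition_penalty=1.0,  # values > 1.0 penalize repeated tokens
)
# output = model.generate(
#     input_ids=input_ids,                  # tokenized prompt (hypothetical variable)
#     generation_config=generation_config,
#     max_new_tokens=128,                   # cap on generated length
# )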
In addition, we use a single 3090 with the same environment as yours (python=3.9, torch=2.0.0) and torchrun (i.e., that finetune.sh), and training works fine on both Windows WSL and Linux. Here is our environment setup:
conda create -n py3.9 python=3.9
pip install torch==2.0.0
pip install bitsandbytes datasets accelerate loralib sentencepiece
pip install git+https://github.com/huggingface/transformers.git@main # must!
pip install git+https://github.com/huggingface/peft.git
If torchrun still doesn't run on your side and you need it, you can send us a copy of your environment configuration:
conda env export > py39.yaml
pip freeze > packages.txt
Great, it runs now. Thank you very much!
What kind of GPU setup is needed to finetune on the 663M corpus?