FederatedAI / FATE-LLM

Federated Learning for LLMs.
Apache License 2.0

Training a GPT model hangs #75

Closed zapjone closed 2 months ago

zapjone commented 4 months ago

(screenshot) When training a GPT model with FATE-LLM, the job gets stuck at this point and stops making progress. At first I thought it was a resource issue, so I ran it on 2 machines with 1 GPU each, but it still hangs, with no error and no log output. Does anyone know how to fix this?

mgqa34 commented 4 months ago

How did you deploy it? Is the Eggroll configuration correct?

zapjone commented 4 months ago

> How did you deploy it? Is the Eggroll configuration correct?

It is deployed on 2 machines (1 GPU each). Other submitted jobs run successfully; only the job submitted following this tutorial hangs: https://github.com/FederatedAI/FATE-LLM/blob/main/doc/tutorial/parameter_efficient_llm/ChatGLM3-6B_ds.ipynb

Here is the job code I submitted:

```python
import time

from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner
from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader
from peft import LoraConfig, TaskType

# party IDs
guest = '9999'
host = '10000'
arbiter = '10000'

epochs = 1
batch_size = 1
lr = 5e-4

# DeepSpeed config: fp16 + ZeRO stage 2 with optimizer and parameter offload to CPU
ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": lr,
            "torch_adam": True,
            "adam_w_mode": False
        }
    },
    "fp16": {
        "enabled": True
    },
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 1e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": True,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}

pipeline = FateFlowPipeline().set_parties(guest=guest, host=host, arbiter=arbiter)
time.sleep(5)

# read the uploaded dataset on guest and host
reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest, host=host))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="ad"
)
reader_0.hosts[0].task_parameters(
    namespace="experiment",
    name="ad"
)

# LoRA config (defined here but not referenced later in this snippet)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=['query_key_value'],
)
lora_config.target_modules = list(lora_config.target_modules)

pretrained_model_path = "/data/projects/fate/examples/model_dir/chatglm3-6b"

model = LLMModelLoader(
    "pellm.chatglm",
    "ChatGLM",
    pretrained_path=pretrained_model_path,
    pre_seq_len=128,
    trust_remote_code=True
)

tokenizer_params = dict(
    tokenizer_name_or_path=pretrained_model_path,
    trust_remote_code=True,
)

dataset = LLMDatasetLoader(
    "prompt_dataset",
    "PromptDataset",
    **tokenizer_params,
)

data_collator = LLMDataFuncLoader(
    "data_collator.cust_data_collator",
    "get_seq2seq_data_collator",
    **tokenizer_params,
)

# FedAVG seq2seq runner config; DeepSpeed is enabled through training_args
conf = get_config_of_seq2seq_runner(
    algo='fedavg',
    model=model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=Seq2SeqTrainingArguments(
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        remove_unused_columns=False,
        predict_with_generate=False,
        deepspeed=ds_config,
        learning_rate=lr,
        use_cpu=False,
        fp16=True,
    ),
    fed_args=FedAVGArguments(),
    task_type='causal_lm',
    save_trainable_weights_only=True
)

homo_nn_0 = HomoNN(
    'nn_0',
    runner_conf=conf,
    train_data=reader_0.outputs["output_data"],
    runner_module="homo_seq2seq_runner",
    runner_class="Seq2SeqRunner",
)

# run the training task through the deepspeed launcher on guest and host
homo_nn_0.guest.conf.set("launcher_name", "deepspeed")
homo_nn_0.hosts[0].conf.set("launcher_name", "deepspeed")

pipeline.add_tasks([reader_0, homo_nn_0])
pipeline.conf.set("task", dict(engine_run={"cores": 1, "timeout_seconds": 600, "resource_exhausted_strategy": "throw_error"}))

pipeline.compile()
pipeline.fit()
```

mgqa34 commented 4 months ago

What I was asking is which deployment package (link) you used, since the other algorithms may not use DeepSpeed.

zapjone commented 4 months ago

> What I was asking is which deployment package (link) you used, since the other algorithms may not use DeepSpeed.

I built the images myself with FATE-Builder; the code is the latest FATE 2.1 code.

zapjone commented 4 months ago

(screenshot)

robbie228 commented 4 months ago

> What I was asking is which deployment package (link) you used, since the other algorithms may not use DeepSpeed.

> I built the images myself with FATE-Builder; the code is the latest FATE 2.1 code.

When packaging with FATE-Builder, if you need to package fate-llm, you have to set the following variables:

- LLM_DIR=        (the fate-llm source directory)
- PACK_LLM=1      (1 means fate-llm should be packaged)
- LLM_VER=2.0.0   (the fate-llm version number)
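For illustration, a minimal sketch of exporting these variables before a FATE-Builder build, written in Python to match the rest of this thread; the build script name and the LLM_DIR path are placeholders, not the actual FATE-Builder entry point.

```python
# Hypothetical sketch: set the fate-llm packaging variables, then run the build.
# "build.sh" and the LLM_DIR path are placeholders; substitute your FATE-Builder script and paths.
import os
import subprocess

env = os.environ.copy()
env["LLM_DIR"] = "/data/projects/FATE-LLM"  # placeholder path to the fate-llm source tree
env["PACK_LLM"] = "1"                       # 1 means fate-llm should be packaged
env["LLM_VER"] = "2.0.0"                    # fate-llm version number

subprocess.run(["bash", "build.sh"], env=env, check=True)  # placeholder build command
```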

zapjone commented 4 months ago

> What I was asking is which deployment package (link) you used, since the other algorithms may not use DeepSpeed.

> I built the images myself with FATE-Builder; the code is the latest FATE 2.1 code.

> When packaging with FATE-Builder, if you need to package fate-llm, you have to set the following variables: LLM_DIR= (the fate-llm source directory), PACK_LLM=1 (1 means fate-llm should be packaged), LLM_VER=2.0.0 (the fate-llm version number)

Those were set; in the end I used the fateflow-all-gpu and eggroll-all-gpu images.