FederatedAI / FATE-LLM

Federated Learning for LLMs.
Apache License 2.0

Running offsite-tuning in a local, non-Docker standalone_fate environment fails with ModuleNotFoundError: No module named 'eggroll' #37

Closed strongman1995 closed 10 months ago

strongman1995 commented 10 months ago

The error is as follows:

Traceback (most recent call last):
  File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/task_controller.py", line 216, in kill_task
    backend_engine.kill(task)
  File "/home/chenlu/workspace/standalone_fate_install_1.11.3_release/fateflow/python/fate_flow/controller/engine_controller/deepspeed.py", line 134, in kill
    from eggroll.deepspeed.submit import client
ModuleNotFoundError: No module named 'eggroll'

Can FATE-LLM run in a standalone environment, or does it require a cluster installation?

mgqa34 commented 10 months ago

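To run in standalone mode, you can follow the gpt2 example and use the single-CUDA-device training mode directly. The DeepSpeed cluster training feature is not available in standalone mode. Your error most likely comes from trying to submit a DeepSpeed-based federated training job.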
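The traceback points at an import of `eggroll` inside the DeepSpeed engine path, so one way to catch this early is to check whether `eggroll` is importable before choosing a training mode. The following is a minimal sketch using only the Python standard library. The helper names `module_available` and `pick_training_mode`, and the mode strings `"deepspeed"` and `"single_gpu"`, are hypothetical and are not part of the FATE or FATE-LLM API.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the import system."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise if a parent package is missing or a
        # module has a broken __spec__; treat both as "not available".
        return False


def pick_training_mode() -> str:
    # Hypothetical helper (not a FATE API): only choose the DeepSpeed
    # path when eggroll, which the traceback shows it imports, is present.
    return "deepspeed" if module_available("eggroll") else "single_gpu"


if __name__ == "__main__":
    print(pick_training_mode())
```

In a standalone install without `eggroll`, this check would select the single-GPU path instead of failing later with `ModuleNotFoundError`.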

strongman1995 commented 10 months ago

@mgqa34 I tried the gpt2 example, and the local test part fails with the error below. Could this be a peft version issue, or something else? I downloaded my gpt2 files from Hugging Face (https://huggingface.co/gpt2/tree/main): [pytorch_model.bin](https://huggingface.co/gpt2/blob/main/pytorch_model.bin), [config.json](https://huggingface.co/gpt2/blob/main/config.json), [tokenizer.json](https://huggingface.co/gpt2/blob/main/tokenizer.json), [vocab.json](https://huggingface.co/gpt2/blob/main/vocab.json)

[Three screenshots of the error; images not available]

strongman1995 commented 10 months ago

I found that this problem occurs when the GPT2 class is loaded twice.