FederatedAI / FATE-LLM

Federated Learning for LLMs.
Apache License 2.0

Training error: INTERNAL ASSERT FAILED #126

Open hejxiang opened 1 week ago

hejxiang commented 1 week ago

When training an LLM according to the example, the following error occurred. Qwen1.5-0.5B-Chat and chatglm3-6b both fail with the same error.

Please help me check where the problem is.

Thanks !!!

The system configuration and environment are as follows:

FATE-LLM v2.2.0 cluster, 3 machines. The toy example can be run successfully across multiple devices.

accelerate                               0.27.2
deepspeed                                0.13.3
peft                                     0.8.2
torch                                    2.3.1
transformers                             4.37.2

The error:

[ERROR][2024-11-11 14:16:22,115][403433][_wraps.run][line:92]: {'status': {'code': -1, 'exceptions': 'Traceback (most recent call last):\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 151, in execute_component_from_config\n component.execute(ctx, role, **execution_io.get_kwargs())\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/core/component_desc/_component.py", line 101, in execute\n return self.callback(ctx, role, **kwargs)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/homo_nn.py", line 63, in train\n train_procedure(\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 159, in train_procedure\n runner.train(train_data_, validate_data_, output_dir, saved_model_path)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 272, in train\n trainer.train()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train\n return inner_training_loop(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1691, in _inner_training_loop\n model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare\n result = self._prepare_deepspeed(*args)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1606, in _prepare_deepspeed\n engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize\n engine = DeepSpeedEngine(args=args,\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__\n self._configure_distributed_model(model)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model\n self._broadcast_model()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model\n dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper\n return func(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast\n return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn\n return fn(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast\n return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper\n return func(*args, **kwargs)\n File 
"/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2156, in broadcast\n raise ex\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2152, in broadcast\n work = group.broadcast([tensor], opts)\nRuntimeError: fn INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/init.cpp":169, please report a bug to PyTorch. Not implemented.\n'}, 'io_meta': None}

The code:

import time
from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner
from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader
from peft import LoraConfig, TaskType

guest = '10000'
host = '10000'
arbiter = '10000'

epochs = 1
batch_size = 1
lr = 5e-4

ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": lr,
            "torch_adam": True,
            "adam_w_mode": False
        }
    },
    "fp16": {
        "enabled": True
    },
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 1e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": True,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}
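
# Optional local sanity check (a sketch, not part of the original example):
# validating the config with DeepSpeed's own parser surfaces schema or
# batch-size errors before the pipeline is submitted. This assumes deepspeed
# is importable on the client side; the check is skipped otherwise.
try:
    from deepspeed.runtime.config import DeepSpeedConfig
    DeepSpeedConfig(ds_config)  # raises if the config dict is malformed
except ImportError:
    pass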

pipeline = FateFlowPipeline().set_parties(guest=guest, host=host, arbiter=arbiter)
pipeline.bind_local_path(path="/ws/data/test/fate/FATE-LLM/examples/data/AdvertiseGen/train.json", namespace="experiment", name="ad")
time.sleep(5)

reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest, host=host))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="ad"
)
reader_0.hosts[0].task_parameters(
    namespace="experiment",
    name="ad"
)

# define lora config
# lora_config = LoraConfig(
#     task_type=TaskType.CAUSAL_LM,
#     inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1,
#     target_modules=['query_key_value'],
# )

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1,
    target_modules=['q_proj'],
)

# LoraConfig stores target_modules as a set internally; convert it back to a
# list so the peft config stays serializable when passed via to_dict()
lora_config.target_modules = list(lora_config.target_modules)

# pretrained_model_path = "/ws/data/test/models/chatglm3-6b"

# model = LLMModelLoader(
#     "pellm.chatglm",
#     "ChatGLM",
#     pretrained_path=pretrained_model_path,
#     peft_type="LoraConfig",
#     peft_config=lora_config.to_dict(),
#     trust_remote_code=True
# )

pretrained_model_path = "/ws/data/test/models/Qwen1.5-0.5B-Chat"

model = LLMModelLoader(
    "pellm.qwen",
    "Qwen",
    pretrained_path=pretrained_model_path,
    peft_type="LoraConfig",
    peft_config=lora_config.to_dict(),
    trust_remote_code=True
)

tokenizer_params = dict(
    tokenizer_name_or_path=pretrained_model_path,
    trust_remote_code=True,
)

dataset = LLMDatasetLoader(
    "prompt_dataset",
    "PromptDataset",
    **tokenizer_params,
)

data_collator = LLMDataFuncLoader(
    "data_collator.cust_data_collator",
    "get_seq2seq_data_collator",
    **tokenizer_params,
)

conf = get_config_of_seq2seq_runner(
    algo='fedavg',
    model=model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=Seq2SeqTrainingArguments(
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        remove_unused_columns=False, 
        predict_with_generate=False,
        deepspeed=ds_config,
        learning_rate=lr,
        use_cpu=False,  # this must be set since training will use GPU
        fp16=True,
    ),
    fed_args=FedAVGArguments(),
    task_type='causal_lm',
    save_trainable_weights_only=True # only save trainable weights
)

homo_nn_0 = HomoNN(
    'nn_0',
    runner_conf=conf,
    train_data=reader_0.outputs["output_data"],
    runner_module="homo_seq2seq_runner",
    runner_class="Seq2SeqRunner",
)

homo_nn_0.guest.conf.set("launcher_name", "deepspeed") # tell schedule engine to run task with deepspeed
homo_nn_0.hosts[0].conf.set("launcher_name", "deepspeed") # tell schedule engine to run task with deepspeed

pipeline.add_tasks([reader_0, homo_nn_0])
pipeline.conf.set("task", dict(engine_run={"cores": 1})) # the number of gpus of each party

pipeline.compile()
pipeline.fit()
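
To help narrow this down, here is a minimal diagnostic sketch that is independent of FATE: it brings up a single-process torch.distributed group and issues the same dist.broadcast call that fails in the traceback, which shows whether the active backend implements broadcast at all. The backend choice, the MASTER_ADDR/MASTER_PORT rendezvous values, and the single-process world size are assumptions on my side and may differ from what the FATE/DeepSpeed launcher actually configures.

import os
import torch
import torch.distributed as dist

# Assumed rendezvous settings for a local, single-process group; adjust to
# whatever the FATE launcher actually exports.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, rank=0, world_size=1)
print("backend:", dist.get_backend())

# The traceback fails inside dist.broadcast; reproduce that call directly.
device = "cuda" if torch.cuda.is_available() else "cpu"
t = torch.zeros(1, device=device)
dist.broadcast(t, src=0)
print("broadcast ok on", device)

dist.destroy_process_group()
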
mgqa34 commented 1 week ago

We need time to investigate and resolve this problem. If you need to use deepspeed, please use a previous version, such as fate_llm v2.1.0.

hejxiang commented 1 week ago

Thanks for your reply, but I still hit the same issue after switching to fate_llm v2.1.0. 😅

git checkout tags/v2.1.0
source /ws/data/test/fate/projects/fate/bin/init_env.sh
pip install -r requirements.txt
pip install -e .  
# also tried installing with --use-pep517

The requirements.txt in v2.1.0 does not include torch or transformers.
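
As a cross-check (a sketch on my side, not part of the official install steps), the versions the FATE venv actually resolves can be printed from inside the environment activated by init_env.sh:

import importlib.metadata as md

# Print the installed version of each relevant library from the FATE venv.
for pkg in ("torch", "transformers", "accelerate", "deepspeed", "peft"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")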

The error:

[ERROR][2024-11-14 11:31:02,832][855754][_wraps.run][line:92]: {'status': {'code': -1, 'exceptions': 'Traceback (most recent call last):\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 151, in execute_component_from_config\n component.execute(ctx, role, **execution_io.get_kwargs())\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/core/component_desc/_component.py", line 101, in execute\n return self.callback(ctx, role, **kwargs)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/homo_nn.py", line 63, in train\n train_procedure(\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 159, in train_procedure\n runner.train(train_data_, validate_data_, output_dir, saved_model_path)\n File "/ws/data/test/fate/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 272, in train\n trainer.train()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train\n return inner_training_loop(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1691, in _inner_training_loop\n model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare\n result = self._prepare_deepspeed(*args)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1606, in _prepare_deepspeed\n engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize\n engine = DeepSpeedEngine(args=args,\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__\n self._configure_distributed_model(model)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model\n self._broadcast_model()\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model\n dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper\n return func(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast\n return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn\n return fn(*args, **kwargs)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 205, in broadcast\n return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper\n return func(*args, **kwargs)\n File
"/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2156, in broadcast\n raise ex\n File "/ws/data/test/fate/projects/fate/common/python/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2152, in broadcast\n work = group.broadcast([tensor], opts)\nRuntimeError: fn INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/init.cpp":169, please report a bug to PyTorch. Not implemented.\n'}, 'io_meta': None}


mgqa34 commented 3 days ago


Is the Python version 3.8, and were the deployment packages downloaded from here?

hejxiang commented 2 days ago

I installed the cluster with AnsibleFATE_2.2.0_release_offline.tar.gz; it should be from the link above.

But the Python version is not 3.8, it is 3.10.13. The cluster installation ships its own Python environment, so I used that env directly:

source /ws/data/test/fate/projects/fate/bin/init_env.sh
conda list python

python 3.10.13 h955ad1f_0

mgqa34 commented 2 days ago


As mentioned above, please reinstall v2.1.0, not v2.2.0, to avoid this problem.

hejxiang commented 2 days ago


OK, I will try it again. Thanks.

mgqa34 commented 2 days ago


You're welcome. We'll keep investigating the issue you mentioned above with deepspeed training in v2.2.0.