FederatedAI / FATE

An Industrial Grade Federated Learning Framework
Apache License 2.0

Fedkseed #5566

Closed: zhaoaustin closed this issue 2 months ago

zhaoaustin commented 6 months ago

Hello, in the "submit task to FATE" section of the fedkseed ipynb, I cannot import get_config_of_seq2seq_runner, LLMModelLoader, LLMDatasetLoader, or LLMDataFuncLoader. I cannot find the corresponding code in fate_client.pipeline.components.fate.homo_nn.

import time

from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner
from fate_client.pipeline.components.fate.nn.algo_params import TrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader

mgqa34 commented 6 months ago

Has fate-client been updated to version 2.1?
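You can check the installed version like this (a minimal sketch using only the standard library's importlib.metadata; upgrade via pip if it reports something older than 2.1):

import importlib.metadata

# The imports above fail on fate_client releases older than 2.1,
# so first confirm what is actually installed.
print(importlib.metadata.version("fate_client"))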

zhaoaustin commented 6 months ago

Thanks, updating it fixed the problem above. I've now run into the following two new problems:

  1. The fedkseed results I get on my server don't quite match the loss in the ipynb file, though zeroth-order optimization is indeed worse than Adam. (screenshot)
  2. For the "submit federated task" step in the ipynb, I'm running standalone mode and hit the following error: ValueError: query job is failed, response={'code': 1001, 'message': 'No found job: job_id[202404070456598629540],role[guest],party_id[9999]'}. Is there anything I need to do before submitting? I haven't modified the ipynb code at all. Thanks for your reply!

zhaoaustin commented 6 months ago

The code that throws the error looks like this: (screenshot)

zhaoaustin commented 5 months ago

(screenshot) Hi, after I changed guest from guest = '10000' to guest = '9999', the task does start running, but after a few rounds it produces the same error. Could you give me some guidance on this issue?
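(Note: in standalone mode every role runs under the locally deployed party, so the ids passed to set_parties have to match it. A minimal sketch, assuming the local party id is '9999' as the "party_id[9999]" in the error above suggests:)

from fate_client.pipeline import FateFlowPipeline

# Assumption: the standalone deployment is registered as party '9999'.
party_id = '9999'
guest = host = arbiter = party_id

pipeline = FateFlowPipeline().set_parties(guest=guest, arbiter=arbiter)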

sagewe commented 5 months ago

> Hi, after I changed guest from guest = '10000' to guest = '9999', the task does start running, but after a few rounds it produces the same error. Could you give me some guidance on this issue?

Could you provide the relevant logs?

zhaoaustin commented 5 months ago

Here is the fate_flow_sql.log for gpt2 on the dolly data:

[INFO] [2024-05-02 02:28:49,003] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929002, "f_error_report" = '
Traceback (most recent call last):
  File "/data/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 147, in execute_component_from_config
    component.execute(ctx, role, **execution_io.get_kwargs())
  File "/data/projects/fate/fate/python/fate/components/core/component_desc/_component.py", line 101, in execute
    return self.callback(ctx, role, **kwargs)
  File "/data/projects/fate/fate/python/fate/components/components/homo_nn.py", line 61, in train
    train_procedure(
  File "/data/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 155, in train_procedure
    runner.train(train_data, validate_data, output_dir, saved_model_path)
  File "/data/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 270, in train
    trainer.train()
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 123, in train
    direction_derivative_history = self.train_once(
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 154, in train_once
    trainer.train()
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 761, in torch_call
    batch = pad_without_fast_tokenizer_warning(
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
    padded = tokenizer.pad(*pad_args, **pad_kwargs)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3286, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2734, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
' WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
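(Note: the ValueError at the bottom is the real failure: GPT-2 ships without a pad token, and the data collator needs one to pad batches. The error message itself suggests two fixes; a minimal sketch of both with the plain Hugging Face transformers API, outside of FATE:)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Option 1: reuse the end-of-text token as the pad token
# (no new embedding entry is needed).
tokenizer.pad_token = tokenizer.eos_token

# Option 2: register a brand-new [PAD] token; the model's embedding
# matrix must then be resized to cover the new token id:
# tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# model.resize_token_embeddings(len(tokenizer))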

zhaoaustin commented 5 months ago

Here is my exact code:

import time

from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner
from fate_client.pipeline.components.fate.nn.algo_params import TrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader

guest = '10000'
host = '10000'
arbiter = '10000'

epochs = 0.01
batch_size = 1
lr = 1e-5

pipeline = FateFlowPipeline().set_parties(guest=guest, arbiter=arbiter)
pipeline.bind_local_path(path="/data/projects/fate/examples/data/dolly", namespace="experiment", name="dolly")
time.sleep(5)

reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest, host=host))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="dolly"
)
reader_0.hosts[0].task_parameters(
    namespace="experiment",
    name="dolly"
)

tokenizer_params = dict(
    pretrained_model_name_or_path="gpt2",
    trust_remote_code=True,
)

conf = get_config_of_seq2seq_runner(
    algo='fedkseed',
    model=LLMModelLoader(
        "hf_model",
        "HFAutoModelForCausalLM",
        # pretrained_model_name_or_path="datajuicer/LLaMA-1B-dj-refine-150B",
        pretrained_model_name_or_path="gpt2",
        trust_remote_code=True
    ),
    dataset=LLMDatasetLoader(
        "hf_dataset",
        "Dolly15K",
        split="train",
        tokenizer_params=tokenizer_params,
        tokenizer_apply_params=dict(
            truncation=True,
            max_length=1024,
        )),
    data_collator=LLMDataFuncLoader(
        "cust_func.cust_data_collator",
        "get_seq2seq_tokenizer",
        tokenizer_params=tokenizer_params,
    ),
    training_args=TrainingArguments(
        num_train_epochs=0.01,
        per_device_train_batch_size=batch_size,
        remove_unused_columns=True,
        learning_rate=lr,
        fp16=False,
        use_cpu=False,
        disable_tqdm=False,
        use_mps_device=True,
    ),
    fed_args=FedAVGArguments(),
    task_type='causal_lm',
    save_trainable_weights_only=True,
)

conf["fed_args_conf"] = {}

homo_nn_0 = HomoNN(
    'nn_0',
    runner_conf=conf,
    train_data=reader_0.outputs["output_data"],
    runner_module="fedkseed_runner",
    runner_class="FedKSeedRunner",
)

pipeline.add_tasks([reader_0, homo_nn_0])
pipeline.conf.set("task", dict(engine_run={"cores": 1}))

pipeline.compile()
pipeline.fit()

zhaoaustin commented 5 months ago

I changed tokenizer_params a bit, which fixed the problem above:

tokenizer_params = dict(
    pretrained_model_name_or_path="gpt2",
    trust_remote_code=True,
    pad_token="<|endoftext|>"  # add a pad_token
)
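(This works because "<|endoftext|>" is GPT-2's end-of-text/EOS token, so it is equivalent to the tokenizer.pad_token = tokenizer.eos_token fix suggested in the traceback. A quick check outside of FATE, assuming transformers is installed:)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", pad_token="<|endoftext|>")
print(tok.pad_token, tok.pad_token_id)       # <|endoftext|> 50256
print(tok.pad_token_id == tok.eos_token_id)  # True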

zhaoaustin commented 5 months ago

But now I've hit the following new problem; could you help with it?

[INFO] [2024-05-02 03:36:43,059] [202405020336083952670] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714621003059, "f_error_report" = '
Traceback (most recent call last):
  File "/data/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 147, in execute_component_from_config
    component.execute(ctx, role, **execution_io.get_kwargs())
  File "/data/projects/fate/fate/python/fate/components/core/component_desc/_component.py", line 101, in execute
    return self.callback(ctx, role, **kwargs)
  File "/data/projects/fate/fate/python/fate/components/components/homo_nn.py", line 61, in train
    train_procedure(
  File "/data/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 155, in train_procedure
    runner.train(train_data, validate_data, output_dir, saved_model_path)
  File "/data/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 270, in train
    trainer.train()
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 123, in train
    direction_derivative_history = self.train_once(
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 154, in train_once
    trainer.train()
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/trainer.py", line 96, in training_step
    loss = self._kseed_optimizer.kseed_zeroth_order_step(closure=closure)
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/optimizer.py", line 228, in kseed_zeroth_order_step
    directional_derivative_value, loss_right, loss_left = self.zeroth_order_step(seed, closure)
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/optimizer.py", line 129, in zeroth_order_step
    loss_right = closure()
  File "/data/projects/fate/fate/python/fate_llm/fedkseed/trainer.py", line 90, in closure
    return self.compute_loss(model, inputs, return_outputs=False).detach()
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
    outputs = model(**inputs)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in forward
    inputs, module_kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 197, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 74, in scatter_kwargs
    scattered_kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 60, in scatter
    res = scatter_map(inputs)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 51, in scatter_map
    return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 47, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 43, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 187, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: peer mapping resources exhausted
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

sagewe commented 2 months ago

Please try the new version; if you still have problems, reopen the issue.