DachengLi1 / LongChat

Official repository for LongChat and LongEval

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #24

Open · ChaoyuHuang opened this issue 1 year ago

ChaoyuHuang commented 1 year ago

When I ran:

```
python -m torch.distributed.run --nproc_per_node=2 longchat/train/fine_tune/train.py \
    --model_name_or_path /mnt/yuchao/open_model/longchat/longchat-13b-16k \
    --data_path /mnt/workspace/sft_data.json \
    --bf16 \
    --output_dir /mnt/yuchao/yuchao/longchat-13b-16k \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 16384 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```

My data format is Alpaca-style.

The error is:

```
Traceback (most recent call last):
  File "/mnt/workspace/LongChat/longchat/train/fine_tune/train.py", line 268, in <module>
    train()
  File "/mnt/workspace/LongChat/longchat/train/fine_tune/train.py", line 262, in train
    trainer.train()
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/transformers/trainer.py", line 1899, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/workspace/LongChat/longchat/train/fine_tune/train.py", line 210, in __getitem__
    data_dict = preprocess([e["conversations"] for e in sources], self.tokenizer)
  File "/mnt/workspace/LongChat/longchat/train/fine_tune/train.py", line 210, in <listcomp>
    data_dict = preprocess([e["conversations"] for e in sources], self.tokenizer)
KeyError: 'conversations'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 71411 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 71412) of binary: /home/pai/envs/longeval/bin/python
Traceback (most recent call last):
  File "/home/pai/envs/longeval/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pai/envs/longeval/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pai/envs/longeval/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

longchat/train/fine_tune/train.py FAILED
```
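
The traceback points at the cause: `preprocess()` in train.py reads each record with `e["conversations"]`, but Alpaca-style records use `instruction`/`input`/`output` keys, so the lookup raises `KeyError: 'conversations'`. Below is a minimal conversion sketch, assuming the ShareGPT-style `from`/`value` turn schema that FastChat-derived trainers typically expect; the field names and the output path are assumptions, so check them against the sample data this repo ships.

```python
import json

def alpaca_to_conversations(record):
    """Convert one Alpaca-style record (instruction/input/output) into the
    {"conversations": [...]} shape that train.py's preprocess() indexes."""
    # Fold the optional "input" field into the human turn.
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    # NOTE: the "from"/"value" layout is the ShareGPT convention used by
    # FastChat-style trainers -- an assumption here; verify the exact field
    # names against the repo's example data.
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }

with open("/mnt/workspace/sft_data.json") as f:
    alpaca_records = json.load(f)

converted = [alpaca_to_conversations(r) for r in alpaca_records]

# Hypothetical output path; pass it to --data_path instead of the original file.
with open("/mnt/workspace/sft_data_conversations.json", "w") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)
```

After writing the converted file, rerun the same training command with `--data_path` pointing at it.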