alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Apache License 2.0

[rank31]: OSError: error stat()ing file (dataset map problem) #305

Open shyzzz521 opened 2 months ago

shyzzz521 commented 2 months ago

rank31: Traceback (most recent call last):
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/examples/qwen2/pretrain_qwen.py", line 214, in <module>
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/Megatron-LM-240612/megatron/training/training.py", line 251, in pretrain
rank31:     = build_train_valid_test_data_iterators(
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/Megatron-LM-240612/megatron/training/training.py", line 1461, in build_train_valid_test_data_iterators
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/Megatron-LM-240612/megatron/training/training.py", line 1422, in build_train_valid_test_data_loaders
rank31:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/Megatron-LM-240612/megatron/training/training.py", line 1392, in build_train_valid_test_datasets
rank31:     return build_train_valid_test_datasets_provider(train_valid_test_num_samples)
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/examples/qwen2/pretrain_qwen.py", line 190, in train_valid_test_datasets_provider
rank31:     train_ds, valid_ds, test_ds = build_pretrain_dataset_from_original(args.dataset)
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/megatron_patch/data/__init__.py", line 87, in build_pretrain_dataset_from_original
rank31:     train_dataset = LLamaRawDataset(args.train_data_path, args.max_padding_length)
rank31:   File "/home/jovyan/dataws1/magron/Pai-Megatron-Patch/megatron_patch/data/llama.py", line 80, in __init__
rank31:     train_dataset = list_data_dict.map(
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
rank31:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
rank31:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3093, in map
rank31:     for rank, done, content in Dataset._map_single(**dataset_kwargs):
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3522, in _map_single
rank31:     yield rank, True, Dataset.from_file(cache_file_name, info=info, split=shard.split)
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 762, in from_file
rank31:     table = ArrowReader.read_table(filename, in_memory=in_memory)
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/arrow_reader.py", line 357, in read_table
rank31:     return table_cls.from_file(filename)
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/table.py", line 1022, in from_file
rank31:     table = _memory_mapped_arrow_table_from_file(filename)
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/table.py", line 64, in _memory_mapped_arrow_table_from_file
rank31:     opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
rank31:   File "/home/jovyan/kys-workspace-zzzc/anaconda3/envs/megatron/lib/python3.10/site-packages/datasets/table.py", line 49, in _memory_mapped_record_batch_reader_from_file
rank31:     memory_mapped_stream = pa.memory_map(filename)
rank31:   File "pyarrow/io.pxi", line 1009, in pyarrow.lib.memory_map
rank31:   File "pyarrow/io.pxi", line 956, in pyarrow.lib.MemoryMappedFile._open
rank31:   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
rank31:   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
rank31: OSError: error stat()ing file

How can this be resolved?
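
A possible first diagnostic step, offered as a sketch rather than a confirmed fix: the final frames show `pa.memory_map(filename)` failing while stat()ing the cache file that `Dataset.map` had just tried to reopen, so it may help to check from the failing node whether that file is visible and can be memory-mapped. The helper name `check_mappable` is hypothetical, and the path has to be supplied by hand, because the traceback does not show which file was involved.

```python
import mmap
import os


def check_mappable(path):
    """Return (ok, detail): can `path` be stat()'d and memory-mapped read-only?

    Hypothetical helper for diagnosis only; it is not part of
    Pai-Megatron-Patch, datasets, or pyarrow.
    """
    try:
        size = os.stat(path).st_size
    except OSError as exc:
        # Same class of failure as in the traceback: the file cannot be stat()'d.
        return False, f"stat failed: {exc}"
    if size == 0:
        # An empty file cannot be memory-mapped, so report it separately.
        return False, "file exists but is empty"
    try:
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ):
            pass
    except (OSError, ValueError) as exc:
        return False, f"mmap failed: {exc}"
    return True, f"ok ({size} bytes)"
```

Calling it on each node against the suspected cache file would show whether the path is missing, empty, or unreadable there, which would narrow down whether the problem lies with the storage visible to rank 31 or somewhere else.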