Hi,

I am trying to replicate the training, but I encountered an error that I cannot solve. I replaced the model name in `mono_ft.sh` with your model name `haoranxu/ALMA-7B-Pretrain`. Then, when I ran the script, I saw this error:
```
pytorch_model.bin.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 203MB/s]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s][INFO|modeling_utils.py:3257] 2024-03-27 20:40:50,775 >> loading weights file pytorch_model.bin from cache at .cache/models/models--haoranxu--ALMA-7B-Pretrain/snapshots/a00b4a7a96c38117ac6a4e3e7228e7b06ba992ff/pytorch_model.bin.index.json
pytorch_model-00001-of-00003.bin: 100%|████████████████████████████████████████████████████████████████████████████████| 9.88G/9.88G [04:49<00:00, 34.1MB/s]
Downloading shards:   0%|          | 0/3 [04:50<?, ?it/s]
Traceback (most recent call last):
  File "/s/mlsc/hxue3/alma_experiments/ALMA/run_llmmt.py", line 226, in <module>
  File "/s/mlsc/hxue3/alma_experiments/ALMA/run_llmmt.py", line 155, in main
  File "/s/mlsc/hxue3/alma_experiments/ALMA/utils/utils.py", line 350, in load_model
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3264, in from_pretrained
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/utils/hub.py", line 1038, in get_checkpoint_shard_files
    cached_filename = cached_file(
                      ^^^^^^^^^^^^
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
                    ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py", line 1504, in hf_hub_download
    _chmod_and_replace(temp_file.name, blob_path)
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py", line 1724, in _chmod_and_replace
    os.chmod(src, stat.S_IMODE(cache_dir_mode))
FileNotFoundError: [Errno 2] No such file or directory: '/s/mlsc/hxue3/alma_experiments/ALMA/.cache/models/tmprvwvokuy'
[2024-03-27 20:45:42,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 581 closing signal SIGTERM
[2024-03-27 20:45:42,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 582 closing signal SIGTERM
[2024-03-27 20:45:42,973] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 583 closing signal SIGTERM
[2024-03-27 20:45:43,614] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 584) of binary: /usr/bin/python3
```
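The only workaround I can think of (not verified) is that the distributed ranks may be racing to download into the same cache directory, so the temp file one rank is renaming gets deleted by another. A minimal sketch, assuming `huggingface_hub`'s `snapshot_download`, that pre-fetches all shards in a single process before launching `torchrun`:

```python
# Hypothetical pre-download step: fetch the full checkpoint once, so the
# training ranks only read from the cache and never race on downloads.
# (The repo id and cache_dir below just mirror my setup; adjust as needed.)
from huggingface_hub import snapshot_download


def predownload(repo_id: str, cache_dir: str) -> str:
    """Download every file of the repo into cache_dir; returns the local path."""
    return snapshot_download(repo_id=repo_id, cache_dir=cache_dir)


if __name__ == "__main__":
    # Run this once, single-process, before starting distributed training.
    predownload("haoranxu/ALMA-7B-Pretrain", ".cache/models")
```

I have not confirmed this is the cause, though, so I would still like to know the intended fix.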
Any idea why?