fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License

No such file or directory #38

Open hxue3 opened 6 months ago

hxue3 commented 6 months ago

Hi,

I am trying to replicate the training, but I ran into an error that I cannot solve. I replaced the model name in mono_ft.sh with your model name, haoranxu/ALMA-7B-Pretrain. When I then ran the script, I saw this error:

pytorch_model.bin.index.json: 100%|██████████| 26.8k/26.8k [00:00<00:00, 203MB/s]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]
[INFO|modeling_utils.py:3257] 2024-03-27 20:40:50,775 >> loading weights file pytorch_model.bin from cache at .cache/models/models--haoranxu--ALMA-7B-Pretrain/snapshots/a00b4a7a96c38117ac6a4e3e7228e7b06ba992ff/pytorch_model.bin.index.json
pytorch_model-00001-of-00003.bin: 100%|██████████| 9.88G/9.88G [04:49<00:00, 34.1MB/s]
Downloading shards:   0%|          | 0/3 [04:50<?, ?it/s]
Traceback (most recent call last):
  File "/s/mlsc/hxue3/alma_experiments/ALMA/run_llmmt.py", line 226, in <module>
  File "/s/mlsc/hxue3/alma_experiments/ALMA/run_llmmt.py", line 155, in main
  File "/s/mlsc/hxue3/alma_experiments/ALMA/utils/utils.py", line 350, in load_model
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3264, in from_pretrained
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/utils/hub.py", line 1038, in get_checkpoint_shard_files
    cached_filename = cached_file(
                      ^^^^^^^^^^^^
  File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
                    ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py", line 1504, in hf_hub_download
    _chmod_and_replace(temp_file.name, blob_path)
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py", line 1724, in _chmod_and_replace
    os.chmod(src, stat.S_IMODE(cache_dir_mode))
FileNotFoundError: [Errno 2] No such file or directory: '/s/mlsc/hxue3/alma_experiments/ALMA/.cache/models/tmprvwvokuy'
[2024-03-27 20:45:42,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 581 closing signal SIGTERM
[2024-03-27 20:45:42,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 582 closing signal SIGTERM
[2024-03-27 20:45:42,973] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 583 closing signal SIGTERM
[2024-03-27 20:45:43,614] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 584) of binary: /usr/bin/python3
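
For what it's worth, the last frame is an os.chmod on a temporary file under the cache that no longer exists, so my guess is that the multiple ranks launched by torch.distributed are each downloading the same shards into the same cache directory, and one rank's temp file gets moved away before another rank can chmod it. If that is the cause, warming the cache once before launching the distributed run might avoid the race. A minimal, untested sketch using huggingface_hub's snapshot_download (the cache path is just copied from the log above):

# Hypothetical workaround, not verified: download the checkpoint once from a
# single process so the distributed ranks only read an already-populated cache
# instead of racing to download the same shards.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="haoranxu/ALMA-7B-Pretrain",
    cache_dir=".cache/models",  # same cache directory as in the log above
)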

Any idea why?