FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

raise RuntimeError("mmap can only be used with files saved with #846

Open ben-8878 opened 2 months ago

ben-8878 commented 2 months ago
Unsloth: Offloading input_embeddings to disk to save VRAM
Traceback (most recent call last):
  File "/data/llmodel/Tools/software_install/anaconda3/envs/unsloth/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/llmodel/Tools/software_install/anaconda3/envs/unsloth/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/llmodel/zyb/FlagEmbedding/Long_LLM/longllm_qlora/main/train.py", line 112, in <module>
    main()
  File "/data/llmodel/zyb/FlagEmbedding/Long_LLM/longllm_qlora/main/train.py", line 47, in main
    model = FastLanguageModel.get_peft_model(
  File "/data/llmodel/Tools/software_install/anaconda3/envs/unsloth/lib/python3.10/site-packages/unsloth/models/llama.py", line 1598, in get_peft_model
    offload_input_embeddings(model, temporary_location)
  File "/data/llmodel/Tools/software_install/anaconda3/envs/unsloth/lib/python3.10/site-packages/unsloth/models/_utils.py", line 479, in offload_input_embeddings
    offloaded_W = offload_to_disk(model.get_input_embeddings(), model, "input_embeddings", temporary_location)
  File "/data/llmodel/Tools/software_install/anaconda3/envs/unsloth/lib/python3.10/site-packages/unsloth/models/_utils.py", line 472, in offload_to_disk
    offloaded_W = torch.load(filename, map_location = "cpu", mmap = True)
  File "/data/llmodel/.local/lib/python3.10/site-packages/torch/serialization.py", line 1032, in load
    raise RuntimeError("mmap can only be used with files saved with "
RuntimeError: mmap can only be used with files saved with `torch.save(_use_new_zipfile_serialization=True), please torch.save your checkpoint with this option in order to use mmap.
Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 0 MLP layers.
Unsloth: Casting embed_tokens to float32
[2024-05-31 09:13:56,940] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 103260 closing signal SIGTERM
[2024-05-31 09:13:56,940] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 103261 closing signal SIGTERM
[2024-05-31 09:13:59,309] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 103259) of binary: /data/llmodel/Tools/software_install/anaconda3/envs/unsloth/bin/python
Traceback (most recent call last):
  File "/data/llmodel/Tools/software_install/anaconda3/envs/unsloth/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
  File "/data/llmodel/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/llmodel/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/llmodel/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/llmodel/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/llmodel/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-31_09:13:56
  host      : ubuntu
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 103262)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-31_09:13:56
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 103259)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
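For context: `torch.load(..., mmap=True)` only works on checkpoints stored in the zipfile format that `torch.save` has used by default since PyTorch 1.6, and the same check fails on a file that is truncated or still being written. Unsloth's `offload_to_disk` saves with plain `torch.save`, so the format itself should be fine; under `torchrun`, the likely cause is that one rank mmap-loads the temporary offload file while another rank is still writing it (which matches the workaround in the next comment). A standalone sketch of the constraint itself, independent of Unsloth:

```python
import torch

w = torch.randn(4, 8)

# Default since PyTorch 1.6: zipfile serialization; mmap loading works.
torch.save(w, "ok.pt")
loaded = torch.load("ok.pt", map_location="cpu", mmap=True)

# Legacy serialization (or a truncated, partially written file) fails the
# zip-file check and raises the RuntimeError shown in the traceback above.
torch.save(w, "legacy.pt", _use_new_zipfile_serialization=False)
try:
    torch.load("legacy.pt", map_location="cpu", mmap=True)
except RuntimeError as e:
    print(e)
```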
disperaller commented 2 months ago

I encountered the same issue. The way I resolved it was to put a `torch.distributed.barrier()` before dataset loading, to ensure all processes have finished before entering the data-preparation step. This should reduce the chance of running into the issue. To avoid it further, I even put a `time.sleep(10)` before the zip-file integrity check in the same file where the error is raised. As of today, I haven't hit this issue again after these two modifications.
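A rough sketch of the two modifications, assuming a standard `torchrun` launch; `build_model()` and `load_dataset()` are hypothetical placeholders for your own pipeline:

```python
import torch
import torch.distributed as dist

def build_model():
    # hypothetical placeholder for FastLanguageModel.get_peft_model(...),
    # which (per the traceback above) offloads input_embeddings to a
    # temporary file and immediately mmap-loads it back
    return torch.nn.Linear(8, 8)

def load_dataset():
    # hypothetical placeholder for your data-preparation step
    return list(range(100))

def main():
    # torchrun provides the env:// variables this call reads
    dist.init_process_group(backend="nccl")

    model = build_model()

    # Modification 1: synchronize all ranks here, so no rank moves on to
    # data preparation while another rank is still writing the offloaded
    # embedding file that torch.load(..., mmap=True) will read.
    dist.barrier()

    dataset = load_dataset()
    # ... training loop ...

if __name__ == "__main__":
    main()

# Modification 2 (more invasive): in torch/serialization.py, just before
# the zip-file check that raises the RuntimeError above, insert
#     time.sleep(10)
# to give a partially written file time to finish. This only papers over
# the race; the barrier is the main fix.
```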

ben-8878 commented 2 months ago

@disperaller do you have some sample code? Thanks.