I would like to verify whether the following accelerate configuration is suitable for running the pretraining script. Below is the configuration that I used:
-----------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file: /mnt/prjM3D/M3D/LaMed/default_config.json
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: no
Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:8
accelerate configuration saved at /home/htong/.cache/huggingface/accelerate/default_config.yaml
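
For context, the default_config.yaml that accelerate saves for answers like the ones above should look roughly like the following. This is a sketch reconstructed from the prompts, not a copy of the exact file on my machine, and individual keys can differ between accelerate versions:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_config_file: /mnt/prjM3D/M3D/LaMed/default_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
use_cpu: false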
I've encountered an issue when running the pretraining script: the script fails to load the tokenizer from the local path when --num_processes is set to a value greater than 1. However, when I set --num_processes to 1, the tokenizer loads successfully. I suspect that this might be related to the accelerate or DeepSpeed framework. Below is the error trace:
==================== Tokenizer preparation ====================
Traceback (most recent call last):
File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/transformers/utils/hub.py", line 402, in cached_file
resolved_file = hf_hub_download(
File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
return f(*args, **kwargs)
File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1240, in hf_hub_download
return _hf_hub_download_to_cache_dir(
File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1347, in _hf_hub_download_to_cache_dir
_raise_on_head_call_error(head_call_error, force_download, local_files_only)
File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1848, in _raise_on_head_call_error
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.
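
My current reading of this trace is that with more than one process, every rank tries to resolve the tokenizer files on its own, and any rank that misses the disk cache then attempts a Hub lookup, which fails because outgoing traffic is disabled. For illustration, the kind of guard I am considering looks roughly like the sketch below; the tokenizer path is a placeholder, and wrapping the load in Accelerator.main_process_first() is my assumption rather than what the pretraining script currently does:

from accelerate import Accelerator
from transformers import AutoTokenizer

accelerator = Accelerator()

# Let rank 0 populate the local cache first; the other processes only enter
# the block after rank 0 has left it, so they read from disk instead of
# trying to reach hf.co.
with accelerator.main_process_first():
    tokenizer = AutoTokenizer.from_pretrained(
        "/path/to/local/tokenizer",  # placeholder: local directory with the tokenizer files
        local_files_only=True,       # fail fast on a cache miss instead of attempting a download
    )

The local_files_only flag is there only to surface cache misses immediately rather than waiting on a blocked network call.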
How can I resolve this issue? Any advice or guidance on how to configure accelerate and DeepSpeed to work correctly with multi-GPU setups would be greatly appreciated.