BAAI-DCAI / M3D

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Issue with accelerate Configuration and Multi-Process Tokenizer Loading #14

Closed htong3031 closed 3 months ago

htong3031 commented 3 months ago

I would like to verify whether my accelerate configuration is suitable for running the pretraining script. Below is the interactive `accelerate config` session I went through:

In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: yes
Do you wish to optimize your script with torch dynamo? [yes/NO]: no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file: /mnt/prjM3D/M3D/LaMed/default_config.json
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: no
Do you want to enable Mixture-of-Experts training (MoE)? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]: 8
accelerate configuration saved at /home/htong/.cache/huggingface/accelerate/default_config.yaml

Here is the content of my default_config.yaml:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_config_file: /mnt/prjM3D/M3D/LaMed/default_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
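
To double-check that this configuration is actually picked up across all 8 GPUs, independently of the M3D code, I put together a small sanity script (just my own sketch, not part of the repo; run with `accelerate launch check_accelerate.py`):

# check_accelerate.py -- my own sanity check, not part of the M3D repo.
# Each spawned process should report its rank, device, and distributed type.
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process {accelerator.process_index}/{accelerator.num_processes} "
    f"on {accelerator.device}, distributed_type={accelerator.distributed_type}"
)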

Additionally, I would like to check if the following DeepSpeed JSON configuration file is appropriate for my pretraining script:

{
    "compute_environment": "LOCAL_MACHINE",
    "debug": false,
    "deepspeed_config": {
      "gradient_accumulation_steps": 1,
      "zero3_init_flag": false,
      "zero_stage": 0
    },
    "zero_optimization": {
      "stage": 0,
      "allgather_partitions": true,
      "reduce_scatter": true,
      "allgather_bucket_size": 5e8,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "train_batch_size": 1024 
    },
    "distributed_type": "DEEPSPEED",
    "downcast_bf16": "no",
    "machine_rank": 0,
    "main_training_function": "main",
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 8,
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_env": [],
    "tpu_use_cluster": false,
    "tpu_use_sudo": false,
    "use_cpu": false
}

I've encountered an issue when running the pretraining script: it fails to load the tokenizer from the local path whenever --num_processes is greater than 1, but with --num_processes set to 1 the tokenizer loads successfully. I suspect this is related to the accelerate or DeepSpeed framework. Below is the error trace:

==================== Tokenizer preparation ====================
Traceback (most recent call last):
  File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/transformers/utils/hub.py", line 402, in cached_file
    resolved_file = hf_hub_download(
  File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1240, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1347, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/htong/miniconda3/envs/M3D/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1848, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

How can I resolve this issue? Any advice or guidance on how to configure accelerate and DeepSpeed to work correctly with multi-GPU setups would be greatly appreciated.
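
For what it's worth, the workaround I'm currently considering is to let only the local main process touch the tokenizer files and have the other ranks wait at a barrier, roughly like the sketch below (the path and function name are placeholders from my own setup, not the repo's actual code):

import os

import torch.distributed as dist
from transformers import AutoTokenizer

TOKENIZER_PATH = "/path/to/local/tokenizer"  # placeholder for my local checkpoint dir

def load_tokenizer_rank0_first():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    # Non-main ranks wait until rank 0 has resolved/populated the cache.
    if dist.is_initialized() and local_rank != 0:
        dist.barrier()
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, local_files_only=True)
    # Rank 0 releases the other ranks once the files are in place.
    if dist.is_initialized() and local_rank == 0:
        dist.barrier()
    return tokenizer

Does this look like the right direction, or is there a recommended way to handle the tokenizer path in the multi-GPU setup?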