allenai / open-instruct

flash-attn needs data type of bfloat16 or float16 #165

Closed: notoookay closed this issue 4 months ago

notoookay commented 4 months ago

Hi, I tried to fine-tune the Llama 2 model and wanted to use flash-attn, but it seems that flash-attn only supports float16 or bfloat16. You may want to check this.
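For reference, the dtype requirement shows up even when calling the kernel directly. A minimal sketch (assuming a CUDA GPU and a working flash-attn build; shapes are arbitrary):

```python
import torch
from flash_attn import flash_attn_func

# flash-attn kernels accept fp16/bf16 tensors; float32 inputs raise an error.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)  # (batch, seqlen, nheads, headdim)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)          # works in bf16 (or fp16)
# flash_attn_func(q.float(), k.float(), v.float())   # raises: only fp16/bf16 are supported
```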

hamishivi commented 4 months ago

Hi, I think if you launch the training with accelerate (as suggested in our example scripts) and set mixed_precision to bf16, then everything is autocast to the right format. What command did you run when you got the error, and what did the error look like? It'd be good to have some context for the change, since I haven't encountered this issue before.
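Roughly what I mean, as a small sketch (not our exact finetune.py): when you run `accelerate launch --mixed_precision bf16 ...`, the Accelerator picks the setting up and autocasts the forward pass to bf16.

```python
from accelerate import Accelerator

accelerator = Accelerator()         # reads mixed_precision from the launch flags/config
print(accelerator.mixed_precision)  # "bf16" when launched as in the example script

# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
# With bf16 mixed precision, forward passes then run under autocast to bfloat16.
```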

notoookay commented 4 months ago

Thank you for taking the time to check this. I reran the code without modification on 2x 80GB A100 GPUs with CUDA 12.1, confirmed that mixed_precision is set to bf16 in the launch script, and got the error below:

(rag-demo) root@C.10962379:~$ bash finetune_with_accelerate.sh                                
Training llama model  using 2 GPUs, 1 batch size per GPU, 64 gradient accumulation steps      
The following values were not passed to `accelerate launch` and had defaults used instead:    
                More than one GPU was found, enabling multi-GPU training.                     
                If this was unintended please pass in `--num_processes=1`.                    
        `--dynamo_backend` was set to a value of `'no'`                                       
To avoid this warning pass in values for each of the problematic parameters or run `accelerate
 config`.                                                                                     
[2024-05-26 06:54:50,555] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io: please install the libaio-dev package with apt                          
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.                                        
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH   
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1            
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible       
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING]                                    
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING] ***********************************
******                                                                                        
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment
 variable for each process to be 1 in default, to avoid your system being overloaded, please f
urther tune the variable for optimal performance in your application as needed.               
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING] ***********************************
******                                                                                        
[2024-05-26 06:54:53,743] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)
[2024-05-26 06:54:53,758] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.
[WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-05-26 06:54:54,739] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 06:54:54,750] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 06:54:54,750] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in D
eepSpeed with backend nccl
05/26/2024 06:54:54 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16
ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'overlap_comm': True,
 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', '
stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_ma
x_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16
bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping':
 'auto', 'steps_per_print': inf, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu':
 'auto', 'wall_clock_breakdown': False, 'fp16': {'enabled': False}}

05/26/2024 06:54:54 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16
ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'overlap_comm': True,
 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', '
stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_ma
x_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16
bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping':
 'auto', 'steps_per_print': inf, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu':
 'auto', 'wall_clock_breakdown': False, 'fp16': {'enabled': False}}

/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1
132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Down
loads always resume when possible. If you want to force a new download, use `force_download=Tr
ue`.
  warnings.warn(
/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1
132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Down
loads always resume when possible. If you want to force a new download, use `force_download=Tr
ue`.
  warnings.warn(
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta
-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/config.json
Model config LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
   "hidden_act": "silu", 
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null, 
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.36.2",
  "use_cache": true,
  "vocab_size": 32000
}

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--meta-llama--Ll
ama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/tokenizer.model
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-l
lama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-lla
ma--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/tokenizer_config.json
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Lla
ma-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/tokenizer.json
loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--meta
-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/model.safetensors.ind
ex.json
Detected DeepSpeed ZeRO-3: activating zero.init() for this model
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed i
n a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lea
d to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure t
o move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-05-26 06:54:57,119] [INFO] [partition_parameters.py:345:__exit__] finished initializing 
model - num_params = 0, num_elems = 0.00B
Traceback (most recent call last):
  File "/root/finetune.py", line 899, in <module>
    main()
  File "/root/finetune.py", line 561, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/auto/a
uto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 3462, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/llama/
modeling_llama.py", line 1108, in __init__
Traceback (most recent call last):
  File "/root/finetune.py", line 899, in <module>
    super().__init__(config)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1190, in __init__
    main()
  File "/root/finetune.py", line 561, in main
    config = self._autoset_attn_implementation( 
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1302, in _autoset_attn_implementation
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/auto/a
uto_factory.py", line 566, in from_pretrained
        return model_class.from_pretrained(cls._check_and_enable_flash_attn_2(

  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 3462, in from_pretrained
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1422, in _check_and_enable_flash_attn_2
    raise ValueError(
ValueError: Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. You pas
sed torch.float32, this might lead to unexpected behaviour.
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/llama/
modeling_llama.py", line 1108, in __init__
    super().__init__(config)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1190, in __init__
    config = self._autoset_attn_implementation( 
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1302, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1422, in _check_and_enable_flash_attn_2
    raise ValueError(
ValueError: Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. You pas
sed torch.float32, this might lead to unexpected behaviour.
[2024-05-26 06:55:01,523] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitc
ode: 1) local_rank: 0 (pid: 1904) of binary: /root/miniconda3/envs/rag-demo/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/accele
  rate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 1067, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 771, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/run.py",
 line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-26_06:55:01
  host      : c59e9e8e072e
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1905)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_06:55:01
  host      : c59e9e8e072e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1904)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

At first I thought the problem was my PyTorch version (2.1.2), so I switched to torch 2.0.1 and 2.1.0 with CUDA 11.8 and got the error below:

(rag-demo) root@C.10962598:~$ bash finetune_with_accelerate.sh                                
Training llama model  using 2 GPUs, 1 batch size per GPU, 64 gradient accumulation steps      
The following values were not passed to `accelerate launch` and had defaults used instead:    
                More than one GPU was found, enabling multi-GPU training.                     
                If this was unintended please pass in `--num_processes=1`.                    
        `--dynamo_backend` was set to a value of `'no'`                                       
To avoid this warning pass in values for each of the problematic parameters or run `accelerate
 config`.                                                                                     
[2024-05-26 07:28:21,999] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io: please install the libaio-dev package with apt                          
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.                                        
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH   
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0            
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible       
WARNING:torch.distributed.run:                                                                
*****************************************                                                     
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid you
r system being overloaded, please further tune the variable for optimal performance in your ap
plication as needed.                                                                          
*****************************************                                                     
[2024-05-26 07:28:24,682] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
[2024-05-26 07:28:24,686] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io: please install the libaio-dev package with apt                          
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.                                        
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/importlib/__init__.py", line 126, in imp
ort_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/opt/mo
deling_opt.py", line 46, in <module>
    from flash_attn import flash_attn_func, flash_attn_varlen_func
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/__init__.py", l
ine 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/flash_attn_inte
rface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda 
ImportError: /root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpy
thon-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/finetune.py", line 24, in <module>
    from transformers import (
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1373, in __getattr__
    value = getattr(module, name)
      File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1372, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1384, in _get_module
    raise RuntimeError( 
RuntimeError: Failed to import transformers.models.opt.modeling_opt because of the following e
rror (look up to see its traceback):
/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_
64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/importlib/__init__.py", line 126, in imp
ort_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/opt/mo
deling_opt.py", line 46, in <module>
    from flash_attn import flash_attn_func, flash_attn_varlen_func
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/__init__.py", l
ine 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/flash_attn_inte
rface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda 
ImportError: /root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpy
thon-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/finetune.py", line 24, in <module>
    from transformers import (
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1373, in __getattr__
    value = getattr(module, name)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1372, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1384, in _get_module
    raise RuntimeError( 
RuntimeError: Failed to import transformers.models.opt.modeling_opt because of the following e
rror (look up to see its traceback):
/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_
64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1
910) of binary: /root/miniconda3/envs/rag-demo/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/accele
rate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 1067, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 771, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/run.py",
 line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 250, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-26_07:28:27
  host      : fd24cacc2f1a
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1911)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_07:28:27
  host      : fd24cacc2f1a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1910)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Maybe it's a problem with CUDA; I will look into it later, and for now I will use the code I have modified. If you'd like me to help confirm the error, please let me know.
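For what it's worth, my workaround is roughly along these lines (a simplified sketch, not necessarily my exact edit): pass an explicit half-precision dtype so the flash-attention check never sees float32.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: load directly in bf16 so the flash-attention dtype check passes.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```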

Thanks again for your time.

hamishivi commented 4 months ago

Thanks for the response! It looks like there is indeed an error in the first case. The second case looks like flash-attention simply isn't compiled correctly for the CUDA version after the switch (flash-attention can be a bit tricky to install, and it's common to get errors when a build from one environment is used in another).

Does using torch 2.2.1 and flash attention 2.5.2 work? We have a PR to upgrade to these versions, and will probably merge it in sometime soon (just need to do some testing and such).
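A quick way to check for that kind of mismatch (just a sketch): if the import below fails with an "undefined symbol" error, flash-attn was compiled against a different torch/CUDA and needs to be reinstalled in the current environment (e.g. `pip install flash-attn==2.5.2 --no-build-isolation` after upgrading torch).

```python
import torch

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)

# This is the extension that failed to load in your second log.
import flash_attn_2_cuda  # noqa: F401

print("flash-attn CUDA extension imported OK")
```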

notoookay commented 4 months ago

Sorry for the late reply. I tried torch 2.2.1 and flash-attn 2.5.2 and still got the same error as the first one above.

Maybe it's not a problem with torch or flash-attn; it feels more like an issue with accelerate (I guess).

If it's working fine on your side, it's probably something about my configuration.
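If it helps narrow things down, here is a quick check script (just a sketch) that prints the versions I'd compare between our configurations:

```python
import accelerate
import flash_attn
import torch
import transformers

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("flash-attn:", flash_attn.__version__)
print("bf16 supported:", torch.cuda.is_bf16_supported())
```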

notoookay commented 4 months ago

I have no further issues, so I will close this. Thanks for the help!