allenai / open-instruct

flash-attn needs data type of bfloat16 or float16 #165

Closed: notoookay closed this issue 4 months ago

notoookay commented 4 months ago

Hi, I tried to fine-tune the Llama 2 model and wanted to use flash-attn, but it seems that flash-attn only supports float16 or bfloat16. You may want to check this.
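For reference, the dtype requirement shows up even when calling the kernel directly. A minimal sketch (assuming a CUDA GPU and a working flash-attn build; shapes are arbitrary):

```python
import torch
from flash_attn import flash_attn_func

# flash-attn kernels accept fp16/bf16 tensors; float32 inputs raise an error.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.bfloat16)  # (batch, seqlen, nheads, headdim)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)          # works in bf16 (or fp16)
# flash_attn_func(q.float(), k.float(), v.float())   # raises: only fp16/bf16 are supported
```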

hamishivi commented 4 months ago

Hi, I think if you launch the training with accelerate (as suggested in our example scripts) and set mixed_precision to bf16, then everything is autocast to the right format. What command did you run when you got the error, and what did the error look like? It'd be good to have some context for the change, since I haven't encountered this issue before.
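Roughly what I mean, as a small sketch (not our exact finetune.py): when you run `accelerate launch --mixed_precision bf16 ...`, the Accelerator picks the setting up and autocasts the forward pass to bf16.

```python
from accelerate import Accelerator

accelerator = Accelerator()         # reads mixed_precision from the launch flags/config
print(accelerator.mixed_precision)  # "bf16" when launched as in the example script

# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
# With bf16 mixed precision, forward passes then run under autocast to bfloat16.
```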

notoookay commented 4 months ago

Thank you for taking the time to check this. I reran the code without modification on 2x 80GB A100 GPUs with CUDA 12.1, confirmed that mixed_precision is set to bf16 in the launch script, and got the error below:

(rag-demo) root@C.10962379:~$ bash finetune_with_accelerate.sh                                
Training llama model  using 2 GPUs, 1 batch size per GPU, 64 gradient accumulation steps      
The following values were not passed to `accelerate launch` and had defaults used instead:    
                More than one GPU was found, enabling multi-GPU training.                     
                If this was unintended please pass in `--num_processes=1`.                    
        `--dynamo_backend` was set to a value of `'no'`                                       
To avoid this warning pass in values for each of the problematic parameters or run `accelerate
 config`.                                                                                     
[2024-05-26 06:54:50,555] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io: please install the libaio-dev package with apt                          
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.                                        
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH   
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1            
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible       
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING]                                    
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING] ***********************************
******                                                                                        
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment
 variable for each process to be 1 in default, to avoid your system being overloaded, please f
urther tune the variable for optimal performance in your application as needed.               
[2024-05-26 06:54:51,509] torch.distributed.run: [WARNING] ***********************************
******                                                                                        
[2024-05-26 06:54:53,743] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)
[2024-05-26 06:54:53,758] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.
[WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-05-26 06:54:54,739] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 06:54:54,750] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-26 06:54:54,750] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in D
eepSpeed with backend nccl
05/26/2024 06:54:54 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16
ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'overlap_comm': True,
 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', '
stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_ma
x_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16
bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping':
 'auto', 'steps_per_print': inf, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu':
 'auto', 'wall_clock_breakdown': False, 'fp16': {'enabled': False}}

05/26/2024 06:54:54 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16
ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'overlap_comm': True,
 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', '
stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_ma
x_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16
bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping':
 'auto', 'steps_per_print': inf, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu':
 'auto', 'wall_clock_breakdown': False, 'fp16': {'enabled': False}}

/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1
132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Down
loads always resume when possible. If you want to force a new download, use `force_download=Tr
ue`.
  warnings.warn(
/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1
132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Down
loads always resume when possible. If you want to force a new download, use `force_download=Tr
ue`.
  warnings.warn(
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta
-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/config.json
Model config LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
   "hidden_act": "silu", 
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null, 
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.36.2",
  "use_cache": true,
  "vocab_size": 32000
}

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--meta-llama--Ll
ama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/tokenizer.model
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-l
lama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-lla
ma--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/tokenizer_config.json
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Lla
ma-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/tokenizer.json
loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--meta
-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/model.safetensors.ind
ex.json
Detected DeepSpeed ZeRO-3: activating zero.init() for this model
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed i
n a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lea
d to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure t
o move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-05-26 06:54:57,119] [INFO] [partition_parameters.py:345:__exit__] finished initializing 
model - num_params = 0, num_elems = 0.00B
Traceback (most recent call last):
  File "/root/finetune.py", line 899, in <module>
    main()
  File "/root/finetune.py", line 561, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/auto/a
uto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 3462, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/llama/
modeling_llama.py", line 1108, in __init__
Traceback (most recent call last):
  File "/root/finetune.py", line 899, in <module>
    super().__init__(config)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1190, in __init__
    main()
  File "/root/finetune.py", line 561, in main
    config = self._autoset_attn_implementation( 
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1302, in _autoset_attn_implementation
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/auto/a
uto_factory.py", line 566, in from_pretrained
        return model_class.from_pretrained(cls._check_and_enable_flash_attn_2(

  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 3462, in from_pretrained
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1422, in _check_and_enable_flash_attn_2
    raise ValueError(
ValueError: Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. You pas
sed torch.float32, this might lead to unexpected behaviour.
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/llama/
modeling_llama.py", line 1108, in __init__
    super().__init__(config)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/deepspeed/runtime/zero/par
tition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1190, in __init__
    config = self._autoset_attn_implementation( 
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1302, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/modeling_util
s.py", line 1422, in _check_and_enable_flash_attn_2
    raise ValueError(
ValueError: Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. You pas
sed torch.float32, this might lead to unexpected behaviour.
[2024-05-26 06:55:01,523] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitc
ode: 1) local_rank: 0 (pid: 1904) of binary: /root/miniconda3/envs/rag-demo/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/accele
  rate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 1067, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 771, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/run.py",
 line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-26_06:55:01
  host      : c59e9e8e072e
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1905)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_06:55:01
  host      : c59e9e8e072e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1904)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

At first I thought the problem was my PyTorch version (2.1.2), so I switched to torch 2.0.1 and 2.1.0 with CUDA 11.8 and got the error below:

(rag-demo) root@C.10962598:~$ bash finetune_with_accelerate.sh                                
Training llama model  using 2 GPUs, 1 batch size per GPU, 64 gradient accumulation steps      
The following values were not passed to `accelerate launch` and had defaults used instead:    
                More than one GPU was found, enabling multi-GPU training.                     
                If this was unintended please pass in `--num_processes=1`.                    
        `--dynamo_backend` was set to a value of `'no'`                                       
To avoid this warning pass in values for each of the problematic parameters or run `accelerate
 config`.                                                                                     
[2024-05-26 07:28:21,999] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io: please install the libaio-dev package with apt                          
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.                                        
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH   
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0            
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible       
WARNING:torch.distributed.run:                                                                
*****************************************                                                     
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid you
r system being overloaded, please further tune the variable for optimal performance in your ap
plication as needed.                                                                          
*****************************************                                                     
[2024-05-26 07:28:24,682] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
[2024-05-26 07:28:24,686] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelera
tor to cuda (auto detect)                                                                     
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found. 
 [WARNING]  async_io: please install the libaio-dev package with apt                          
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.                                        
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and L
DFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/importlib/__init__.py", line 126, in imp
ort_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/opt/mo
deling_opt.py", line 46, in <module>
    from flash_attn import flash_attn_func, flash_attn_varlen_func
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/__init__.py", l
ine 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/flash_attn_inte
rface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda 
ImportError: /root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpy
thon-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/finetune.py", line 24, in <module>
    from transformers import (
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1373, in __getattr__
    value = getattr(module, name)
      File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1372, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1384, in _get_module
    raise RuntimeError( 
RuntimeError: Failed to import transformers.models.opt.modeling_opt because of the following e
rror (look up to see its traceback):
/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_
64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/importlib/__init__.py", line 126, in imp
ort_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/models/opt/mo
deling_opt.py", line 46, in <module>
    from flash_attn import flash_attn_func, flash_attn_varlen_func
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/__init__.py", l
ine 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn/flash_attn_inte
rface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda 
ImportError: /root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpy
thon-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/finetune.py", line 24, in <module>
    from transformers import (
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1373, in __getattr__
    value = getattr(module, name)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1372, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/transformers/utils/import_
utils.py", line 1384, in _get_module
    raise RuntimeError( 
RuntimeError: Failed to import transformers.models.opt.modeling_opt because of the following e
rror (look up to see its traceback):
/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_
64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1
910) of binary: /root/miniconda3/envs/rag-demo/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/rag-demo/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/accele
rate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 1067, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/accelerate/commands/launch
.py", line 771, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/run.py",
 line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/rag-demo/lib/python3.10/site-packages/torch/distributed/launcher
/api.py", line 250, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-26_07:28:27
  host      : fd24cacc2f1a
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1911)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_07:28:27
  host      : fd24cacc2f1a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1910)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Maybe it's a problem with CUDA; I will look into it later, and for now I will use the code I have modified. If you'd like me to help confirm the error, please let me know.
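For what it's worth, my workaround is roughly along these lines (a simplified sketch, not necessarily my exact edit): pass an explicit half-precision dtype so the flash-attention check never sees float32.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: load directly in bf16 so the flash-attention dtype check passes.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```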

Thanks again for your time.

hamishivi commented 4 months ago

Thanks for the response! It looks like there is indeed an error in the first case. The second case looks like flash-attention simply isn't compiled correctly for the CUDA version after the switch (flash-attention can be a bit tricky to install, and it's common to get errors when a build from one environment is used in another).

Does using torch 2.2.1 and flash attention 2.5.2 work? We have a PR to upgrade to these versions, and will probably merge it in sometime soon (just need to do some testing and such).
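A quick way to check for that kind of mismatch (just a sketch): if the import below fails with an "undefined symbol" error, flash-attn was compiled against a different torch/CUDA and needs to be reinstalled in the current environment (e.g. `pip install flash-attn==2.5.2 --no-build-isolation` after upgrading torch).

```python
import torch

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)

# This is the extension that failed to load in your second log.
import flash_attn_2_cuda  # noqa: F401

print("flash-attn CUDA extension imported OK")
```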

notoookay commented 4 months ago

Sorry for the late reply. I tried torch 2.2.1 and flash-attn 2.5.2 and still got the same error as the first one above.

Maybe it's not a problem with torch or flash-attn; it feels more like an issue with accelerate (I guess).

If it's working fine on your side, it's probably something about my configuration.
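If it helps narrow things down, here is a quick check script (just a sketch) that prints the versions I'd compare between our configurations:

```python
import accelerate
import flash_attn
import torch
import transformers

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("flash-attn:", flash_attn.__version__)
print("bf16 supported:", torch.cuda.is_bf16_supported())
```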

notoookay commented 4 months ago

I have no further issues, so I will close this. Thanks for the help!