axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/

Using two 8xH100 nodes to train, encountering error: bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above. #1924

Open michaellin99999 opened 1 month ago

michaellin99999 commented 1 month ago

Please check that this issue hasn't been reported before.

Expected Behavior

This error should not occur, as the H100 definitely supports bf16.

Current behaviour

Outputs the error: Value error, bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above.

Steps to reproduce

Follow the multi-node guide at https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/multi-node.qmd

Config yaml

base_model: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path:
val_set_size: 0.02
adapter: lora
lora_model_dir:
sequence_len: 1024
sample_packing: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./outputs/lora-out
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
gptq_groupsize:
s2_attention:
gptq_model_v1:
warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
- full_shard
- auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
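
The precision keys above (bf16, fp16, tf32) are what this validation checks. As a point of comparison, many of axolotl's example configs use bf16: auto, which defers to whatever bf16 support the library detects on the GPU; a minimal sketch of that variant, assuming the installed version accepts it and leaving everything else unchanged:

bf16: auto   # let axolotl enable bf16 only where the GPU reports support
fp16: false
tf32: false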

Possible solution

no idea what is causing this issue.

Which Operating Systems are you using?

Python Version

3.11.9

axolotl branch-commit

none

Acknowledgements

michaellin99999 commented 1 month ago

The same settings work when used in regular (single-node) training.

michaellin99999 commented 1 month ago

settings in accelerate:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
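
Worth noting: this accelerate file requests mixed_precision: fp16, while the axolotl YAML above sets bf16: true, so the two precision requests disagree. If an accelerate config is kept at all, the precision lines would presumably need to match the training config; a sketch of the bf16-aligned values (an assumption, not a verified fix):

mixed_precision: bf16   # align with bf16: true in the axolotl YAML
downcast_bf16: 'no'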

michaellin99999 commented 1 month ago

this is the snippet for the multinode slave settings:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 1
main_process_ip: 192.168.108.22
main_process_port: 5000
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
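
For reference, the rank-0 snippet in the earlier comment lists machine_rank: 0 with num_machines: 1, num_processes: 8, and no main_process_ip, which describes a single-node run rather than the head of this two-node job. A sketch of what the rank-0 counterpart of the file above would presumably look like for two 8-GPU nodes (IP, port, and backend copied from the snippet above; mixed_precision shown as bf16 to match the training config, which is an assumption):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0              # head node
main_process_ip: 192.168.108.22
main_process_port: 5000
main_training_function: main
mixed_precision: bf16        # assumption: align with bf16: true in the axolotl YAML
num_machines: 2
num_processes: 16            # 2 nodes x 8 GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false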

winglian commented 1 month ago

I recommend not using the accelerate config and removing that file. axolotl handles much of that automatically. See https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on
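
For anyone unsure which file is meant here: a sketch, assuming the standard Hugging Face cache layout, of where the default accelerate config usually lives:

# File written by `accelerate config`; removing it lets axolotl supply these settings itself
# ~/.cache/huggingface/accelerate/default_config.yaml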

michaellin99999 commented 1 month ago

ok, is it the accelerate config causing the issue?

ehartford commented 1 month ago

Often, it is

michaellin99999 commented 1 month ago

We tried that and still get the same issue. We also went through https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on, but this requires axolotl cloud, and I'm using my own two 8xH100 clusters. Are there any scripts that work?

NanoCode012 commented 2 weeks ago

@michaellin99999 , hey!

From my understanding, those scripts should work on any system, as Lambda just provides bare compute. Can you let us know if you still get this issue and how we can help solve it?