michaellin99999 opened 1 month ago
The same settings work in regular single-node training.
Settings in accelerate:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
This is the snippet for the multi-node worker (rank 1) settings:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 1
main_process_ip: 192.168.108.22
main_process_port: 5000
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
I recommend not using the accelerate config and removing that file. axolotl handles much of that automatically. See https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on
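If you do remove the saved accelerate config file, one alternative is to pass the multi-node topology directly as `accelerate launch` flags instead. This is only a sketch: the IP, port, and `your_config.yml` path are placeholders taken from the settings posted above, and your axolotl version may also expose its own launcher.

```shell
# Sketch: launch without a saved accelerate config, supplying the topology
# as CLI flags (values below mirror the configs posted in this issue).
# On the main node (machine_rank 0):
accelerate launch \
  --multi_gpu \
  --num_machines 2 \
  --num_processes 16 \
  --machine_rank 0 \
  --main_process_ip 192.168.108.22 \
  --main_process_port 5000 \
  -m axolotl.cli.train your_config.yml

# On the second node, run the same command with: --machine_rank 1
```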
OK, is the accelerate config causing the issue?
Often, it is
We tried that and still hit the same issue. We also went through https://axolotlai.substack.com/p/fine-tuning-llama-31b-waxolotl-on, but that requires Axolotl cloud; I'm using my own two 8xH100 clusters. Are there any scripts that work?
@michaellin99999 , hey!
From my understanding, those scripts should work on any system, as Lambda just provides bare compute. Can you let us know if you still get this issue and how we can help solve it?
Please check that this issue hasn't been reported before.
Expected Behavior
This issue should not occur, as the H100 definitely supports bf16.
Current behaviour
Outputs the error:

```
Value error, bf16 requested, but AMP is not supported on this GPU. Requires Ampere series or above.
```
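Since the error claims the GPU lacks AMP/bf16 support, it may help to run a quick check on every node to see what PyTorch itself reports for the local devices. This is a minimal diagnostic sketch of my own (not from the axolotl repo); it uses only standard `torch.cuda` calls.

```python
# Diagnostic sketch: report whether PyTorch sees the GPU and its bf16 support.
# Run this on each node of the cluster; on an H100 the compute capability
# should be 9.0 and bf16 should be reported as supported.
import torch


def report_amp_support() -> str:
    """Return a one-line summary of device 0's compute capability and bf16 support."""
    if not torch.cuda.is_available():
        return "CUDA not available"
    major, minor = torch.cuda.get_device_capability(0)
    bf16 = torch.cuda.is_bf16_supported()
    return (f"{torch.cuda.get_device_name(0)}: "
            f"compute capability {major}.{minor}, bf16 supported: {bf16}")


if __name__ == "__main__":
    print(report_amp_support())
```

If this prints a capability of 9.0 with bf16 supported but axolotl still raises the error, the problem is likely in how the launcher or config is detected on the remote node rather than in the hardware.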
Steps to reproduce
Follow the multi-node guide at https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/multi-node.qmd
Config yaml
Possible solution
No idea what is causing this issue.
Which Operating Systems are you using?
Python Version
3.11.9
axolotl branch-commit
none
Acknowledgements