axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Spawning multiple processes on card 0 when performing distributed training on a single machine with multiple cards? #1741

Open · MengqingCao opened this issue 4 months ago

MengqingCao commented 4 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Only the main process should run on card 0.

Current behaviour

When performing distributed training on a single machine with multiple cards (e.g., 2 cards), 2 processes are spawned on card 0. This frequently causes OOM errors.


Steps to reproduce

accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
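While the root cause is investigated, one way to sanity-check the per-rank environment is `CUDA_VISIBLE_DEVICES` remapping: a process launched with only card 1 visible sees it as logical device 0 and cannot allocate on physical card 0. A quick check (no GPU needed to see the env plumbing):

```shell
# Hedged workaround sketch, not a fix: restrict a process to physical
# card 1; inside the process it appears as logical device 0.
CUDA_VISIBLE_DEVICES=1 python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'
```

Note that `accelerate launch` normally manages device assignment itself, so this is only a diagnostic, not a recommended launch mode.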

Config yaml

base_model: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path:
val_set_size: 0.02
adapter: lora
lora_model_dir:
sequence_len: 512
sample_packing: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./lora-out
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 4
optimizer: adamw_torch
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
float16: true
bf16: false
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false
gptq_groupsize:
s2_attention:
gptq_model_v1:
warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"


### Possible solution

_No response_

### Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

### Python Version

3.10

### axolotl branch-commit

main

### Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
NanoCode012 commented 4 weeks ago

The issue could be due to some mistake/old code. There have been DeepSpeed fixes recently. If you're still interested, would you be able to try them out?

MengqingCao commented 3 weeks ago

> The issue could be due to some mistake/old code. There have been DeepSpeed fixes recently. If you're still interested, would you be able to try them out?

OK, I'll try them out and see if this fixes it.