axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Enable Ascend NPU support #1758

Open MengqingCao opened 1 month ago

MengqingCao commented 1 month ago

Description

Enable the Ascend NPU backend for finetuning, inference, and the Gradio web UI. Main changes:

  1. Abstract the device handling so that additional backends can plug in; Ascend NPU is a good example.
  2. Allow Ascend NPU users to use axolotl for LLM finetuning and inference.
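
The device-abstraction idea in point 1 can be sketched as a small backend registry. This is a hypothetical illustration, not the actual axolotl API: the names `register_backend` and `detect_device` are made up, and only the probing pattern reflects how a new accelerator like Ascend NPU could plug in without touching call sites.

```python
# Hypothetical sketch of device abstraction (not the actual axolotl code):
# each backend registers a probe; call sites just ask detect_device().
BACKENDS = {}


def register_backend(name):
    """Decorator that registers an availability probe under a backend name."""
    def wrap(probe):
        BACKENDS[name] = probe
        return probe
    return wrap


@register_backend("cuda")
def _cuda_available():
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False


@register_backend("npu")
def _npu_available():
    try:
        import torch_npu  # noqa: F401  # Ascend NPU plugin for PyTorch
        return True
    except ImportError:
        return False


def detect_device() -> str:
    """Return the first available backend, falling back to CPU."""
    for name, probe in BACKENDS.items():
        if probe():
            return name
    return "cpu"
```

With this shape, adding a new accelerator is one `@register_backend(...)` probe rather than another branch inside every loading function.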

Example

# preprocess datasets - optional but recommended
ASCEND_RT_VISIBLE_DEVICES=0 python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

# finetune lora
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

# inference
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out"

# gradio
accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
    --lora_model_dir="./lora-out" --gradio

Screenshots

NPU supported CLI inference

axolotl_cli_chat

NPU supported Gradio webui inference

axolotl_cli_chat_gradio

Config

lora.yml

base_model: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: true
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path:
val_set_size: 0.02
adapter: lora
lora_model_dir:
sequence_len: 1024
sample_packing: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
output_dir: ./outputs/lora-out
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_torch
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
float32: true
bf16: false
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank: 0
logging_steps: 1
xformers_attention:
flash_attention: false
gptq_groupsize:
s2_attention:
gptq_model_v1:
warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
MengqingCao commented 1 month ago

Good day! @winglian I tried to create a ModelKwargs class, but alongside the modification of model_kwargs there are many other operations, such as patching and creating models, and their conditional logic seems inseparable.

Thus, I finally refactored the whole load_model function into a ModelLoader class. All the operations in the original load_model function have been placed in several member functions, following the original logical order.

This introduces a lot of changes, but it makes the model loading pipeline clearer. Moreover, changes to member variables such as model_kwargs are now more visible. However, I am not sure whether the current function naming and pipeline split are completely reasonable.

Please review the latest code and give me some suggestions. Thanks a lot!
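
The refactor described above can be sketched roughly as follows. This is an illustrative mock, not the PR's actual code: the method names and the trivial bodies are invented, and only the structure (one member function per stage, shared `model_kwargs` state, same logical order as the old `load_model`) reflects the description.

```python
class ModelLoader:
    """Illustrative sketch (not the actual PR code): each stage of the
    old load_model function becomes a member method, and shared state
    like model_kwargs is mutated step by step."""

    def __init__(self, cfg):
        self.cfg = cfg
        self.model_kwargs = {}
        self.stages = []  # record execution order for demonstration

    def apply_patches(self):
        self.stages.append("apply_patches")

    def set_device_map(self):
        self.stages.append("set_device_map")
        self.model_kwargs["device_map"] = self.cfg.get("device_map", "auto")

    def set_quantization_config(self):
        self.stages.append("set_quantization_config")
        if self.cfg.get("load_in_8bit"):
            self.model_kwargs["load_in_8bit"] = True

    def build_model(self):
        self.stages.append("build_model")
        # stand-in for the actual model construction
        return {"kwargs": dict(self.model_kwargs)}

    def load(self):
        # same logical order as the original load_model function
        self.apply_patches()
        self.set_device_map()
        self.set_quantization_config()
        return self.build_model()
```

Splitting the stages this way makes it obvious where `model_kwargs` is modified, which is exactly the visibility benefit mentioned above.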

MengqingCao commented 1 month ago

Hi @winglian, could you help review the latest code in this PR? Let me know if the breaking changes introduced by the refactoring are not what you want.

Just FYI, I accidentally deleted the original commit; it can be found in this branch.