axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Prediction (Table) Artifacts for MLflow Logger Not Reporting Any Results #1524

Open

DavidFarago commented 7 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

A Prediction (Table) artifact with ... should appear under "Evaluation" in MLflow, see https://github.com/OpenAccess-AI-Collective/axolotl/issues/1505 and https://github.com/OpenAccess-AI-Collective/axolotl/issues/490.

Current Behavior

No Prediction (Table) artifacts appear under "Evaluation" in MLflow.

Steps to reproduce

1) Add the following to your axolotl.yaml:

wandb_mode: disabled

mlflow_tracking_uri: http://127.0.0.1:5000/
mlflow_experiment_name: mistral_lora_intent_plus_qualification
hf_mlflow_log_artifacts: true
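For reference, MLflow also reads the tracking URI from the `MLFLOW_TRACKING_URI` environment variable, so the same target can be set outside the YAML. A minimal stdlib-only sketch (the URI simply mirrors the config above):

```python
import os

# Point MLflow at the local tracking server (same URI as in axolotl.yaml).
# MLflow's client falls back to this variable when no tracking URI is
# passed explicitly via mlflow.set_tracking_uri().
os.environ["MLFLOW_TRACKING_URI"] = "http://127.0.0.1:5000/"

print(os.environ["MLFLOW_TRACKING_URI"])
```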

2) Start the MLflow server on RunPod:

pip install mlflow
mkdir -p /workspace/mlflow
echo "Starting mlflow server"
tmux kill-window -t mlflow || true
tmux new-window \
    -n mlflow \
    -c /workspace/mlflow \
    "mlflow ui --host 127.0.0.1 --port 5000; bash"

echo "Enable port forwarding with:"
echo "ssh -L 5000:127.0.0.1:5000 -p $RUNPOD_TCP_PORT_22 root@$RUNPOD_PUBLIC_IP -N"
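Before launching training, it can help to confirm the MLflow server is actually reachable on the configured port. A small stdlib-only check (host and port are just the values used above):

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # The MLflow UI started in the tmux window above listens on 127.0.0.1:5000.
    print("mlflow reachable:", is_port_open("127.0.0.1", 5000))
```

If this prints `False`, training would still start, but the MLflow logger has nothing to talk to.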

3) Start finetuning:

cd /workspace/axolotl
accelerate launch -m axolotl.cli.train /workspace/${MODEL_REPO}/axolotl.yml

Config yaml

base_model: # private model
base_model_config: # private model

model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  # several private datasets
dataset_prepared_path: last_run_prepared-sharegpt
val_set_size: 0.05
output_dir: ./qlora-out-sharegpt

adapter: qlora
lora_model_dir:

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_mode: disabled

mlflow_tracking_uri: http://127.0.0.1:5000
mlflow_experiment_name: mistral_lora_intent_plus_qualification
hf_mlflow_log_artifacts: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 10
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

I am now looking into adding some test cases to help debug and avoid regressions.

Python Version

3.10

axolotl branch-commit

main/132eb74

winglian commented 7 months ago

I believe you need to set something like

eval_table_size: 5
eval_table_max_new_tokens: 128
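For context, the eval table is essentially a small grid of prompt/prediction rows, and `eval_table_size` caps how many rows get logged. A hypothetical stdlib sketch of that capping (illustrative only, not axolotl's actual callback code):

```python
def build_eval_table(rows, eval_table_size=5):
    """Keep at most eval_table_size rows, mirroring how the eval table is capped."""
    header = ["prompt", "prediction"]
    return [header] + [[r["prompt"], r["prediction"]] for r in rows[:eval_table_size]]


rows = [{"prompt": f"q{i}", "prediction": f"a{i}"} for i in range(8)]
table = build_eval_table(rows)
print(len(table) - 1)  # number of data rows, capped at eval_table_size
```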
DavidFarago commented 5 months ago

Thanks, @winglian.

I added

eval_table_size: 5
eval_table_max_new_tokens: 128

and also needed to add eval_sample_packing: false.

Now training aborts at the first evaluation with the following error message (the INFO line occurring about 100 times):

[2024-04-19 13:52:43,807] [INFO] [axolotl.monkeypatch.mistral._prepare_decoder_attention_mask:113] [PID:4998] [RANK:0] skipping sliding window mask, not broadcastable with attention mask                                                                                                                                                        
  2%|██▎                                  | 3/167 [00:42<38:34, 14.11s/it]
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)                                                                                                                                                    
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 163, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2256, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2640, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3473, in evaluate
    self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_callback.py", line 396, in on_evaluate
    return self.call_event("on_evaluate", args, state, control, metrics=metrics)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_callback.py", line 414, in call_event
    result = getattr(callback, event)(
  File "/workspace/axolotl/src/axolotl/utils/callbacks/__init__.py", line 737, in on_evaluate
    log_table_from_dataloader("Eval", eval_dataloader)
  File "/workspace/axolotl/src/axolotl/utils/callbacks/__init__.py", line 727, in log_table_from_dataloader
    tracking_uri = AxolotlInputConfig(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/pydantic/main.py", line 171, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for AxolotlInputConfig
  Value error, please set only one of gradient_accumulation_steps or batch_size [type=value_error, input_value={'type_of_model': 'Mistra..._packing_eff_est': 0.97}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/value_error
  1%|          | 1/114 [07:39<14:24:46, 459.17s/it]                                 
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python3', '-m', 'axolotl.cli.train', '/workspace/mistral_lora_intent_plus_qualification/axolotl.yml']' returned non-zero exit status 1.
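The pydantic error above boils down to a mutually-exclusive option check on the config. The underlying logic can be reproduced with a plain-Python sketch (names are illustrative, not axolotl's actual validator):

```python
def check_exclusive(gradient_accumulation_steps=None, batch_size=None):
    """Raise if both options are set, mimicking the AxolotlInputConfig check."""
    if gradient_accumulation_steps is not None and batch_size is not None:
        raise ValueError(
            "please set only one of gradient_accumulation_steps or batch_size"
        )
    return gradient_accumulation_steps, batch_size


# Setting only one of the two passes validation:
print(check_exclusive(gradient_accumulation_steps=4))  # → (4, None)
```

The surprising part here is that the user's YAML sets only `gradient_accumulation_steps`, so the failing config is presumably one assembled internally by the eval-table callback rather than the user's own.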