Open DavidFarago opened 7 months ago
I believe you need to set something like
eval_table_size: 5
eval_table_max_new_tokens: 128
Thanks, @winglian.
I added
eval_table_size: 5
eval_table_max_new_tokens: 128
and also needed to add eval_sample_packing: false.
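For anyone reproducing this, a small pre-flight check on the loaded config can confirm that the three settings mentioned above are actually in effect before a long run starts. This is only a sketch: the key names come from this thread, while `eval_table_problems` and the plain-dict config are hypothetical stand-ins and not part of axolotl.

```python
def eval_table_problems(cfg: dict) -> list[str]:
    """Return human-readable problems with the eval-table settings in cfg.

    Hypothetical helper for illustration; cfg is a plain dict standing in
    for the parsed axolotl YAML.
    """
    problems = []
    # Both table settings must be positive integers for a table to be produced.
    for key in ("eval_table_size", "eval_table_max_new_tokens"):
        value = cfg.get(key)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{key} should be a positive integer, got {value!r}")
    # In this thread, sample packing also had to be disabled for evaluation.
    if cfg.get("eval_sample_packing") is not False:
        problems.append("eval_sample_packing should be false")
    return problems


good_cfg = {
    "eval_table_size": 5,
    "eval_table_max_new_tokens": 128,
    "eval_sample_packing": False,
}
incomplete_cfg = {"eval_table_size": 5}

good_report = eval_table_problems(good_cfg)
incomplete_report = eval_table_problems(incomplete_cfg)
```

An empty list means the settings from this comment are all present; anything else names the key that still needs attention.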
Now training aborts at the first evaluation with the following error (the INFO
message occurs about 100 times):
[2024-04-19 13:52:43,807] [INFO] [axolotl.monkeypatch.mistral._prepare_decoder_attention_mask:113] [PID:4998] [RANK:0] skipping sliding window mask, not broadcastable with attention mask
2%|██▎ | 3/167 [00:42<38:34, 14.11s/it]
Traceback (most recent call last):
File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/axolotl/src/axolotl/cli/train.py", line 59, in <module>
fire.Fire(do_cli)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/workspace/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
return do_train(parsed_cfg, parsed_cli_args)
File "/workspace/axolotl/src/axolotl/cli/train.py", line 55, in do_train
return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
File "/workspace/axolotl/src/axolotl/train.py", line 163, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in train
return inner_training_loop(
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2256, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2640, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3473, in evaluate
self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_callback.py", line 396, in on_evaluate
return self.call_event("on_evaluate", args, state, control, metrics=metrics)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_callback.py", line 414, in call_event
result = getattr(callback, event)(
File "/workspace/axolotl/src/axolotl/utils/callbacks/__init__.py", line 737, in on_evaluate
log_table_from_dataloader("Eval", eval_dataloader)
File "/workspace/axolotl/src/axolotl/utils/callbacks/__init__.py", line 727, in log_table_from_dataloader
tracking_uri = AxolotlInputConfig(
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/pydantic/main.py", line 171, in __init__
self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for AxolotlInputConfig
Value error, please set only one of gradient_accumulation_steps or batch_size [type=value_error, input_value={'type_of_model': 'Mistra..._packing_eff_est': 0.97}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.6/v/value_error
1%| | 1/114 [07:39<14:24:46, 459.17s/it]
Traceback (most recent call last):
File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
simple_launcher(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python3', '-m', 'axolotl.cli.train', '/workspace/mistral_lora_intent_plus_qualification/axolotl.yml']' returned non-zero exit status 1.
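One plausible reading of the traceback, not confirmed in this thread, is that the callback rebuilds `AxolotlInputConfig` from a config that has already been resolved. At that point both `gradient_accumulation_steps` and `batch_size` are populated, so the "set only one of" validator rejects it. The sketch below reproduces that failure mode with plain dictionaries. `validate_batch_settings` and `normalize` are hypothetical stand-ins rather than axolotl functions, and the numeric values are invented for illustration.

```python
def validate_batch_settings(cfg: dict) -> dict:
    """Hypothetical stand-in for a 'set only one of' input validator."""
    if cfg.get("gradient_accumulation_steps") and cfg.get("batch_size"):
        raise ValueError(
            "please set only one of gradient_accumulation_steps or batch_size"
        )
    return cfg


def normalize(cfg: dict) -> dict:
    """Hypothetical normalization step that derives batch_size after validation."""
    resolved = dict(cfg)
    if not resolved.get("batch_size"):
        resolved["batch_size"] = (
            resolved["micro_batch_size"] * resolved["gradient_accumulation_steps"]
        )
    return resolved


# The user-facing config sets only one of the two keys, so it validates.
user_cfg = {"micro_batch_size": 2, "gradient_accumulation_steps": 4}
validate_batch_settings(user_cfg)

# After normalization the config carries both keys.
resolved_cfg = normalize(user_cfg)

# Validating the resolved config a second time now fails.
try:
    validate_batch_settings(resolved_cfg)
    revalidation_error = None
except ValueError as exc:
    revalidation_error = str(exc)
```

If this reading is right, the callback would need to read the value it wants from the existing config object, or validate only the original user input, instead of re-validating the resolved dictionary.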
Expected Behavior
Prediction (table) artifacts with ... should appear under "Evaluation" in MLFlow; see https://github.com/OpenAccess-AI-Collective/axolotl/issues/1505 and https://github.com/OpenAccess-AI-Collective/axolotl/issues/490.
Current behaviour
No prediction (table) artifacts appear under "Evaluation" in MLFlow.
Steps to reproduce
1) Add the following to your axolotl.yaml:
2) Enable MLFlow on runpod:
3) Start finetuning:
Config yaml
Possible solution
I am now looking into adding some test cases to help debug and avoid regressions.
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main/132eb74