Open anhuong opened 4 weeks ago
Thanks for making a pull request! 😃 One of the maintainers will review and advise on the next steps.
Thank you @anhuong for this PR to fix the unit tests. Definitely good to have the get-checkpoint functions refactored!
EDIT: gradient_accumulation_steps = 1 looks good to me unless gradient_accumulation_steps > 1 is specifically tested.
@anhuong my understanding is that if you set save_strategy="epoch", then every epoch you are guaranteed to call a save; this is due to this line here. And as far as I know logging_strategy does not affect save_strategy, so I can't really understand why you have missing checkpoints.
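A rough paraphrase of that control flow (a toy sketch, not the actual transformers source): the default flow callback decides should_save at epoch end from save_strategy alone, and logging is a separate switch.

# Toy paraphrase of the epoch-end decision (not the real transformers code):
def on_epoch_end(save_strategy: str, logging_strategy: str) -> dict:
    control = {"should_save": False, "should_log": False}
    if save_strategy == "epoch":
        control["should_save"] = True   # saving decided only by save_strategy
    if logging_strategy == "epoch":
        control["should_log"] = True    # logging decided independently
    return control

print(on_epoch_end("epoch", "steps"))   # {'should_save': True, 'should_log': False}

The wrinkle, as worked out below, is that the trainer still has to actually reach that save call at the end of the epoch, which is where gradient accumulation comes in.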
@fabianlim I was curious whether the number of checkpoints saved gets affected when save_strategy="epoch" and gradient_accumulation_steps > micro-batches per epoch (sample size is 10 and micro-batches is 3). Do checkpoints still get saved after each epoch that performs an optimizer step (weight update)?
For example (testing with logging on each step):
python tuning/sft_trainer.py \
--model_name_or_path Maykeye/TinyLLama-v0 \
--training_data_path tests/data/twitter_complaints_small.jsonl \
--output_dir outputs/full-tuning \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--response_template "\n### Label:" \
--dataset_text_field "output" \
--torch_dtype "float32" \
--logging_strategy "steps" \
--logging_steps 1 \
--save_strategy "epoch"
The sample dataset size is 10 and per_device_train_batch_size is 4, hence there are ~3 micro-batches per epoch. With gradient_accumulation_steps = 4, the logs look like below and hence only 3 checkpoints are saved:
{"data": {"epoch": 1.0, "step": 1, "timestamp": "2024-10-31T21:44:08.164493", "value": 7.909}, "name": "training_loss"}
{"data": {"epoch": 1.33, "step": 2, "timestamp": "2024-10-31T21:44:08.529254", "value": 2.5304}, "name": "training_loss"}
{"data": {"epoch": 2.0, "step": 3, "timestamp": "2024-10-31T21:44:08.854672", "value": 5.1889}, "name": "training_loss"}
{"data": {"epoch": 2.67, "step": 4, "timestamp": "2024-10-31T21:44:09.377428", "value": 5.0882}, "name": "training_loss"}
{"data": {"epoch": 3.0, "step": 5, "timestamp": "2024-10-31T21:44:09.556685", "value": 2.3755}, "name": "training_loss"}
os.listdir(output): ['checkpoint-1', 'training_logs.jsonl', 'checkpoint-3', 'checkpoint-5']
Log with gradient_accumulation_steps = 1 and the rest of the parameters the same. It saves 5 checkpoints:
{"data": {"epoch": 0.33, "step": 1, "timestamp": "2024-10-31T21:50:10.247606", "value": 10.5645}, "name": "training_loss"}
{"data": {"epoch": 0.67, "step": 2, "timestamp": "2024-10-31T21:50:10.418498", "value": 9.8469}, "name": "training_loss"}
{"data": {"epoch": 1.0, "step": 3, "timestamp": "2024-10-31T21:50:10.589923", "value": 10.9906}, "name": "training_loss"}
{"data": {"epoch": 1.33, "step": 4, "timestamp": "2024-10-31T21:50:10.945404", "value": 9.9601}, "name": "training_loss"}
{"data": {"epoch": 1.67, "step": 5, "timestamp": "2024-10-31T21:50:11.114054", "value": 9.721}, "name": "training_loss"}
{"data": {"epoch": 2.0, "step": 6, "timestamp": "2024-10-31T21:50:11.286315", "value": 10.6751}, "name": "training_loss"}
{"data": {"epoch": 2.33, "step": 7, "timestamp": "2024-10-31T21:50:11.661657", "value": 10.2685}, "name": "training_loss"}
{"data": {"epoch": 2.67, "step": 8, "timestamp": "2024-10-31T21:50:11.837999", "value": 9.573}, "name": "training_loss"}
{"data": {"epoch": 3.0, "step": 9, "timestamp": "2024-10-31T21:50:12.014647", "value": 9.2539}, "name": "training_loss"}
{"data": {"epoch": 3.33, "step": 10, "timestamp": "2024-10-31T21:50:12.397118", "value": 9.781}, "name": "training_loss"}
{"data": {"epoch": 3.67, "step": 11, "timestamp": "2024-10-31T21:50:12.445898", "value": 9.574}, "name": "training_loss"}
{"data": {"epoch": 4.0, "step": 12, "timestamp": "2024-10-31T21:50:12.625669", "value": 9.8564}, "name": "training_loss"}
{"data": {"epoch": 4.33, "step": 13, "timestamp": "2024-10-31T21:50:12.993576", "value": 9.9461}, "name": "training_loss"}
{"data": {"epoch": 4.67, "step": 14, "timestamp": "2024-10-31T21:50:13.177089", "value": 9.5162}, "name": "training_loss"}
{"data": {"epoch": 5.0, "step": 15, "timestamp": "2024-10-31T21:50:13.375828", "value": 9.9178}, "name": "training_loss"}
os.listdir(output): ['checkpoint-3', 'checkpoint-6', 'checkpoint-9', 'training_logs.jsonl', 'checkpoint-12', 'checkpoint-15']
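A quick arithmetic sketch of why the two listings differ (my own illustration, not trainer output):

import math
samples, batch_size, num_epochs = 10, 4, 5
micro_batches_per_epoch = math.ceil(samples / batch_size)   # 3
# GA=1: every micro-batch is an optimizer step, so 3 * 5 = 15 steps in total and the
# epoch-end saves land on steps 3, 6, 9, 12, 15 -- the checkpoint names listed above.
# GA=4: the accumulation window (4 micro-batches) is larger than one epoch
# (3 micro-batches), so whether an epoch ends with an optimizer step -- and therefore
# a save -- depends entirely on the trainer's "last step in epoch" special case.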
@Abhishek-TAMU @anhuong I cannot reproduce the problem. I ran the exact same command that was provided above, and I have 5 checkpoints. According to my reading of the code for transformers==4.45.2 (https://github.com/huggingface/transformers/blob/v4.45.2/src/transformers/trainer.py#L2406): with steps_in_epoch=2, the second step (i.e. step=1) will set is_last_step_and_steps_less_than_grad_acc=True, so it will go into the if block and the optimizer will step. In other words, GA will only be over 2 steps; after the two steps of each epoch it will come to _maybe_log_save_evaluate, and in there the checkpoint will be saved per epoch.
Also, I cannot repro your loss; for small models I need to use a much higher learning rate like 1e-3 to go down to 2-ish over 5 epochs.
You can turn on TRANSFORMERS_VERBOSITY=info to see more verbose logs.
$ python tuning/sft_trainer.py --model_name_or_path Maykeye/TinyLLama-v0 --training_data_path tests/data/twitter_complaints_small.jsonl --output_dir outputs/full-tuning --num_train_epochs 5 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --learning_rate 1e-3 --response_template "\n### Label:" --dataset_text_field "output" --torch_dtype "float32" --use_flash_attn False --logging_strategy "steps" --logging_steps 1 --save_strategy "epoch"
....
***** Running training *****
Num examples = 10
Num Epochs = 5
Instantaneous batch size per device = 4
Training with DataParallel so batch size has been adjusted to: 8
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 4
Total optimization steps = 5
Number of trainable parameters = 4,621,504
0%| | 0/5 [00:00<?, ?it/s]/workspace/fms/run-benches/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
{'loss': 5.3779, 'grad_norm': 16.19341468811035, 'learning_rate': 0.0008, 'epoch': 1.0}
20%|█████████████████████████████████████▍ | 1/5 [00:01<00:05, 1.35s/it]Saving model checkpoint to outputs/full-tuning/checkpoint-1
Configuration saved in outputs/full-tuning/checkpoint-1/config.json
Configuration saved in outputs/full-tuning/checkpoint-1/generation_config.json
Model weights saved in outputs/full-tuning/checkpoint-1/model.safetensors
tokenizer config file saved in outputs/full-tuning/checkpoint-1/tokenizer_config.json
Special tokens file saved in outputs/full-tuning/checkpoint-1/special_tokens_map.json
/workspace/fms/run-benches/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
{'loss': 4.3423, 'grad_norm': 8.393450736999512, 'learning_rate': 0.0006, 'epoch': 2.0}
40%|██████████████████████████████████████████████████████████████████████████▊ | 2/5 [00:01<00:02, 1.48it/s]Saving model checkpoint to outputs/full-tuning/checkpoint-2
Configuration saved in outputs/full-tuning/checkpoint-2/config.json
Configuration saved in outputs/full-tuning/checkpoint-2/generation_config.json
Model weights saved in outputs/full-tuning/checkpoint-2/model.safetensors
tokenizer config file saved in outputs/full-tuning/checkpoint-2/tokenizer_config.json
Special tokens file saved in outputs/full-tuning/checkpoint-2/special_tokens_map.json
/workspace/fms/run-benches/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
{'loss': 3.2777, 'grad_norm': 9.944791793823242, 'learning_rate': 0.0004, 'epoch': 3.0}
60%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 3/5 [00:01<00:00, 2.16it/s]Saving model checkpoint to outputs/full-tuning/checkpoint-3
Configuration saved in outputs/full-tuning/checkpoint-3/config.json
Configuration saved in outputs/full-tuning/checkpoint-3/generation_config.json
Model weights saved in outputs/full-tuning/checkpoint-3/model.safetensors
tokenizer config file saved in outputs/full-tuning/checkpoint-3/tokenizer_config.json
Special tokens file saved in outputs/full-tuning/checkpoint-3/special_tokens_map.json
/workspace/fms/run-benches/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
{'loss': 3.0204, 'grad_norm': 7.56746244430542, 'learning_rate': 0.0002, 'epoch': 4.0}
80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 4/5 [00:01<00:00, 2.81it/s]Saving model checkpoint to outputs/full-tuning/checkpoint-4
Configuration saved in outputs/full-tuning/checkpoint-4/config.json
Configuration saved in outputs/full-tuning/checkpoint-4/generation_config.json
Model weights saved in outputs/full-tuning/checkpoint-4/model.safetensors
tokenizer config file saved in outputs/full-tuning/checkpoint-4/tokenizer_config.json
Special tokens file saved in outputs/full-tuning/checkpoint-4/special_tokens_map.json
/workspace/fms/run-benches/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
{'loss': 2.7926, 'grad_norm': 4.190311908721924, 'learning_rate': 0.0, 'epoch': 5.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 3.36it/s]Saving model checkpoint to outputs/full-tuning/checkpoint-5
Configuration saved in outputs/full-tuning/checkpoint-5/config.json
Configuration saved in outputs/full-tuning/checkpoint-5/generation_config.json
Model weights saved in outputs/full-tuning/checkpoint-5/model.safetensors
tokenizer config file saved in outputs/full-tuning/checkpoint-5/tokenizer_config.json
Special tokens file saved in outputs/full-tuning/checkpoint-5/special_tokens_map.json
Saving model checkpoint to outputs/full-tuning/checkpoint-5
Configuration saved in outputs/full-tuning/checkpoint-5/config.json
Configuration saved in outputs/full-tuning/checkpoint-5/generation_config.json
Model weights saved in outputs/full-tuning/checkpoint-5/model.safetensors
tokenizer config file saved in outputs/full-tuning/checkpoint-5/tokenizer_config.json
Special tokens file saved in outputs/full-tuning/checkpoint-5/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 2.3713, 'train_samples_per_second': 21.085, 'train_steps_per_second': 2.109, 'train_loss': 3.7621720314025877, 'epoch': 5.0}
@fabianlim the error only occurs on transformers v4.46, so agreed, I'm not sure how our configuration would cause the number of saved checkpoints to change when transformers upgrades... That makes sense that logging wouldn't affect the change; I thought it would be easier for us to read, and make more sense, to set logging_strategy and save_strategy to the same value. But I agree: I suspected it could have been something with the gradient accumulation that caused the issue, yet I see, Fabian, you're saying that save_strategy="epoch" should still save on each epoch... hmmm
Our unit tests passed with transformers v4.45 but only started failing with transformers v4.46, because it looks like something changed with how checkpoints are saved now, as described. You can see this recent run of unit tests that ran 8 hours ago: https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/11619798675/job/32360230590
I can recreate this unit test failure locally when running tox, but in a cluster I am not able to recreate the error. Even when specifying no GPUs on a pod, I do see 5 checkpoints getting created instead of the smaller number seen when running unit tests.
This solution is better in that it doesn't hard-code checkpoint-5, but it would still be good to check that 5 checkpoints exist. The hard-coding is also why only certain tests fail and not all of them; the FT ones, for example, don't fail because they just check that a checkpoint exists, not that a specific one does.
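For illustration, a minimal helper along those lines (the name and exact behavior are my assumption, not necessarily what the PR implements):

import os
import re

def get_highest_checkpoint(output_dir: str) -> str:
    # collect checkpoint-<n> directories, e.g. checkpoint-1, checkpoint-3, checkpoint-5
    ckpts = [d for d in os.listdir(output_dir) if re.fullmatch(r"checkpoint-\d+", d)]
    if not ckpts:
        raise FileNotFoundError(f"no checkpoint-* directories in {output_dir}")
    return max(ckpts, key=lambda d: int(d.rsplit("-", 1)[1]))

A test could then also count the checkpoint-* directories and assert the count equals the number of epochs, instead of asserting that a hard-coded checkpoint-5 exists.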
I verified when running unit tests that it all comes down to the gradient accumulation setting. When it is set to 1, the expected number of checkpoints is created. When GA>1, the number of checkpoints will be less than the number of epochs even though save_strategy="epoch" is set. But I could only reproduce this when running unit tests; when running on the cluster I get the expected number of checkpoints, which is very odd...
@anhuong I cannot reproduce this on transformers==4.46.1 in the cluster either. Can you try to do source /path/to/tox/env/bin/activate and then pip freeze inside to check the versions?
NOTE: when I run tox I see an upper bound on the transformers version, so I'm a little confused why this could be due to the new version.
py: install_package_deps> python -I -m pip install 'accelerate!=0.34,<1.1,>=0.20.3' 'datasets<3.0,>=2.15.0' 'numpy<2.0,>=1.26.4' 'peft<0.14,>=0.8.0' 'protobuf<6.0.0,>=5.28.0' 'sentencepiece<0.3,>=0.1.99' 'simpleeval<1.0,>=0.9.13' 'tokenizers<1.0,>=0.13.3' 'torch<2.5,>=2.2.0' 'tqdm<5.0,>=4.66.2' 'transformers<4.50,>4.41' 'trl<1.0,>=0.9.3'
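If it's easier than activating the env, a quick alternative to pip freeze is to run a short snippet with the tox environment's interpreter:

# prints the resolved versions of the packages under discussion
from importlib.metadata import version
for pkg in ("transformers", "accelerate", "trl", "torch", "peft"):
    print(pkg, version(pkg))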
Update: my test passed in .tox
$ tox --workdir $HOME/fms -e py \
-- tests/test_sft_trainer.py::test_resume_training_from_checkpoint
.pkg: _optional_hooks> python /workspace/.local/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: get_requires_for_build_sdist> python /workspace/.local/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta
.pkg: build_sdist> python /workspace/.local/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta
py: install_package> python -I -m pip install --force-reinstall --no-deps /workspace/fms/.tmp/package/4/fms_hf_tuning-2.1.1rc1.tar.gz
py: commands[0]> pytest tests/test_sft_trainer.py::test_resume_training_from_checkpoint
==================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.3, pluggy-1.5.0
cachedir: /workspace/fms/py/.pytest_cache
rootdir: /data/flim/fms-hf-tuning
configfile: pytest.ini
collected 1 item
tests/test_sft_trainer.py . [100%]
====================================================================================================== warnings summary ======================================================================================================
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/transformers/training_args.py:2041: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
warnings.warn(
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': dataset_text_field, max_seq_length, dataset_kwargs. Will not be supported from version '1.0.0'.
Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
warnings.warn(message, FutureWarning)
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:283: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:321: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:327: UserWarning: You passed a `dataset_kwargs` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
warnings.warn(
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:396: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
warnings.warn(
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:401: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
super().__init__(
tests/test_sft_trainer.py: 10 warnings
/workspace/fms/py/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/transformers/trainer.py:3347: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
tests/test_sft_trainer.py::test_resume_training_from_checkpoint
/workspace/fms/py/lib/python3.10/site-packages/transformers/trainer.py:3026: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint_rng_state = torch.load(rng_file)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================== 1 passed, 26 warnings in 10.10s ===============================================================================================
py: OK (17.06=setup[5.74]+cmd[11.32] seconds)
congratulations :) (17.13 seconds)
Sharing my test log with the library versions of accelerate, transformers, and trl in the test environment:
tox -e py -- tests/test_sft_trainer.py::test_resume_training_from_checkpoint
@fabianlim
Running tox -e py with GA=4, logging_steps=1, and logging_strategy not set, running test_sft_trainer.py only, I get 12 failed unit tests where the number of checkpoints does not match the number of epochs as expected:
======================================== short test summary info ========================================
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint_with_flag_true - assert 9.666666666666666 == (3.6666666666666665 + 5)
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint_with_flag_false - assert 0 == 1
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint_with_flag_checkpoint_path_lora - assert 29959084032.0 == 22686858240.0
FAILED tests/test_sft_trainer.py::test_run_causallm_pt_and_inference - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[default] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[custom_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[all_linear_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.jsonl] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.json] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.jsonl] - AssertionError: assert 4 == 6
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.json] - AssertionError: assert 4 == 6
================== 12 failed, 37 passed, 2 skipped, 336 warnings in 121.46s (0:02:01) ===================
You can see the checkpoints with print statements:
{'loss': 17.8934, 'grad_norm': 22.092439651489258, 'learning_rate': 1e-05, 'epoch': 1.67}
{'loss': 17.9788, 'grad_norm': 22.23282241821289, 'learning_rate': 8.535533905932739e-06, 'epoch': 3.67}
{'train_runtime': 1.7212, 'train_samples_per_second': 29.049, 'train_steps_per_second': 2.905, 'train_tokens_per_second': 1661.61, 'train_loss': 22.376200675964355, 'epoch': 3.67}
['checkpoint-1', 'training_logs.jsonl', 'checkpoint-0', 'checkpoint-2']
Compared to a run with "transformers>4.41,<4.46" (so the only diff between the two pip freezes is transformers), the only difference is in the checkpoint numbers, but the checkpoint-count assertion still fails for 9 tests.
======================================== short test summary info ========================================
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_pt_and_inference - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[default] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[custom_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[all_linear_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.jsonl] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.json] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.jsonl] - AssertionError: assert 4 == 6
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.json] - AssertionError: assert 4 == 6
=================== 9 failed, 40 passed, 2 skipped, 282 warnings in 118.76s (0:01:58) ===================
Printed checkpoints:
{'loss': 6.8061, 'grad_norm': 8.93274211883545, 'learning_rate': 1e-05, 'epoch': 1.0}
{'loss': 3.3769, 'grad_norm': 5.7982940673828125, 'learning_rate': 5e-06, 'epoch': 2.0}
{'loss': 3.3822, 'grad_norm': 4.714280605316162, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 1.8403, 'train_samples_per_second': 27.169, 'train_steps_per_second': 2.717, 'train_tokens_per_second': 1554.079, 'train_loss': 4.064878463745117, 'epoch': 3.0}
['checkpoint-1', 'training_logs.jsonl', 'checkpoint-5', 'checkpoint-3']
And so previously it only succeeded because checkpoint-5 existed, whereas now it is checkpoint-3, but the number of checkpoints remains the same.
Before, when I ran in the cluster, I was using a larger dataset with sample size 50 and did not see this behavior occur; 5 checkpoints were always saved.
My cluster runs:
$ python -m tuning.sft_trainer --model_name_or_path Maykeye/TinyLLama-v0 --training_data_path /app/twitter_complaints_small.json --output_dir /tmp/test-transformers-446-lora --num_train_epochs 5 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --learning_rate 1e-5 --response_template "\n### Label:" --dataset_text_field "output" --torch_dtype "float32" --logging_steps 1 --save_strategy "epoch" --use_flash_attn false --per_device_eval_batch_size 4 --weight_decay 0 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --include_tokens_per_second true --packing false --peft_method lora
# GA=1, dataset_sample_size=10
/tmp/test-transformers-446-lora-ga-1:
added_tokens_info.json checkpoint-15 checkpoint-6 training_logs.jsonl
checkpoint-12 checkpoint-3 checkpoint-9
# GA=2, dataset_sample_size=10
/tmp/test-transformers-446-lora-ga-2:
added_tokens_info.json checkpoint-1 checkpoint-3 checkpoint-4 checkpoint-5 training_logs.jsonl
# GA=4, dataset_sample_size=10
/tmp/test-transformers-446-lora-no-logging:
added_tokens_info.json checkpoint-0 checkpoint-1 checkpoint-2 training_logs.jsonl
# GA=4, dataset_sample_size=50
/tmp/test-transformers-446-lora-no-logging-50-samples:
added_tokens_info.json checkpoint-15 checkpoint-6 training_logs.jsonl
checkpoint-13 checkpoint-3 checkpoint-9
In addition, we see training end early when GA=4 and the dataset sample size is 10.
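The early stop is consistent with how the total step budget is computed (a sketch assuming the usual transformers arithmetic when --max_steps is not set):

import math
len_dataloader, GA, num_train_epochs = 3, 4, 5               # 10 samples / batch 4 -> 3 micro-batches
num_update_steps_per_epoch = max(len_dataloader // GA, 1)    # 3 // 4 = 0, clamped to 1
max_steps = math.ceil(num_train_epochs * num_update_steps_per_epoch)   # 5, as in the logs
# Whenever an epoch produces more than one optimizer step (a mid-epoch %GA hit plus the
# last-step exception), the 5-step budget is spent before 5 data epochs complete, so
# training stops early.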
@Abhishek-TAMU @anhuong I cannot reproduce the problem. I ran the exact same command https://github.com/foundation-model-stack/fms-hf-tuning/pull/383#issuecomment-2451103996, and I have 5 checkpoints. According to my reading of the code for transformers==4.45.2 (https://github.com/huggingface/transformers/blob/v4.45.2/src/transformers/trainer.py#L2406): 10 samples, GA=4, so steps_in_epoch=2 for each epoch; on the second step (i.e. step=1) it will set is_last_step_and_steps_less_than_grad_acc=True, so it will go into the if block and the optimizer will step. In other words, GA will only be over 2 steps; after the two steps of each epoch it will come to _maybe_log_save_evaluate, and in there the checkpoint will be saved per epoch.
@fabianlim With logging statements I found out that steps_in_epoch is 3, not 2 (10 samples, GA=4, batch_size=4). But num_batches = 2, since the subsequent variable values are remainder = 2 and total_updates = 1, and hence there are only 2 batch_samples in this loop, where step takes the values 0 and 1. As a result, is_last_step_and_steps_less_than_grad_acc does not become True for any step value in some epochs (3 out of 5), which is why self._maybe_log_save_evaluate is not called for those epochs and no checkpoint is saved for them.
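For contrast, a toy simulation of the pre-4.46 condition quoted above (my own sketch, not the transformers source) reproduces the 5 updates at epochs 1.0, 1.33, 2.0, 2.67, 3.0 and the checkpoint-1/3/5 listing from the GA=4 run earlier in the thread; in the 4.46 loop, which consumes each epoch in accumulation-sized chunks as described, the last-step exception is not reached in some epochs, so those epoch-end saves are skipped.

GA, steps_in_epoch, num_epochs, max_steps = 4, 3, 5, 5   # 10 samples, batch 4, 5 epochs
global_step, total_batched_samples, saves = 0, 0, []
for epoch in range(num_epochs):
    for step in range(steps_in_epoch):
        total_batched_samples += 1
        is_last_step_and_steps_less_than_grad_acc = (
            steps_in_epoch <= GA and (step + 1) == steps_in_epoch
        )
        if total_batched_samples % GA == 0 or is_last_step_and_steps_less_than_grad_acc:
            global_step += 1                          # optimizer step (weight update)
            print(f"update {global_step} at epoch {epoch + (step + 1) / steps_in_epoch:.2f}")
            if (step + 1) == steps_in_epoch:          # epoch end -> save_strategy="epoch" saves
                saves.append(f"checkpoint-{global_step}")
        if global_step >= max_steps:
            break
    if global_step >= max_steps:
        break
print(saves)   # ['checkpoint-1', 'checkpoint-3', 'checkpoint-5']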
Description of the change
Unit tests were failing on transformers v4.46 with errors like assert 9.666666666666666 == (3.6666666666666665 + 5), where the epochs being logged didn't match, and assertions on checkpoint-5, because only 3 checkpoints were being saved, not matching the number of epochs.
Because of this I updated our TrainingArguments parameters to use gradient_accumulation_steps=1, to create the number of checkpoints expected for the small dataset we are using, and logging_strategy="epoch", so that the loss logs print for each full epoch. ~When setting gradient_accumulation_steps>1 the loss values would be logged at fractions of epochs. I suspect that because the logging was set to partial epochs, the checkpoints being saved may have also been off.~ This is more consistent with the save_strategy="epoch" that we have set, and I removed logging_steps=1 since we are logging based on epochs.
The loss logging, together with a print statement I added for os.listdir(output_dir), showed that only 3 checkpoints were saved, including one for checkpoint-0, likely because of the partial epochs being logged.
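For reference, the TrainingArguments change being described looks roughly like this (the argument names are real transformers parameters; the output_dir and the surrounding test fixture are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs/test-run",       # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,       # was >1 in the failing configuration
    save_strategy="epoch",
    logging_strategy="epoch",            # log once per full epoch, matching save_strategy
)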
With these changes, the loss logging and os.listdir(output_dir) show the expected number of checkpoints.
I'm not sure how the number is determined for checkpoint-<number>, but it's better not to hardcode checkpoint-5, so instead I refactored our existing code to get the highest checkpoint and return it.
Related issue number
How to verify the PR
I verified that unit tests pass with transformers v4.45 and v4.46.
Was the PR tested