Open · williambarberjr opened this issue 4 months ago
We had the same issue; it appears that max_length is somehow hardcoded and does not pick up the value set in the yml file. Changing that value resolved the issue for us.
@williambarberjr you could probably pass max_length=8192 in the yml file:
datasets:
  - path: williambarberjr/L3_8B_Instruct_MarkdownToSummaryConvert
    type: chat_template
    chat_template: llama3
    max_length: 8192
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
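As a hedged aside, one quick way to confirm which length settings the yml actually contains (indentation mistakes can silently misplace keys). This only parses the file with PyYAML; it is not axolotl's own config loader, and the filename is an assumption:

```python
# Hypothetical quick check (not from the thread): print the length-related
# keys as plain YAML sees them, to rule out indentation/key-placement issues.
import yaml

with open("instruct-lora-8b.yml") as f:
    cfg = yaml.safe_load(f)

print("sequence_len:", cfg.get("sequence_len"))
for ds in cfg.get("datasets", []):
    print(ds.get("path"), "-> max_length:", ds.get("max_length"))
```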
If I remember correctly, I tried this and it didn't work for me, but it's possible I failed to rebuild the package before retrying. Regardless, for my next runs I'm likely going to stick with the script I have that prepares my data in type: input_output format, since I know that works, and I don't really use the --gradio option to test the model at the end; I've started to default to spinning up vLLM, and vLLM seems to apply the chat template correctly. So I have a workaround, but I wanted to put this issue out there so others are aware and maybe eventually we can get it fixed.
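Not part of the original comment, but one way to sanity-check what the llama3 chat template should produce, for comparison against the prepared data or vLLM's output. The base model name is an assumption (the gated meta-llama/Meta-Llama-3-8B-Instruct repo):

```python
# Sketch: render the chat template with transformers and count tokens.
from transformers import AutoTokenizer

# Assumed base model; requires access to the gated Llama 3 repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Convert this markdown to a summary..."},
    {"role": "assistant", "content": "Here is the summary..."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False)
ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(len(ids), "tokens after applying the chat template")
print(text[:500])
```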
Since #1818, the max_length is set to the sequence_len parameter.
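For context, the filter quoted later in the report has the signature drop_long_seq(sample, sequence_len=2048, min_sequence_len=2); conceptually it behaves roughly like the simplified sketch below (not the actual axolotl source), which is why whatever value sequence_len resolves to directly sets the cutoff:

```python
# Simplified illustration of a length filter keyed to sequence_len
# (a sketch, not the code in src/axolotl/utils/trainer.py).
def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2):
    # Keep the sample only if its tokenized length fits inside the window.
    return min_sequence_len <= len(sample["input_ids"]) <= sequence_len
```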
Expected Behavior
I expect input/output pairs that combine to a token length of less than 8192 after the chat template is applied to be retained in full as valid training data when examining the arrow files in /last_run_prepared/ after running
python -m axolotl.cli.preprocess instruct-lora-8b.yml
Current behaviour
Training data is being cut off at a max length of 2048.
Steps to reproduce
My yml sets sequence_len: 8192, but the logs keep printing max_input_len as having been set to 2048. Even when I alter the source code in src/axolotl/utils/trainer.py to hard-code max_input_len: 7192, change def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2) to def drop_long_seq(sample, sequence_len=7192, min_sequence_len=2), and the log printout confirms that max_input_len has been set to 7192 after rebuilding/reinstalling axolotl, the training data still gets cut off at a length of 2048 tokens. The issue persistently occurs when using these datasets/chat_template settings:

However, when I set my own custom chat template like this:
And have the jsonl data already prepped with all the correct beginning, ending, etc. chat template tokens, it doesn't cut the length off at a max of 2048. Here's how I'm printing out the prepared data to check whether the template looks correct. First I run
python -m axolotl.cli.preprocess instruct-lora-8b.yml
in the command line, then I run this Python code:
Thoughts on what might be causing this?
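A minimal sketch of this kind of inspection (not the exact script used in the report; the model name and the last_run_prepared layout are assumptions):

```python
# Rough sketch: decode the prepared arrow files and report token lengths.
from pathlib import Path

from datasets import load_from_disk
from transformers import AutoTokenizer

# Assumed base model; swap in whatever the yml points at.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# axolotl saves the preprocessed dataset under last_run_prepared/<hash>/.
prepared_dir = next(p for p in Path("last_run_prepared").iterdir() if p.is_dir())
ds = load_from_disk(str(prepared_dir))

lengths = [len(row["input_ids"]) for row in ds]
print(f"{len(ds)} samples, longest sample: {max(lengths)} tokens")  # 2048 here shows the cutoff
print(tokenizer.decode(ds[0]["input_ids"]))  # eyeball the applied chat template
```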
Config yaml
Possible solution
Tried several ideas above, including hard-coding some variables, to no avail. For whatever reason, the custom chat template approach described above doesn't reproduce the cut-off training data issue.
Which Operating Systems are you using?
Python Version
Python 3.10.14
axolotl branch-commit
main