Closed: fozziethebeat closed this issue 3 weeks ago
I'm having a similar issue with mistralai/Mistral-7B-Instruct-v0.3: tokenization goes wrong and most samples are dropped in the "Drop Samples with Zero Trainable Tokens" step, even after pulling the latest repo with the merged PR. No clue why this happens specifically with Mistral.
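For context, that drop step removes any sample whose labels are all the ignore index (-100). A minimal sketch of what the check amounts to, using a hypothetical sample dict rather than axolotl's actual data structures:

```python
def has_trainable_tokens(sample: dict) -> bool:
    """Return True if at least one label is not the ignore index (-100)."""
    return any(label != -100 for label in sample["labels"])


# A sample whose assistant turn was fully masked out has no trainable tokens
# and gets dropped during preprocessing.
fully_masked = {"labels": [-100, -100, -100]}
partly_masked = {"labels": [-100, 28705, 2]}

print(has_trainable_tokens(fully_masked))   # False -> dropped
print(has_trainable_tokens(partly_masked))  # True  -> kept
```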
I added a fix for this problem in this PR. Can you try adding another unit test similar to the phi-3.5 test I added and see whether the same behavior shows up?
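A rough sketch of the shape such a test could take (the helper below is a stand-in for the chat_template prompt strategy the phi-3.5 test goes through, so its name and masking behavior are assumptions, not the actual test from the PR):

```python
from transformers import AutoTokenizer


def tokenize_for_training(messages):
    """Placeholder for the chat_template prompt strategy call used in the
    phi-3.5 test; the real test should go through the same loader."""
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
    input_ids = tok.apply_chat_template(messages, tokenize=True)
    labels = list(input_ids)  # the real strategy masks non-assistant tokens
    return {"input_ids": input_ids, "labels": labels, "eos_id": tok.eos_token_id}


def test_last_eos_token_is_trainable():
    sample = tokenize_for_training(
        [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"},
        ]
    )
    # The conversation should end with Mistral's </s>, and that token must
    # not be masked, otherwise the sample has zero trainable EOS tokens.
    assert sample["input_ids"][-1] == sample["eos_id"]
    assert sample["labels"][-1] != -100
```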
Thank you for this
Hey @fozziethebeat, since the PR has been merged, should this issue be closed?
Yes~
Expected Behavior
When tokenizing a simple dataset (`fozziethebeat/alpaca_messages_2k_test`) using `microsoft/Phi-3.5-mini-instruct`, we should expect the last assistant turn and the end-of-turn tokens to all be included in the labels. We'd expect something like this, with each token shown as (label, token id):

Note the `<|end|>` token: (32007, 32007)
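As a quick way to see which ids the chat template produces, here is a small sketch using only transformers (the messages are made up; this only inspects token ids, not axolotl's label masking):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]

# Render the conversation with the model's chat template.
ids = tok.apply_chat_template(messages, tokenize=True)

# The assistant turn should be closed by the end-of-turn token <|end|>.
print(tok.convert_tokens_to_ids("<|end|>"))          # 32007
print(list(zip(tok.convert_ids_to_tokens(ids), ids))[-5:])
```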
Current behaviour
Due to the advanced functionality added in #1756, the last end-of-turn token is masked out. I think this is due to a new set of defaults for these advanced per-turn masking features conflicting with how phi-3.5 configures its end-of-turn and end-of-sentence tokens (they're different).
Currently we get:

Note the `<|end|>` token: (-100, 32007)
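To see the eot/eos mismatch concretely, one can compare the tokenizer's configured eos token with the `<|end|>` token the chat template uses to close turns; a minimal sketch, assuming only transformers:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# End-of-sentence token configured on the tokenizer.
print("eos:", tok.eos_token, tok.eos_token_id)

# End-of-turn token used by the chat template to close each turn.
print("eot:", "<|end|>", tok.convert_tokens_to_ids("<|end|>"))

# When these ids differ, per-turn masking that keys only on the eos id can
# fail to recognize <|end|> as an end-of-turn token and leave it at -100.
```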
Steps to reproduce

Tokenize the dataset with the config yaml below and inspect the labels for the final `<|end|>` token.

Config yaml
Possible solution
this pr
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main/3853ab7ae9220dfbd78cd628e54fde75fb89df97