Closed zidsi closed 10 months ago
Thank you for your interest!
Yes, the special eos token should be added during the SFT stage, as we did here: https://github.com/fe1ixxu/ALMA/blob/b7faf6cfcecd0aa4321e27eaa12eeb30bbf0daeb/utils/utils.py#L241.
Otherwise, the model wouldn't know where to end.
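The idea can be sketched as follows. This is a minimal, hypothetical illustration of appending EOS during SFT preprocessing, not the repo's actual code (see the linked `utils.py` for that); `EOS_ID`, `append_eos`, and the hard-coded id are assumptions for the sake of the example:

```python
EOS_ID = 2  # hypothetical EOS token id (e.g. Llama's </s>)

def append_eos(input_ids, labels, max_length):
    """Append EOS to a tokenized SFT sample if there is room.

    Each target sequence must end with the EOS token so the model
    learns where to stop; otherwise it keeps generating past the
    translation at inference time.
    """
    if len(input_ids) < max_length:
        input_ids = input_ids + [EOS_ID]
        labels = labels + [EOS_ID]
    return input_ids, labels

# Example: a sample tokenized without EOS gets it appended.
ids, lbls = append_eos([101, 205, 307], [101, 205, 307], max_length=512)
```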
Thank you for the quick and precise response. I'll have to debug why get_preprocessed_data apparently failed to do so.
Compliments on the great work. I'm experimenting with reproducing the results using a "small" base model (TinyLlama) with custom low-resource language pairs, and the results after stages 1 and 2 are promising.
However, I see the model continue generating after the translation is complete; repetition_penalty doesn't seem to solve the issue.
I assume that an EOS token should be added to the samples at the SFT stage, or am I missing something else? Any advice would be much appreciated.