fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License

</s> or eos needed for other base models? #22

Closed: zidsi closed this issue 5 months ago

zidsi commented 5 months ago

Compliments for the great work. I'm experimenting with reproducing your results using a "small" base model (TinyLlama) on custom low-resource language pairs, and the results after stages 1 and 2 are promising.

However, I see the model continue generating after the translation is complete. repetition_penalty doesn't seem to solve the issue.

I assume an EOS token should be added to the samples at the SFT stage, or am I missing something else? Any advice is welcome and will be much appreciated.

fe1ixxu commented 5 months ago

Thank you for your interest!

Yes, the special EOS token should be added during the SFT stage, as we do here: https://github.com/fe1ixxu/ALMA/blob/b7faf6cfcecd0aa4321e27eaa12eeb30bbf0daeb/utils/utils.py#L241.

Otherwise, the model wouldn't know where to end.
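For anyone hitting the same issue, here is a minimal sketch of the idea, assuming a Hugging Face tokenizer; the model name and the `build_sft_example` helper are illustrative, not ALMA's actual code (see the linked utils.py for the real implementation):

```python
# Minimal sketch: append the tokenizer's EOS token to the target side of
# each SFT example so the loss teaches the model where a translation ends.
# Model name and helper are illustrative, not ALMA's actual code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
)

def build_sft_example(prompt: str, translation: str) -> str:
    # Without the trailing EOS, the model never sees a stopping point during
    # training and keeps generating past the translation at inference time.
    return prompt + translation + tokenizer.eos_token
```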

zidsi commented 5 months ago

Thank you for the quick and precise response. I'll have to debug why get_preprocessed_data apparently failed to do so.
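For reference, this is the kind of check I'll use, a sketch assuming the pipeline yields Hugging Face-style tokenized dicts (the names are illustrative):

```python
# Debugging sketch: verify that a preprocessed sample actually ends with EOS.
# `sample` stands in for one example produced by the data pipeline
# (e.g. by get_preprocessed_data); the names are illustrative.
def ends_with_eos(sample: dict, tokenizer) -> bool:
    input_ids = sample["input_ids"]
    return len(input_ids) > 0 and input_ids[-1] == tokenizer.eos_token_id
```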