fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License
440 stars 35 forks

</s> or eos needed for other base models? #22

Closed zidsi closed 10 months ago

zidsi commented 10 months ago

Compliments on the great work. I'm experimenting with reproducing the approach using a "small" base model (TinyLlama) on custom low-resource language pairs, and the results after stages 1 and 2 are promising.

However, I see the model continue generating after the translation is complete. Setting repetition_penalty doesn't seem to solve the issue.

I assume an eos token should be added to the samples at the SFT stage, or am I missing something else? Any advice is welcome and will be much appreciated.

fe1ixxu commented 10 months ago

Thank you for your interest!

Yes, the special eos token should be added during the SFT stage, as we did here: https://github.com/fe1ixxu/ALMA/blob/b7faf6cfcecd0aa4321e27eaa12eeb30bbf0daeb/utils/utils.py#L241.

Otherwise, the model wouldn't know where to end.
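For illustration, here is a minimal sketch of that preprocessing idea: append the tokenizer's eos token to the target side before tokenization so the model learns to emit it and stop. The checkpoint name and variable names below are assumptions, not ALMA's exact code; see the linked `utils.py` line for the actual implementation.

```python
# Minimal sketch of EOS handling during SFT preprocessing.
# Checkpoint and variable names are illustrative, not ALMA's exact code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # assumed base model
)

def build_sft_example(prompt: str, target: str) -> dict:
    # Append the eos token string to the target so the model learns
    # to stop after producing the translation.
    text = prompt + target + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)
```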

zidsi commented 10 months ago

Thank you for the quick and precise response. I'll have to debug why get_preprocessed_data failed to do so (as it seems).
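One quick way to debug that is to check whether the tokenized samples actually end with the eos id. A hedged sketch, assuming the usual datasets layout; the field names below are assumptions, and if padding is applied the EOS may sit before trailing pad tokens:

```python
# Hypothetical sanity check: does a tokenized SFT sample end with EOS?
# "train" and "input_ids" are assumed dataset field names.
sample_ids = tokenized_dataset["train"][0]["input_ids"]
eos_id = tokenizer.eos_token_id
print("last tokens:", sample_ids[-5:], "expected eos_token_id:", eos_id)
assert sample_ids[-1] == eos_id, "EOS missing, check the preprocessing path"
```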