boun-tabi-LMG / turkish-lm-tuner

Turkish LM Tuner
https://boun-tabi-lmg.github.io/turkish-lm-tuner/
MIT License
73 stars, 6 forks

Fix dataset_processor's tokenizer_function #38

Closed: gokceuludogan closed this issue 6 months ago

gokceuludogan commented 6 months ago

The tokenizer_function does not distinguish between model classes (BERT vs. T5) and applies the same procedure to both. However, T5 requires an [EOS] token, whereas BERT has no such token.
https://github.com/boun-tabi-LMG/turkish-lm-tuner/blob/3e97efddbec2a834b1e13cdfc3f9dec4f15b820a/turkish_lm_tuner/dataset_processor.py#L111-L125
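One possible way to make the distinction is to branch on whether the tokenizer defines an EOS token at all, instead of hard-coding the model class. The following is a minimal sketch under stated assumptions, not the repository's code or the fix that was eventually merged: `ToyTokenizer`, `append_eos`, and the `input_text` field are hypothetical names used for illustration, and whitespace splitting stands in for real subword tokenization. The only property it relies on is an `eos_token` attribute that is `None` when the tokenizer has no EOS token.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer; exposes an optional EOS token."""
    eos_token: Optional[str] = None

    def __call__(self, texts: List[str]) -> Dict[str, List[List[str]]]:
        # Whitespace splitting is a placeholder for real subword tokenization.
        return {"input_ids": [t.split() for t in texts]}


def append_eos(texts: List[str], tokenizer) -> List[str]:
    """Append the EOS token only when the tokenizer defines one."""
    eos = getattr(tokenizer, "eos_token", None)
    if eos is None:
        # BERT-like case: no EOS token, so the text is left untouched.
        return list(texts)
    # T5-like case: add EOS unless the text already ends with it.
    return [t if t.endswith(eos) else f"{t} {eos}" for t in texts]


def tokenizer_function(examples: Dict[str, List[str]], tokenizer):
    """Tokenize a batch, adding EOS only for tokenizers that have one."""
    # "input_text" is an assumed field name for this sketch.
    inputs = append_eos(examples["input_text"], tokenizer)
    return tokenizer(inputs)


t5_like = ToyTokenizer(eos_token="</s>")  # T5-style tokenizer with an EOS token
bert_like = ToyTokenizer()                # BERT-style tokenizer without one
batch = {"input_text": ["merhaba dünya"]}

t5_ids = tokenizer_function(batch, t5_like)["input_ids"]
bert_ids = tokenizer_function(batch, bert_like)["input_ids"]
```

With a real tokenizer, it is worth checking first whether it already appends the EOS token during encoding, since adding it a second time would duplicate it.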

gokceuludogan commented 6 months ago

Addressed in commit https://github.com/boun-tabi-LMG/turkish-lm-tuner/pull/36/commits/c90605b5b96ad9c8b3284034c4cbc7e4430ca39f.