How much CPO data set is expected to be needed when creating a one-to-one machine translator?

fe1ixxu / ALMA

State-of-the-art LLM-based translation models.

MIT License

How much CPO data set is expected to be needed when creating a one-to-one machine translator? #26

Closed: qwopqwop200 closed this issue 9 months ago

qwopqwop200 commented 9 months ago

Thank you for your amazing work.

I am thinking of using ALMA-R to create a bidirectional translator that supports only a single language pair. How much CPO data do you expect this to need?

Do I need 22k examples like ALMA-R, or would a smaller amount be sufficient?

fe1ixxu commented 9 months ago

Thank you for your interest!

If you want ALMA to support languages beyond German, Chinese, Czech, Russian and Icelandic (the ones ALMA originally supported), the best approach is to first fine-tune on monolingual data in your target language. If your target language is already one of those, a smaller dataset of around 2K CPO examples should be totally fine.

qwopqwop200 commented 9 months ago

I have already finished training ALMA based on ko-solar 10.7b, and now I just need to fine-tune it with CPO data. Would 2K be sufficient in that case?

fe1ixxu commented 9 months ago

Yes, it should be sufficient. But please be careful about the quality of the CPO data.
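To make the size and quality points concrete, here is a minimal sketch of how a roughly 2K preference set for one language pair might be assembled. It is not taken from the ALMA repository. It assumes each source sentence already has several candidate translations with a numeric quality score from some metric of your choice; the `Triplet` fields, the `min_margin` threshold, and the direction labels are all hypothetical names used for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Triplet:
    direction: str   # hypothetical label, e.g. "en-ko" or "ko-en"
    source: str
    chosen: str      # preferred translation
    rejected: str    # dis-preferred translation
    margin: float    # score gap between chosen and rejected


def build_triplet(direction, source, scored_candidates, min_margin=0.0):
    """Pick the best and worst candidate by score (higher is better).

    scored_candidates is a list of (translation, score) pairs. Returns None
    when there are fewer than two candidates or when the score gap is not
    larger than min_margin, since near-ties carry a weak preference signal.
    """
    if len(scored_candidates) < 2:
        return None
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
    margin = best_score - worst_score
    if best == worst or margin <= min_margin:
        return None
    return Triplet(direction, source, best, worst, margin)


def sample_balanced(triplets, total=2000, seed=0):
    """Sample up to `total` triplets, split evenly across directions."""
    rng = random.Random(seed)
    by_direction = {}
    for t in triplets:
        by_direction.setdefault(t.direction, []).append(t)
    if not by_direction:
        return []
    per_direction = total // len(by_direction)
    picked = []
    for items in by_direction.values():
        picked.extend(rng.sample(items, min(per_direction, len(items))))
    return picked
```

For a bidirectional single-pair model, `sample_balanced(triplets, total=2000)` would give about 1K triplets per direction when enough filtered data is available. Raising `min_margin` trades dataset size for a clearer preference signal, which is one way to act on the advice about data quality.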