PygmalionAI / data-toolbox

Our data munging code.
GNU Affero General Public License v3.0

multilingual #20

Closed g3434343 closed 1 year ago

g3434343 commented 1 year ago

Hi guys.

If I translate the datasets, will they work with Pygmalion? I want to translate the datasets into Portuguese.

TearGosling commented 1 year ago

Hi there!

In short, they should! Keep in mind, however, that LLaMA, GPT-NeoX, and similar models are trained mostly on English data, so their tokenizers also tend to be optimized for English sentences. For best results, it may be best to tokenize the dataset with a tokenizer from a model that speaks Portuguese and then feed it into a base model that was also trained on Portuguese. English models and tokenizers could work too; I suppose it depends on how much data you feed it. Good luck with this, whatever you do!
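To make the tokenizer point concrete, here is a rough, self-contained sketch. It is a toy and not the tokenizer of any real model: the tiny English-only vocabulary, the helper names (`greedy_tokenize`, `tokenize`, `fertility`), and the sample sentences are all made up for illustration. It only shows the general effect: when the vocabulary was built from English, Portuguese words tend to break into many more pieces per word.

```python
from typing import Callable, List, Set

# Hypothetical, English-centric toy vocabulary (an assumption for illustration,
# not taken from LLaMA, GPT-NeoX, or any other real model).
TOY_VOCAB: Set[str] = {"the", "cat", "sleeps", "on", "sofa"}


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split one word by longest match against vocab, falling back to single characters."""
    max_len = max(len(entry) for entry in vocab)
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink down to one character.
        for j in range(min(len(word), i + max_len), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def tokenize(text: str) -> List[str]:
    """Tokenize whitespace-separated words with the toy vocabulary."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(greedy_tokenize(word, TOY_VOCAB))
    return tokens


def fertility(text: str, tokenizer: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    return len(tokenizer(text)) / len(words)


english = "the cat sleeps on the sofa"
portuguese = "o gato dorme no sofá"

print("English tokens per word:   ", fertility(english, tokenize))
print("Portuguese tokens per word:", fertility(portuguese, tokenize))
```

To run the same check against a real model, you could pass that model's tokenizer in place of `tokenize`, for example the `tokenize` method of a tokenizer loaded with `transformers.AutoTokenizer.from_pretrained`. A noticeably higher tokens-per-word figure on your translated data than on the English original would suggest the tokenizer is a poor fit for Portuguese.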

And of course, if any bugs pop up, feel free to open an issue and/or PR, and we'll take a look.

g3434343 commented 1 year ago

Oh, ok, dear. I'll try switching to a Portuguese tokenizer, collect the datasets, translate a few of them, and see what I get. If I can make it work, I'll document the process.

TearGosling commented 1 year ago

Best of luck to you! For now, I'll close this issue.