ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0
2.08k stars, 444 forks

Examples of input for training. #30

Closed: jaywalkingbackwards closed this issue 3 years ago

jaywalkingbackwards commented 3 years ago

Is there a script or function for preprocessing the text data? Is it okay to use a training file that looks like "abc"\n "def"\n "ghi"\n, or should it be something like {"text":"abc"}\n {"text":"def"}\n {"text":"ghi"}\n?

So, can it be raw text separated by "\n", or should I convert it into JSONL with a single "text" field? I've seen the note "We support three file formats.." but can't find any examples or preprocessors.
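
To make the second option concrete, here is a minimal sketch of the conversion I have in mind, assuming one sample per line and a single "text" field. The function name text_to_jsonl and the file names are my own placeholders, not anything from the repo, and I don't know whether the training scripts actually expect this layout:

```python
import json


def text_to_jsonl(src_path, dst_path, field="text"):
    """Write each non-empty line of src_path as one JSON object per line.

    Hypothetical helper: the field name "text" is an assumption about
    what the training code might want, not a documented requirement.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            # ensure_ascii=False keeps Cyrillic readable in the output file
            dst.write(json.dumps({field: line}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Example call with placeholder file names:
# text_to_jsonl("train.txt", "train.jsonl")
```

With a train.txt containing abc, def and ghi on separate lines, this would write three lines of the form {"text": "abc"} to train.jsonl.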

Thanks for the help!