ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

Correct data format for fine-tuning RUGPT3 models #109

Open Futyn-Maker opened 1 year ago

Futyn-Maker commented 1 year ago

Hello!

I'm learning how to fine-tune RuGPT3 models on my own dataset so that they generate similar texts. Is there any documentation that describes the correct dataset format and lists the special tokens?

My specific questions are as follows:

  1. My dataset contains both single-line and multi-line samples. How should I separate samples from one another, given that a newline seems to be treated as the default separator?
  2. All the texts in my corpus are of the same type (say, jokes) and cannot be grouped into broader topics. I want to generate new texts without supplying any specific input, i.e. without giving the beginning of a text. Should I prefix every sample in the dataset with a keyword such as "Анекдот" ("Joke"), e.g. "Анекдот: ", and then use that keyword as the prompt? If so, does the keyword need a special token?
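
To make the two questions concrete, here is a minimal sketch of one possible preprocessing step. It is not a documented ru-gpts format. It assumes that each sample can be wrapped in explicit start and end markers so that internal newlines no longer act as separators, and that `<s>` and `</s>` are the tokenizer's begin and end tokens; that should be checked against the actual tokenizer's special tokens. The helper names `format_sample` and `build_corpus` are hypothetical.

```python
from typing import Iterable, List

# Assumed delimiter tokens, not confirmed by the repository docs;
# verify them against the special tokens of the tokenizer you load.
BOS = "<s>"
EOS = "</s>"


def format_sample(text: str, keyword: str = "", bos: str = BOS, eos: str = EOS) -> str:
    """Wrap one sample in explicit delimiters, keeping its internal newlines."""
    body = text.strip()
    # Optional plain-text keyword prefix, e.g. "Анекдот: ".
    prefix = f"{keyword}: " if keyword else ""
    return f"{bos}{prefix}{body}{eos}"


def build_corpus(samples: Iterable[str], keyword: str = "") -> str:
    """Format every non-empty sample and join the results with newlines."""
    formatted: List[str] = [format_sample(s, keyword) for s in samples if s.strip()]
    return "\n".join(formatted)


if __name__ == "__main__":
    samples = ["A one-line joke.", "A joke\nthat spans\nthree lines."]
    print(build_corpus(samples, keyword="Анекдот"))
```

Under these assumptions, the generation prompt would be `<s>Анекдот: ` and the output would be cut at the first `</s>`. Whether the keyword works as ordinary text or needs its own special token is exactly what question 2 asks.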

I would be grateful for any information on the data format.