IntelLabs / fastRAG

Efficient Retrieval Augmentation and Generation Framework
Apache License 2.0

Finetuning example #21

Closed karrtikiyer-tw closed 9 months ago

karrtikiyer-tw commented 1 year ago

It would be greatly beneficial to have an example demonstrating how to use the fine-tuning script with a custom proprietary dataset. Thanks in advance.

karrtikiyer-tw commented 1 year ago

I see the command to fine-tune in the MODELS.MD FiD section; however, it would be great if an example were shown with a custom dataset, addressing the number of examples needed to train, etc.

karrtikiyer-tw commented 1 year ago

@peteriz , @mosheber , @danielfleischer : Can anyone help or advise here?

mosheber commented 1 year ago

Hi @karrtikiyer-tw! Thank you for bringing this up. We have expanded the description in the models.md file regarding the data format:

```json
{
  "id": 0,
  "question": "The question for the current example",
  "target": "Target answer",
  "answers": ["Target answer", "Possible answer #1", "Possible answer #2"],
  "ctxs": [
            {
                "title": "Title of passage #1",
                "text": "Context of passage #1"
            },
            {
                "title": "Title of passage #2",
                "text": "Context of passage #2"
            },
            ...
          ]
}
```

Thus, any dataset containing a triplet of (question, answers, passages) can be used.
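As a minimal sketch of that conversion, the snippet below builds records in the format shown above from (question, answers, passages) triplets. The record structure follows the example in this thread; the input data and the helper name `to_fid_example` are hypothetical, and you would substitute your own proprietary records.

```python
import json

def to_fid_example(idx, question, answers, passages):
    """Build one training record in the format shown above.

    `passages` is a list of (title, text) tuples; here the first
    answer is used as the target, which is an assumption -- pick
    whichever answer best suits your data.
    """
    return {
        "id": idx,
        "question": question,
        "target": answers[0],
        "answers": list(answers),
        "ctxs": [{"title": title, "text": text} for title, text in passages],
    }

# Hypothetical custom data; replace with your own triplets.
records = [
    ("What is the capital of France?",
     ["Paris", "Paris, France"],
     [("France", "Paris is the capital and largest city of France."),
      ("Paris", "Paris has been a major European city for centuries.")]),
]

dataset = [to_fid_example(i, q, a, p) for i, (q, a, p) in enumerate(records)]

# Write the converted dataset to disk for the fine-tuning script.
with open("train.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

The output file is a JSON list of such records; check models.md for the exact file layout (e.g. one JSON object per line vs. a single list) expected by the training command.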

karrtikiyer-tw commented 1 year ago

Thanks @mosheber, can you also advise on the volume of data typically needed for fine-tuning to yield decent results?

mosheber commented 1 year ago

@karrtikiyer-tw the training dataset for NQ in the original repository contains about 79k examples. Other versions of this dataset and its equivalents (such as TriviaQA) are of similar size. However, these are short-answer datasets, where answers are mainly entities, dates, and the like. Long-answer datasets such as ELI5 contain about 270k examples, which is significantly more.