Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

LIMA multiturn dialogues not working correctly? #1504

Open Nanayeb34 opened 3 months ago

Nanayeb34 commented 3 months ago

It was stated that in order to use the follow-up questions in the LIMA multi-turn dialogues, you have to set --data.include_multiturn_conversations True. I set that flag and compared the generated data with the original dataset. It seems only the first instruction-response pair of each conversation is selected; the follow-up pairs are not included in the generated JSON.

Steps to reproduce the dataset creation:

import json
from typing import List

from datasets import load_dataset


def format_dataset(dataset_partition: dict, include_multi_turn_conversations: bool) -> List[dict]:
    # Each consecutive (instruction, response) pair in a conversation
    # becomes one Alpaca-style record.
    formatted_ds = []

    for entry in dataset_partition:
        convo = entry["conversations"]
        if include_multi_turn_conversations:
            # Walk the conversation two turns at a time so every follow-up
            # instruction-response pair is kept.
            for i in range(0, len(convo) - 1, 2):
                formatted_ds.append({"instruction": convo[i], "input": "", "output": convo[i + 1]})
        else:
            # Keep only the first instruction-response pair.
            formatted_ds.append({"instruction": convo[0], "input": "", "output": convo[1]})

    return formatted_ds


lima = load_dataset("GAIR/lima", token=...)  # HF access token elided in the original post
formatted_ds = format_dataset(lima["train"], include_multi_turn_conversations=True)
with open("new_lima_ds.json", "w") as f:
    json.dump(formatted_ds, f, indent=4)
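
As a quick sanity check (reusing the lima and formatted_ds objects from the snippet above), the number of formatted pairs should exceed the number of conversations if the follow-up turns were actually expanded:

n_convos = len(lima["train"])
n_pairs = len(formatted_ds)
# With include_multi_turn_conversations=True, every follow-up pair adds an
# extra record, so n_pairs should be greater than n_convos.
print(f"{n_convos} conversations -> {n_pairs} instruction-response pairs")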

You can find the generated file here: new_lima_ds.json.

I am curious to know whether --data.include_multiturn_conversations True actually works, and what the expected output should be, because as far as I can tell it does not include the follow-up instruction-response pairs.

rasbt commented 3 months ago

Thanks for raising this; it's something to look into. Could you print out the data inputs that are fed to the LLM so we can see the issue more clearly? On that note, I also think that not all of the entries have multi-turn answers.

Nanayeb34 commented 2 months ago

Hi @rasbt. Thanks for following up on this. Yes, not all the entries have multi-turn answers; according to the LIMA paper, the last 30 entries in the dataset are the ones with multi-turn answers.
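
For reference, one way to confirm which entries contain follow-up turns (reusing the lima object from the reproduction snippet above) is to look at the conversation lengths:

# Conversations with more than two turns contain follow-up instruction-response pairs.
multi_turn_idx = [i for i, e in enumerate(lima["train"]) if len(e["conversations"]) > 2]
print(f"{len(multi_turn_idx)} multi-turn conversations at indices {multi_turn_idx[:5]} ... {multi_turn_idx[-5:]}")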

When you mention printing out the data inputs fed to the LLM, do you mean printing out some samples while I am running the command below?

litgpt finetune lora \
  --data LIMA \
  --data.include_multiturn_conversations True \
  --checkpoint_dir "/content/lima"
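
In case it helps, here is a rough sketch of how the prepared samples could be printed directly from the datamodule, assuming the litgpt.data.LIMA and litgpt.tokenizer.Tokenizer interfaces expose the connect/prepare_data/setup/train_dataloader hooks used below (method names and arguments may differ between litgpt versions):

from litgpt.data import LIMA
from litgpt.tokenizer import Tokenizer

# The gated GAIR/lima dataset may require an HF access token (e.g. via the HF_TOKEN env var).
data = LIMA(include_multiturn_conversations=True)
tokenizer = Tokenizer("/content/lima")  # same checkpoint_dir as in the finetune command
data.connect(tokenizer=tokenizer, batch_size=1, max_seq_length=512)
data.prepare_data()
data.setup()

# Decode and print a few training samples to see what the LLM actually receives.
for i, batch in enumerate(data.train_dataloader()):
    print(tokenizer.decode(batch["input_ids"][0]))
    print("-" * 80)
    if i >= 2:
        break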