VILA-Lab / ATLAS

A principled instruction benchmark on formulating effective queries and prompts for large language models (LLMs). Our paper: https://arxiv.org/abs/2312.16171
Apache License 2.0

Questions about general_dataset.json #11

Open Taekyo-Lee opened 3 weeks ago

Taekyo-Lee commented 3 weeks ago

Hello authors, I have a few questions about your `general_dataset.json`.

  1. Why didn't you include models other than GPT-4 and GPT-3.5?
  2. What are the specific versions of GPT-4 and GPT-3.5?
  3. Why do some questions appear repeatedly? For instance, the first 20 lines are the same question, "Who was the first person to climb Mount Everest?", repeated 10 times each for GPT-4 and GPT-3.5.
aidarmyrzakhan commented 2 weeks ago

Hi @Taekyo-Lee, thanks for your interest in our work.

  1. Why didn't you include models other than GPT-4 and GPT-3.5?

This JSON file is prepared specifically for instruction fine-tuning pretrained LLMs. Responses from other models are available in this GitHub repository. We include only GPT-4 and GPT-3.5 responses to ensure a higher-quality dataset: responses from smaller models often lack the depth and coherence needed for effective fine-tuning, which could compromise the dataset's overall quality. By focusing on these more advanced models, we aim to provide more reliable data for downstream fine-tuning.
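As a rough illustration, here is a minimal sketch of how such a file might be turned into (prompt, response) pairs for supervised fine-tuning; the field names `instruction` and `output` are assumptions about the schema, not something confirmed in this thread.

```python
import json

# Minimal sketch: load general_dataset.json and build (prompt, response)
# pairs for supervised instruction tuning. The field names "instruction"
# and "output" are assumptions about the schema, not confirmed here.
with open("general_dataset.json", encoding="utf-8") as f:
    data = json.load(f)

pairs = [(entry["instruction"], entry["output"]) for entry in data]
print(f"Loaded {len(pairs)} training pairs")
```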

  2. What are the specific versions of GPT-4 and GPT-3.5?

We collected responses using `gpt-4-1106-preview` and `gpt-3.5-turbo-1106`.
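For context, a hedged sketch of how responses could be collected with those model identifiers using the OpenAI Python SDK (v1.x); the authors' actual collection script is not shown here, and details such as the prompt framing are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def collect_response(question: str, model: str) -> str:
    """Query one model for one question; prompt framing is an assumption."""
    completion = client.chat.completions.create(
        model=model,  # "gpt-4-1106-preview" or "gpt-3.5-turbo-1106"
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content
```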

  3. Why do some questions appear repeatedly? For instance, the first 20 lines are the same question, "Who was the first person to climb Mount Everest?", repeated 10 times each for GPT-4 and GPT-3.5.

As mentioned, this file is designed for instruction tuning. By generating 10 responses per question from both GPT-4 and GPT-3.5, we aim to increase the dataset's scale, richness, and variability, so that models fine-tuned on it can handle a wider range of possible inputs and scenarios.
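To check that structure concretely, one could count responses per (question, model) pair; again, the field names below are assumptions about the JSON schema:

```python
import json
from collections import Counter

with open("general_dataset.json", encoding="utf-8") as f:
    data = json.load(f)

# Count entries per (question, model) pair; "instruction" and "model"
# are assumed field names. Each pair should appear about 10 times.
counts = Counter((e["instruction"], e["model"]) for e in data)
print(counts.most_common(3))
```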