VILA-Lab / ATLAS

A principled instruction benchmark on formulating effective queries and prompts for large language models (LLMs). Our paper: https://arxiv.org/abs/2312.16171
Apache License 2.0

Questions about general_dataset.json #11

Open Taekyo-Lee opened 3 weeks ago

Taekyo-Lee commented 3 weeks ago

Hello authors, I have a few questions about your `general_dataset.json`.

  1. Why didn't you include models other than GPT-4 and GPT-3.5?
  2. What are the specific versions of GPT-4 and GPT-3.5?
  3. Why do some questions appear repeatedly? For instance, the first 20 lines are the same question, "Who was the first person to climb Mount Everest?", repeated 10 times each for GPT-4 and GPT-3.5.
aidarmyrzakhan commented 2 weeks ago

Hi @Taekyo-Lee, thanks for your interest in our work.

  1. Why didn't you include models other than GPT-4 and GPT-3.5?

This JSON file is prepared specifically for instruction fine-tuning pretrained LLMs. Responses from other models are available in this GitHub repository. We include only GPT-4 and GPT-3.5 responses to ensure a higher-quality dataset: responses from smaller models often lack the depth and coherence needed for effective fine-tuning, which could compromise the dataset's overall quality. By focusing on these more advanced models, we aim to provide more reliable data for downstream fine-tuning.
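As a rough illustration, here is a minimal sketch of how such a file might be turned into (prompt, response) pairs for supervised fine-tuning; the field names `instruction` and `output` are assumptions about the schema, not something confirmed in this thread.

```python
import json

# Minimal sketch: load general_dataset.json and build (prompt, response)
# pairs for supervised instruction tuning. The field names "instruction"
# and "output" are assumptions about the schema, not confirmed here.
with open("general_dataset.json", encoding="utf-8") as f:
    data = json.load(f)

pairs = [(entry["instruction"], entry["output"]) for entry in data]
print(f"Loaded {len(pairs)} training pairs")
```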

  2. What are the specific versions of GPT-4 and GPT-3.5?

We collected responses using `gpt-4-1106-preview` and `gpt-3.5-turbo-1106`.
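For context, a hedged sketch of how responses could be collected with those model identifiers using the OpenAI Python SDK (v1.x); the authors' actual collection script is not shown here, and details such as the prompt framing are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def collect_response(question: str, model: str) -> str:
    """Query one model for one question; prompt framing is an assumption."""
    completion = client.chat.completions.create(
        model=model,  # "gpt-4-1106-preview" or "gpt-3.5-turbo-1106"
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content
```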

  3. Why do some questions appear repeatedly? For instance, the first 20 lines are the same question, "Who was the first person to climb Mount Everest?", repeated 10 times each for GPT-4 and GPT-3.5.

As mentioned, this file is designed for instruction tuning. By generating 10 responses per question from both GPT-4 and GPT-3.5, we aim to increase the dataset's scale, richness, and variability, so that models fine-tuned on it can handle a wider range of possible inputs and scenarios.
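To check that structure concretely, one could count responses per (question, model) pair; again, the field names below are assumptions about the JSON schema:

```python
import json
from collections import Counter

with open("general_dataset.json", encoding="utf-8") as f:
    data = json.load(f)

# Count entries per (question, model) pair; "instruction" and "model"
# are assumed field names. Each pair should appear about 10 times.
counts = Counter((e["instruction"], e["model"]) for e in data)
print(counts.most_common(3))
```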