hubenjm opened this issue 1 month ago (status: Open)
Hi, thanks for using VILA. If you follow the gsm8k-ScRel link in data_prepare/README.md, it takes you to the annotation file. Each instance in this file contains one "query" field and one "response" field. We simply convert each instance into the following format:
{'id': 0, 'question': 'Q:Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?\nA:', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72', 'image': []}
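For anyone trying to reproduce this, here is a minimal sketch of that conversion. It is inferred only from the single example above, not taken from the VILA code: the `Q:`/`A:` wrapping, the sequential `id`, and the empty `image` list are assumptions based on that one record, and the helper names `to_vila_record` and `convert_jsonl_lines` are made up for illustration. It also does not cover how the records are stored on disk afterwards.

```python
import json


def to_vila_record(idx, raw):
    """Map one gsm8k-ScRel instance ("query"/"response") to the format shown above.

    Assumption: the question is the query wrapped as "Q:<query>\nA:",
    as in the single example record; this is not confirmed by the VILA source.
    """
    return {
        "id": idx,
        "question": f"Q:{raw['query']}\nA:",
        "answer": raw["response"],
        "image": [],  # text-only sample, so no images
    }


def convert_jsonl_lines(lines):
    """Convert an iterable of JSONL lines into a list of formatted records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        records.append(to_vila_record(len(records), json.loads(line)))
    return records


# Tiny made-up sample in the same shape as train_use.jsonl
sample = ['{"query": "What is 1+1?", "response": "1+1 = <<1+1=2>>2\\n#### 2"}']
records = convert_jsonl_lines(sample)
print(records[0])
```

To run it on the real file, pass an open file handle for train_use.jsonl to `convert_jsonl_lines` instead of `sample`.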
Thanks for the clarification. Hopefully you could add these details to the README.md file in a future commit.
In https://github.com/Efficient-Large-Model/VILA/blob/main/llava/data/datasets_mixture.py#L171C5-L171C6 the math dataset is described as type 'vflan'. However, data_prepare/README.md does not make clear which dataset that corresponds to. I'm guessing it is GSM8K-ScRel-SFT. But the annotation file https://github.com/OFA-Sys/gsm8k-ScRel/blob/main/data/train_use.jsonl does not work directly with the LazyVFlanDataset class (https://github.com/Efficient-Large-Model/VILA/blob/d7d54bc4ca1e582f59516ba2f94a0217ad2430a0/llava/data/dataset.py#L1313), which expects multiple .pkl files to live inside the data_path directory. Could you elaborate on how you converted the original train_use.jsonl file into .pkl files, or whether some other approach was used?