UMass-Foundation-Model / FlexAttention

Apache License 2.0

[Question] About the training data llava_v1_5_mix665k_clean_ok.json #4

Open jungle-gym-ac opened 1 month ago

jungle-gym-ac commented 1 month ago

Question

Hi, great work! I noticed that the training data path in the training script you provided is llava_v1_5_mix665k_clean_ok.json, not the original llava_v1_5_mix665k.json. Did you do any data cleaning or post-processing on the original JSON provided by LLaVA? Thank you!

senfu commented 1 month ago

We are still organizing our training data and will release it soon. Basically, we just delete the samples that have an invalid image (i.e., an image that is empty for some reason) and rewrite the text-only samples so that they fit into our training pipeline. The data we are using is mostly the same as the original llava_v1_5_mix665k.json.
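
Until the cleaned file is released, the filtering described above could be approximated with a sketch like the following. This is not the authors' script. It assumes the usual LLaVA layout (a JSON list of records, where image samples carry an "image" key holding a path relative to an image root), and it treats "invalid" as "file missing or zero bytes", which is only a guess at what "empty image" means here. The names `has_usable_image`, `split_records`, and `clean_file` are hypothetical. The text-only rewrite is not specified in this thread, so text-only records are passed through unchanged.

```python
import json
import os


def has_usable_image(record, image_root):
    """Return True if the record's image file exists and is not zero bytes.

    Assumption: "invalid image" means a missing or empty file.
    """
    path = os.path.join(image_root, record["image"])
    return os.path.isfile(path) and os.path.getsize(path) > 0


def split_records(records, image_root):
    """Split records into (kept, dropped) lists, preserving order."""
    kept, dropped = [], []
    for record in records:
        if "image" not in record:
            # Text-only sample: kept as-is. The pipeline-specific rewrite
            # mentioned in the thread is unknown and not attempted here.
            kept.append(record)
        elif has_usable_image(record, image_root):
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


def clean_file(src_json, dst_json, image_root):
    """Read src_json, write the kept records to dst_json, return the counts."""
    with open(src_json, encoding="utf-8") as f:
        records = json.load(f)
    kept, dropped = split_records(records, image_root)
    with open(dst_json, "w", encoding="utf-8") as f:
        json.dump(kept, f)
    return len(kept), len(dropped)
```

Calling `clean_file("llava_v1_5_mix665k.json", "cleaned.json", image_root)` with your own image directory writes the filtered list and returns how many records were kept and dropped. The output name `cleaned.json` is a placeholder, and the result is not guaranteed to match llava_v1_5_mix665k_clean_ok.json.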