BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0
convert raw data to training format #92

Closed acul3 closed 1 month ago

acul3 commented 1 month ago

Hello, and first of all, thanks for the code!

I am planning to train on a dataset in a different language.

You said in the README:

Data preparation: We randomly sample 2 million image-text pairs from the coreset and convert them to training format.

Can you share the steps to convert them to the training format?

I am planning to convert laion-2b-multi to the training format too.

Isaachhh commented 1 month ago

The JSON file is a list of dictionaries, and each element of the list should look like:

{
  'id': '0010278167',  # doesn't matter
  'image': '0010278167.jpg',  # path to the image
  'conversations': [
    {'from': 'human', 'value': '<image>\n'},
    {'from': 'gpt', 'value': 'Piece of dark jeans fabric Royalty Free Stock Photography'}  # the text
  ]
}
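A minimal sketch of how image-caption pairs could be turned into this layout, not the script the Bunny authors used. The helper names `to_training_record` and `convert`, the zero-padded `id` scheme, and the in-memory `pairs` list are assumptions for illustration only; a real LAION subset would be read from its own metadata files.

```python
import json

def to_training_record(sample_id, image_name, caption):
    """Build one training entry in the layout described above (hypothetical helper)."""
    return {
        "id": sample_id,          # arbitrary identifier; the maintainer says it doesn't matter
        "image": image_name,      # path to the image
        "conversations": [
            {"from": "human", "value": "<image>\n"},
            {"from": "gpt", "value": caption},  # the caption text
        ],
    }

def convert(pairs):
    """Convert (image_name, caption) pairs into a list of training entries."""
    return [
        to_training_record(f"{i:010d}", image_name, caption)
        for i, (image_name, caption) in enumerate(pairs)
    ]

# Illustrative input; in practice this would come from the dataset metadata.
pairs = [
    ("0010278167.jpg", "Piece of dark jeans fabric Royalty Free Stock Photography"),
]
records = convert(pairs)

# ensure_ascii=False keeps non-English captions readable in the output.
training_json = json.dumps(records, ensure_ascii=False, indent=2)
print(training_json)
```

The example in the reply uses Python dict notation with single quotes and comments; `json.dumps` writes the same structure as valid JSON with double quotes, which can then be saved to a file.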

acul3 commented 1 month ago

Thanks for the reply.

I see some samples in the pretraining data: https://huggingface.co/datasets/BoyaWu10/Bunny-v1_0-data/blob/main/pretrain/bunny_pretrain_laion_2m.json

For example:

{
  "id":"0005128268",
  "image":"0005128268.jpg",
  "conversations":[
    {
      "from":"human",
      "value":"<image>\nGive a short and clear explanation of the subsequent image."
    },
    {
      "from":"gpt",
      "value":"First Impressions And Suggested Talent Build Of Reworked Xul In Patch 26 3 Articles Tempo Storm From the recesses of the eastern jungles comes a man cloaked in mystery. tempo storm"
    }
  ]
}

The human value contains a question such as "Give a short and clear explanation of the subsequent image."

I believe the LAION-2B subset doesn't provide this (correct me if I'm wrong).

Can you tell me how to obtain it?

Isaachhh commented 1 month ago

The question is deleted during training (see the code).

So you can just ignore it.
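To illustrate what "deleted" means here, a small sketch, assuming the preprocessing keeps only the image placeholder from the human turn. This is not Bunny's actual code; `IMAGE_TOKEN` and `strip_prompt` are hypothetical names.

```python
IMAGE_TOKEN = "<image>"  # assumed placeholder string, matching the samples in this thread

def strip_prompt(conversations):
    """Replace the human turn with the bare image placeholder (hypothetical helper)."""
    human, gpt = conversations
    # The human turn is expected to carry the image placeholder.
    assert IMAGE_TOKEN in human["value"]
    return [{"from": human["from"], "value": IMAGE_TOKEN}, gpt]

sample = [
    {"from": "human",
     "value": "<image>\nGive a short and clear explanation of the subsequent image."},
    {"from": "gpt", "value": "a caption"},
]
stripped = strip_prompt(sample)
print(stripped[0]["value"])
```

Under this assumption, any question text after the placeholder has no effect on pretraining, so a bare '<image>\n' in the human turn should behave the same.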

acul3 commented 1 month ago

Sure, thank you for the confirmation.