HaozheZhao / MIC

MMICL, a state-of-the-art VLM with multi-modal in-context learning ability, PKU

Training data preprocessing procedure #6

Closed buptlihang closed 1 year ago

buptlihang commented 1 year ago

Hello, and thank you very much for your open-source work! For the data in MIC_full, once I have downloaded the images, how should I process the data so it can be used for training? Are there detailed steps I can refer to? Thanks!

HaozheZhao commented 1 year ago

Hello there. The `input_image` value in the json files is the path to the instance's image. Unfortunately, uploading 2.7 million images to the Hugging Face Hub is challenging; we are currently working on it. In the meantime, you can use the MIC_Sample provided in the readme, which contains roughly 200,000 instances.
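For reference, a single instance in the json files looks roughly like the sketch below. The three field names come from the repository's data format, but the values and the exact shape of `input_image` (single path vs. list of paths) are illustrative assumptions:

```python
# Hypothetical shape of one instance in a MIC json file;
# values are made up for illustration.
instance = {
    "input_text": "image 0 is <image0>. Question: What is in the image? Answer:",
    "input_image": ["./images/coco/000000000001.jpg"],  # path(s) to the downloaded image(s)
    "output_text": "a cat sitting on a sofa",
}
```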

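As a rough sketch of how these pieces fit together, the snippet below shows a hypothetical sampling configuration and the final loading step. The dataset names, paths, and sizes are assumptions for illustration; check `data_preprocess.py` for the actual keys:

```python
from datasets import Dataset

# Hypothetical sampling configuration consumed by generate_new_json;
# the real keys and dataset names are defined in data_preprocess.py.
data_size = {
    "vqav2": 50_000,  # number of instances to sample from this dataset
    "coco": 50_000,
}
data_json = {
    "vqav2": "./MIC_full/vqav2/train.json",  # path to each dataset's json file
    "coco": "./MIC_full/coco/train.json",
}

# After to_arrowByDataset has written the preprocessed arrow file,
# it can be loaded directly for training:
ds = Dataset.from_file("./processed/train.arrow")
print(ds.column_names)  # preprocessed image arrays and text tokens
```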
For data preprocessing, you can utilize the python script "data_preprocess.py" available in the repository. Specifically, pay attention to the functions generate_new_json and to_arrowByDataset. generate_new_json is used to sample data from MIC full dataset. You need to specify the amount of data you want to use using 'data_size' dict and set the dataset path for each dataset using 'data_json' dict. If you wish to generate processed data and directly use it for training purposes, employ the 'to_arrowByDataset' function. It will handle all preprocessing steps and save the dataset as an arrow file that can be loaded directly with Dataset.from_file('path'). Please note that this arrow file stores numpy arrays for both images and text tokens, resulting in approximately 1TB storage requirement for half a million instances. The entire preprocessing process may take around half a day. If you prefer conducting your own datapreprocessing, refer to the 'process_raw_datajson_to_arrow' function. Each input instance consists of three columns: input_text, input_image, and output_text. We have defined preprocess functions as 'preprocess_function' and 'preprocess_function_batched'.