Hello there. The "input_image" value in the json files is the path to the instance's image. Unfortunately, uploading 2.7 million images to the Huggingface Hub is challenging; we are currently working on it. In the meantime, you can use the MIC_Sample provided in the readme, which contains roughly 200,000 instances.
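For illustration only, a single instance in the json files might look roughly like the sketch below. The three field names come from the description further down in this reply; the placeholder syntax inside `input_text`, the concrete paths, and whether `input_image` is a single string or a list are assumptions, so please check the MIC_Sample files for the authoritative layout.

```python
# Hypothetical instance layout (field names from the reply; everything else is assumed).
# "input_image" holds the local path(s) to the downloaded image(s);
# "input_text" / "output_text" carry the prompt and the target answer.
example_instance = {
    "input_text": "image 0 is <image0>. What is shown in image 0?",   # assumed placeholder style
    "input_image": ["/path/to/downloaded/images/000000123456.jpg"],   # assumed to be a list of paths
    "output_text": "A short textual answer for this instance.",
}
```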
For data preprocessing, you can use the Python script `data_preprocess.py` available in the repository. Pay particular attention to two functions, `generate_new_json` and `to_arrowByDataset`:

- `generate_new_json` samples data from the full MIC dataset. Specify how much data you want via the `data_size` dict and set the path for each dataset via the `data_json` dict.
- `to_arrowByDataset` generates processed data that can be used directly for training. It runs all preprocessing steps and saves the dataset as an arrow file that can be loaded with `Dataset.from_file('path')`. Note that this arrow file stores numpy arrays for both images and text tokens, so about half a million instances require roughly 1 TB of storage, and the whole preprocessing run may take around half a day.

If you prefer to do your own data preprocessing, refer to the `process_raw_datajson_to_arrow` function. Each input instance consists of three columns: `input_text`, `input_image`, and `output_text`. The preprocessing functions are defined as `preprocess_function` and `preprocess_function_batched`.
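As a minimal sketch of the second route described above: after `to_arrowByDataset` has written the arrow file, it can be loaded and inspected with the Hugging Face `datasets` library. The file path below is a placeholder, not an actual path from the repository.

```python
# Minimal sketch: load an arrow file produced by to_arrowByDataset and inspect it.
# Dataset.from_file is the standard Hugging Face datasets API mentioned in the reply;
# the file name is a placeholder -- substitute the path you passed to the script.
from datasets import Dataset

train_data = Dataset.from_file("path/to/preprocessed_train.arrow")  # placeholder path

print(train_data)            # column names and number of rows stored in the arrow file
print(train_data[0].keys())  # per the reply, images and text tokens are stored as numpy arrays
```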
Hello, and thank you very much for open-sourcing this work! I'd like to ask: for the data in MIC_full, after I have downloaded the images, how should I process the data and use it for training? Are there detailed steps I can refer to? Thanks!