ModelTC / OmniBal

15 stars 0 forks source link

how to generate internvl_sft_1.2M #1

Closed fyting closed 2 months ago

fyting commented 2 months ago

Thank you, great work! However, I'm not sure how the internvl_sft_1.2M.json was generated. Is there a script for it?

yqyao commented 2 months ago

This is the example for the InternVL process example, https://github.com/ModelTC/InternVL/commit/5f5de1d39f2d8ca38f77c4217d43ebb6fd6a4bd9 @fyting.

fyting commented 2 months ago

@yqyao Thank you for your response. In the internvl_chat/tools/data_preprocess_stastics.sh, three parameters are required. Could you please clarify what should be passed as the json_file and token_lengths_path parameters? Specifically, how is the file required for token_lengths_path obtained?

FL77N commented 2 months ago

@fyting token_lengths_path is a folder of the results of token stastics which is absolute path, json_file is original data file and output_path is results file. This command may be helpful: sh data_preprocess_stastics.sh internvl_format_data.json /your/path/token_lengths pack_internvl_format_data.json

fyting commented 2 months ago

@fyting token_lengths_path is a folder of the results of token stastics which is absolute path, json_file is original data file and output_path is results file. This command may be helpful: sh data_preprocess_stastics.sh internvl_format_data.json /your/path/token_lengths pack_internvl_format_data.json

Thank you for your response. My question is how to obtain the file in /your/path/token_lengths, such as https://github.com/ModelTC/OmniBal/blob/main/data/vision/ai2d_train_12k_wh_token_lengths.json. How is this file generated?

fyting commented 2 months ago

@fyting token_lengths_path is a folder of the results of token stastics which is absolute path, json_file is original data file and output_path is results file. This command may be helpful: sh data_preprocess_stastics.sh internvl_format_data.json /your/path/token_lengths pack_internvl_format_data.json

My apologies, I misunderstood earlier. The data_preprocess_stastics.sh script uses internvl_format_data.json to write the statistical results into multiple JSON files in /your/path/token_lengths and pack_internvl_format_data.json. Is that correct?

FL77N commented 2 months ago

@fyting You are right.