fyting closed this issue 2 months ago
@fyting Here is an example of the InternVL processing flow: https://github.com/ModelTC/InternVL/commit/5f5de1d39f2d8ca38f77c4217d43ebb6fd6a4bd9
@yqyao Thank you for your response. The script internvl_chat/tools/data_preprocess_stastics.sh requires three parameters. Could you please clarify what should be passed as the json_file and token_lengths_path parameters? Specifically, how is the file required for token_lengths_path obtained?
@fyting token_lengths_path is the absolute path of the folder that holds the token statistics results, json_file is the original data file, and output_path is the results file. This command may be helpful:
sh data_preprocess_stastics.sh internvl_format_data.json /your/path/token_lengths pack_internvl_format_data.json
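For anyone scripting this step, the argument order above can be sketched in Python. This is only an illustration: `build_preprocess_command` is a hypothetical helper that is not part of the InternVL or OmniBal repositories, and it assumes the script is called from the directory that contains it. It resolves the token-lengths folder to an absolute path, since the explanation above says that parameter must be absolute.

```python
import os
import shlex


def build_preprocess_command(json_file, token_lengths_dir, output_path,
                             script="data_preprocess_stastics.sh"):
    """Return the argv list for the preprocessing script (hypothetical helper).

    The positional order mirrors the command quoted in this thread:
    json_file, token_lengths_path, output_path.
    """
    # token_lengths_path is described in the thread as an absolute path,
    # so a relative folder name is resolved against the current directory.
    token_lengths_dir = os.path.abspath(token_lengths_dir)
    return ["sh", script, json_file, token_lengths_dir, output_path]


if __name__ == "__main__":
    cmd = build_preprocess_command(
        "internvl_format_data.json",
        "/your/path/token_lengths",
        "pack_internvl_format_data.json",
    )
    # Print the command instead of running it; pass `cmd` to
    # subprocess.run(cmd, check=True) to actually execute the script.
    print(shlex.join(cmd))
```

Printing the command first makes it easy to check the three positional arguments before launching a long preprocessing run.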
Thank you for your response. My question is how to obtain the file in /your/path/token_lengths, such as https://github.com/ModelTC/OmniBal/blob/main/data/vision/ai2d_train_12k_wh_token_lengths.json. How is this file generated?
My apologies, I misunderstood earlier. So the data_preprocess_stastics.sh script reads internvl_format_data.json, writes the statistical results as multiple JSON files into /your/path/token_lengths, and writes the final output to pack_internvl_format_data.json. Is that correct?
@fyting You are right.
Thank you, great work! However, I'm not sure how the internvl_sft_1.2M.json was generated. Is there a script for it?