[training] Move the current data-related args into a YAML file

Description

To simplify running the loading script, you no longer need to pass the path of every dataset on the command line. Instead, list the paths in a YAML file and pass it with the argument --training_data_yaml=pipeline/train/config.yaml.

In the configuration YAML file, use the name of each old command-line argument as a key and list its dataset paths beneath it:

old_command_1:
  - path_to_dataset_1_1
  - path_to_dataset_1_2

old_command_2:
  - path_to_dataset_2_1
  - path_to_dataset_2_2

For instance:

mimicit_vt_path:
  - /data/pufanyi/training_data/SD/SD_instructions.json
  - /data/pufanyi/training_data/CGD/CGD_instructions.json

images_vt_path:
  - /data/pufanyi/training_data/SD/SD.json
  - /data/pufanyi/training_data/CGD/CGD.json

Existing scripts need no changes; arguments such as mimicit_vt_path remain compatible and can still be passed on the command line.
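
As a rough illustration of the behaviour described above, the sketch below shows one way a YAML file of this shape could be folded into the parsed arguments. It is not the code in this PR: the helper names `load_yaml_file` and `merge_data_config` are hypothetical, and it assumes that each path argument is stored as a comma-separated string and that a value given on the command line takes precedence over the YAML entry.

```python
import argparse


def load_yaml_file(path):
    """Parse a YAML file into a dict of {argument_name: [paths, ...]}."""
    # Requires PyYAML (pip install pyyaml). Imported lazily so that the
    # merge helper below can be used and tested without it.
    import yaml

    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f) or {}


def merge_data_config(args, config):
    """Fill dataset-path arguments on `args` from a parsed YAML mapping.

    Assumptions (not taken from the PR): paths are joined into one
    comma-separated string, and a non-empty value already present on
    `args` (i.e. given on the command line) is left untouched.
    """
    for key, paths in config.items():
        if getattr(args, key, ""):
            continue  # keep the command-line value
        setattr(args, key, ",".join(paths or []))
    return args


def parse_args(argv=None):
    """Hypothetical entry point showing where the merge would happen."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data_yaml", default="")
    parser.add_argument("--mimicit_vt_path", default="")
    parser.add_argument("--images_vt_path", default="")
    args = parser.parse_args(argv)
    if args.training_data_yaml:
        merge_data_config(args, load_yaml_file(args.training_data_yaml))
    return args
```

With this arrangement, a run that passes only --training_data_yaml picks up every path from the file, while a run that still passes mimicit_vt_path directly behaves as before.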

Checklist

Before you open a pull request, please check whether a similar issue already exists or has been closed.

When you open a pull request, please be sure to include the following:

[x] A descriptive title: [xxx] XXXX
[x] A detailed description
[x] Assign an issue type tag (label):
- dataset (mimic-it download, usage, etc.),
- demo (online demo),
- doc (readme, wiki, paper, video, etc.),
- evaluation (evaluation result, performance of Otter, etc.),
- model (model configuration, components, etc.),
- train (training configuration, process, code, etc.)

Thank you for your contributions!

Luodian / Otter