Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Concatenation error in collate_fn when replicating Otter with MIMIC-IT #326

Closed Li-Qingyun closed 9 months ago

Li-Qingyun commented 9 months ago

Hi, I loaded the current MIMIC-IT dataset downloaded from the HF hub (it took about 23 minutes), and I got an error that stacking expects each tensor to be of equal size. The reason is that CGD has two images per comparison sample while E4D has 16 frames per video sample. It seems a sampler is needed so that all samples in each batch come from the same source (a minimal reproduction of the shape mismatch follows the log below).

Error log:

Loading Mimic-It Datasets: 100%|██████████| 4/4 [22:56<00:00, 344.01s/it]
Total training steps: 930825
  0%|          | 0/930825 [00:02<?, ?it/s]
Error: stack expects each tensor to be equal size, but got [1, 2, 3, 224, 224] at entry 0 and [1, 16, 3, 224, 224] at entry 1
['CGD_INS_053253', 'E4D_INS_000135_879499', 'E4D_04_INS_00390970', 'E4D_08_INS_00728879']
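For reference, the mismatch is easy to reproduce outside the trainer. A minimal sketch (shapes taken from the log above, using plain torch.stack as the default collation does):

import torch

# A CGD sample carries 2 images, an E4D sample carries 16 video frames,
# as reported in the log above.
cgd_vision = torch.zeros(1, 2, 3, 224, 224)
e4d_vision = torch.zeros(1, 16, 3, 224, 224)

# Default collation stacks per-sample tensors along a new batch dimension,
# which requires identical shapes, so this raises:
# RuntimeError: stack expects each tensor to be equal size,
# but got [1, 2, 3, 224, 224] at entry 0 and [1, 16, 3, 224, 224] at entry 1
torch.stack([cgd_vision, e4d_vision])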

I found your response in #254

You replied there that unless you set the batch_size to 1, you cannot use image and video datasets together because they have different shapes. What I'm focusing on is Otter with MIMIC-IT, not OtterHD with LA and M3IT.

The paper and the running script in the init-weight hub say the batch size is 4, and it seems the full MIMIC-IT dataset is used for in-context instruction tuning.

I'd like to discuss: what can I do to replicate Otter's representative results quickly? (Use all the data with more GPUs and bs=1? Use bs=4 but split the dataset with a per-source sampler, or just launch separate experiments per source? Is all of the data required?)
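To make the bs=4 option concrete, below is a minimal sketch of the kind of per-source batch sampler I have in mind (the class name SourceGroupedBatchSampler and the source_of_index mapping are my own placeholders, not anything from the Otter codebase):

import random
from collections import defaultdict
from torch.utils.data import Sampler

class SourceGroupedBatchSampler(Sampler):
    """Yield index batches whose samples all come from the same source
    dataset, so every vision tensor in a batch has the same shape."""

    def __init__(self, source_of_index, batch_size, shuffle=True):
        # source_of_index: list mapping dataset index -> source name ("CGD", "E4D", ...)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.groups = defaultdict(list)
        for idx, source in enumerate(source_of_index):
            self.groups[source].append(idx)

    def __iter__(self):
        batches = []
        for group in self.groups.values():
            indices = group[:]  # copy so each epoch reshuffles from the original order
            if self.shuffle:
                random.shuffle(indices)
            for i in range(0, len(indices), self.batch_size):
                batches.append(indices[i:i + self.batch_size])
        if self.shuffle:
            random.shuffle(batches)
        return iter(batches)

    def __len__(self):
        return sum((len(v) + self.batch_size - 1) // self.batch_size for v in self.groups.values())

Passed as batch_sampler to the DataLoader, this would keep batch_size=4 while guaranteeing that image and video samples are never mixed in one batch.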

I would appreciate the authors' advice; it would save some time in the preliminary stage of our research.

Again, thanks for the authors' brilliant work!

Below are all my modifications. I only modified the data loading, which should not affect the results. Data YAML (mimicit_data_1219aftfilter_newfmt.yaml):

IMAGE_TEXT_IN_CONTEXT:
  LADD:
    mimicit_path: downloaded_parquet_from_hfhub/LA/LADD_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/LA/LA_frames.parquet
    num_samples: -1
  LACR_T2T:
    mimicit_path: downloaded_parquet_from_hfhub/LA/LACR_T2T_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/LA/LA_frames.parquet
    num_samples: -1
  LACR_I2I:
    mimicit_path: downloaded_parquet_from_hfhub/LA/LACR_I2I_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/LA/LA_frames.parquet
    num_samples: -1
  LACONV:
    mimicit_path: downloaded_parquet_from_hfhub/LA/LACONV_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/LA/LA_frames.parquet
    num_samples: -1
  CGD:
    mimicit_path: downloaded_parquet_from_hfhub/CGD/CGD_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/CGD/CGD_frames.parquet
    num_samples: -1
  SD:
    mimicit_path: downloaded_parquet_from_hfhub/SD/SD_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/SD/SD_frames.parquet
    num_samples: -1
  DC:
    mimicit_path: downloaded_parquet_from_hfhub/DC/DC_instructions_1207_full.json
    images_path: 
    - downloaded_parquet_from_hfhub/DC/DC_frames_1207.parquet
    num_samples: -1
  TVC:
    mimicit_path: downloaded_parquet_from_hfhub/TVC/TVC_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/TVC/TVC_frames.parquet
    num_samples: -1
  VST_1219_lqyfilter:
    mimicit_path: downloaded_parquet_from_hfhub/VST/VST_instructions_1219_lqyfilter.json
    images_path: 
    - downloaded_parquet_from_hfhub/VST/VST_frames.parquet
    num_samples: -1
  E4D:
    mimicit_path: downloaded_parquet_from_hfhub/E4D/E4D_instructions_1207_full.json
    images_path: 
    - downloaded_parquet_from_hfhub/E4D/E4D_frames_1207_00.parquet
    - downloaded_parquet_from_hfhub/E4D/E4D_frames_1207_01.parquet
    num_samples: -1
  SN:
    mimicit_path: downloaded_parquet_from_hfhub/SN/SN_instructions.json
    images_path: 
    - downloaded_parquet_from_hfhub/SN/SN_frames.parquet
    num_samples: -1

I updated preload_dataset and the MIMIC-IT dataset class to enable Parquet loading and loading from multiple folders (E4D has two folders).

def preload_dataset(args):
    ...
                # Check if paths exist
                for path_key, path_value in data.items():
                    if path_key.endswith("_path"):
                        if isinstance(path_value, str):
                            if not os.path.exists(path_value):
                                raise ValueError(f"Dataset path {path_value} specified under {category} -> {dataset_name} does not exist.")
                        else:
                            assert isinstance(path_value, list), path_value
                            for folder_name in path_value:
                                for part_fname in os.listdir(folder_name):
                                    if not (part_fname.startswith("part.") and part_fname.endswith(".parquet")):
                                        continue
                                    part_fpath = os.path.join(folder_name, part_fname)
                                    if not os.path.exists(part_fpath):
                                        raise ValueError(f"Dataset path {part_fpath} specified under {category} -> {dataset_name} does not exist.")
    ...
    return dataset_info
for cur_mimicit_path, cur_images_path, cur_train_config_path, sampled_examples, task_name, task_desc in zip(
            self.mimicit_paths,
            self.images_paths,
            self.train_config_paths,
            self.num_samples_list,
            self.task_names,
            self.task_description,
        ):
             ...
            # >>>>> Modified from original code >>>>>
            # if cur_images_path not in ["", []] and \
            #     ((isinstance(cur_images_path, str) and cur_images_path not in loaded_images_path) or (isinstance(cur_images_path, list) and any(p not in loaded_images_path for p in cur_images_path))):
            #     if cur_images_path.endswith(".parquet"):
            #         # if os.path.isdir(cur_images_path):
            #         #     parquet_file = dd.read_parquet(cur_images_path, engine="pyarrow")
            #         # else:
            #         parquet_file = pq.ParquetFile(cur_images_path)
            #         dfs = []  # List to hold the DataFrames of each batch
            #         for batch in parquet_file.iter_batches(batch_size=1000):  # Adjust batch_size as needed
            #             batch_df = batch.to_pandas()
            #             dfs.append(batch_df)
            #         cur_df = pd.concat(dfs)  # Concatenate all DataFrames
            #         self.images.append(cur_df)
            #         loaded_images_path.add(cur_images_path)
            #     elif cur_images_path.endswith(".json"):
            #         with open(cur_images_path, "rb") as f:
            #             cur_df = pd.DataFrame(orjson.loads(f.read()))
            #         self.images.append(cur_df)
            #         loaded_images_path.add(cur_images_path)
            #     else:
            #         master_print(f"Error: {cur_images_path} is not supported!")
            #         import pdb
            #         pdb.set_trace()
            #     del cur_df
            # <<<<< Modified from original code <<<<<
            # >>>>> Refactored by lqy >>>>>
            # 1. enable partitioned parquet files
            # 2. enable multiple partition folders
            assert isinstance(cur_images_path, list), cur_images_path
            dfs = []
            for folder_path in cur_images_path:
                for part_fname in os.listdir(folder_path):
                    if not (part_fname.startswith("part.") and part_fname.endswith(".parquet")):
                        continue
                    part_fpath = os.path.join(folder_path, part_fname)
                    if part_fpath in loaded_images_path:
                        continue
                    parquet_file = pq.ParquetFile(part_fpath)
                    for batch in parquet_file.iter_batches(batch_size=1000):
                        batch_df = batch.to_pandas()
                        dfs.append(batch_df)
                    loaded_images_path.add(part_fpath)
            if len(dfs) > 0:
                cur_df = pd.concat(dfs)
                self.images.append(cur_df)
                del dfs
                del cur_df
            # <<<<< Refactored by lqy <<<<<
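As a possible simplification (an assumption about pyarrow's directory support, not something I benchmarked against the loop above), pyarrow.parquet.read_table can read a whole partition folder in one call, so the manual iteration over part.*.parquet files could be collapsed into a small helper:

import pandas as pd
import pyarrow.parquet as pq

def load_partitioned_frames(folder_paths):
    """Read every parquet partition folder into one DataFrame.
    pq.read_table accepts a directory and reads all part files inside it."""
    dfs = [pq.read_table(folder).to_pandas() for folder in folder_paths]
    return pd.concat(dfs) if dfs else pd.DataFrame()

Memory-wise this is equivalent to the batched loop above, since both end up concatenating everything into a single DataFrame.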

Additionally, I removed the samples with missing data from VST, hence the dataset name 1219aftfilter.