OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

Does the pretraining script ONLY pretrain vision-language tasks? #376

Open xcvil opened 1 year ago

xcvil commented 1 year ago

In 'data/pretrain_data/unify_dataset.py', lines 472-475:

if self.split == 'train' and self.dataset.data_cnt % 8 == 0:
    extra_samples += self.process_pure_text(0) if self.pure_text_dataset else []
    extra_samples += self.process_pure_image(0) if self.pure_image_dataset else []
    extra_samples += self.process_detection(0) if self.detection_dataset else []

Why is self.process_pure_text called with 0 instead of index?

ZhangYuanhan-AI commented 1 year ago

Yes, this part is really weird. I also need clarification about this. Do you have any thoughts?

xcvil commented 1 year ago

For the language, detection, and image-infilling tasks, OFA also uses file_dataset. Its __getitem__ looks like this:

def __getitem__(self, index):
    if self.data_cnt == self.row_count:
        print("reach the end of datafile, start a new reader")
        self.data_cnt = 0
        self._reader = self._get_reader()
    column_l = self._reader.readline().rstrip("\n").split(self.separator)
    self.data_cnt += 1
    column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]
    return column_l

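This part might explain why index=0 is used: __getitem__ never looks at index. It simply reads the next line from the file reader and starts a new reader once the end of the file is reached, so the value passed in makes no difference.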
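To make the behaviour of the __getitem__ above concrete, here is a minimal sketch of a dataset that reads rows sequentially and ignores the index it is given. SequentialFileDataset is a hypothetical stand-in, not OFA's actual file_dataset: it reads from an in-memory string instead of a file and omits the column selection and dtype conversion.

```python
import io


class SequentialFileDataset:
    """Hypothetical stand-in for file_dataset: rows come back in file order."""

    def __init__(self, text, separator="\t"):
        self._text = text
        self.separator = separator
        self.row_count = len(text.splitlines())
        self.data_cnt = 0
        self._reader = self._get_reader()

    def _get_reader(self):
        # Assumption: an in-memory reader replaces the real file handle.
        return io.StringIO(self._text)

    def __getitem__(self, index):
        # `index` is deliberately unused, mirroring the snippet above.
        if self.data_cnt == self.row_count:
            # End of data reached: restart from the first row.
            self.data_cnt = 0
            self._reader = self._get_reader()
        column_l = self._reader.readline().rstrip("\n").split(self.separator)
        self.data_cnt += 1
        return column_l


dataset = SequentialFileDataset("a\t1\nb\t2\nc\t3\n")
# Asking for index 0 four times walks through the rows and then wraps around.
rows = [dataset[0] for _ in range(4)]
print(rows)
```

Under these assumptions, repeated calls with the same index return successive rows, which is consistent with process_pure_text(0) still yielding a different sample on each call.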

ZhangYuanhan-AI commented 1 year ago

Then it makes sense. Thank you!

ZhangYuanhan-AI commented 1 year ago

Hi, another question:

Do you know the meaning of these two hyper-parameters, token_bucket_size and image_bucket_size? https://github.com/OFA-Sys/OFA/blob/3222996ac9a9520411b17b9aec319d48908a46c6/models/ofa/ofa.py#L368

Thank you for your kind help.