Open xcvil opened 1 year ago
In 'data/pretrain_data/unify_dataset.py', lines 472-475:

    if self.split == 'train' and self.dataset.data_cnt % 8 == 0:
        extra_samples += self.process_pure_text(0) if self.pure_text_dataset else []
        extra_samples += self.process_pure_image(0) if self.pure_image_dataset else []
        extra_samples += self.process_detection(0) if self.detection_dataset else []

Why is self.process_pure_text(0) called with 0 instead of index?
Yes, this part is really weird. I also need clarification about this. Do you have any thoughts?
For the language/detection/image_infilling tasks, OFA also uses file_dataset. There, __getitem__ is:

    def __getitem__(self, index):
        if self.data_cnt == self.row_count:
            print("reach the end of datafile, start a new reader")
            self.data_cnt = 0
            self._reader = self._get_reader()
        column_l = self._reader.readline().rstrip("\n").split(self.separator)
        self.data_cnt += 1
        column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]
        return column_l

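This part might explain why index=0: __getitem__ never uses its index argument and simply reads the next line from the reader, so the value passed in does not matter.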
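To make that behaviour concrete, here is a minimal sketch of a dataset that reads rows sequentially and ignores the index it is given. The class name SequentialReaderDataset and the in-memory io.StringIO reader are assumptions for illustration only; this is not OFA's actual FileDataset, which reads from a file on disk and also handles column selection and dtypes.

```python
import io


class SequentialReaderDataset:
    """Toy stand-in for a file-backed dataset that reads rows in order.

    Hypothetical illustration only; not OFA's actual FileDataset.
    """

    def __init__(self, text, separator="\t"):
        self._text = text
        self.separator = separator
        self.row_count = len(text.splitlines())
        self.data_cnt = 0
        self._reader = self._get_reader()

    def _get_reader(self):
        # Return a fresh reader positioned at the first row.
        return io.StringIO(self._text)

    def __getitem__(self, index):
        # `index` is never used: the row returned depends only on
        # how many rows have already been read.
        if self.data_cnt == self.row_count:
            # End of data reached, so start over from the first row.
            self.data_cnt = 0
            self._reader = self._get_reader()
        row = self._reader.readline().rstrip("\n").split(self.separator)
        self.data_cnt += 1
        return row


dataset = SequentialReaderDataset("a\t1\nb\t2\nc\t3\n")
# Asking for index 0 four times walks through the rows and then wraps around.
rows = [dataset[0] for _ in range(4)]
print(rows)
```

Under these assumptions, repeated calls with 0 return successive rows, and any other index would give the same sequence, which is consistent with the explanation above.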
Then it makes sense. Thank you!
Hi, another question:
Do you know the meaning of these two hyper-parameters, token_bucket_size and image_bucket_size? https://github.com/OFA-Sys/OFA/blob/3222996ac9a9520411b17b9aec319d48908a46c6/models/ofa/ofa.py#L368
Thank you for your kind help.