Nota-NetsPresso / BK-SDM

A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ECCV'24]

Discussion on preprocessing of LAION data #32

Closed · bokyeong1015 closed this issue 1 year ago

bokyeong1015 commented 1 year ago

[Question]

I have another question.

I split the LAION-Aesthetics V2 5+ dataset into several subsets (e.g., 5M, 10M, and 89M pairs) and made a metadata.csv for each subset.
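For reference, a minimal sketch of building one such metadata.csv, assuming the Hugging Face imagefolder convention (a `file_name` column plus a caption column); the paths and the `text` column name are illustrative, not my exact setup:

```python
# Sketch: build a per-subset metadata.csv in the Hugging Face
# "imagefolder" convention. Assumes each image has a same-named .txt
# caption file next to it; paths and column names are illustrative.
import csv
from pathlib import Path

subset_dir = Path("laion_aes_5m")  # one subset, e.g., the 5M split

with open(subset_dir / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "text"])
    for img in sorted(subset_dir.glob("*.jpg")):
        caption_file = img.with_suffix(".txt")
        if caption_file.exists():
            writer.writerow([img.name, caption_file.read_text().strip()])
```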

Then, when I tried to train on a subset with multiple GPUs, I encountered the error below.

I suspect the problem was caused by the data itself.

FYI, I didn't pre-process the data except for resizing to 512x512 when downloading it.
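For context, a sketch of such a resolution-only download, assuming the img2dataset Python API (the choice of tool, the parquet path, and the column names are illustrative):

```python
# Sketch: download a LAION metadata shard with only a 512x512 resize
# and no other preprocessing. img2dataset is an assumed choice of
# tool; the parquet path and URL/TEXT column names are illustrative.
from img2dataset import download

download(
    url_list="laion_aes_subset.parquet",  # illustrative metadata shard
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="files",     # writes image files plus .txt captions
    output_folder="laion_aes_subset",
    image_size=512,            # the only preprocessing: resize to 512x512
    resize_mode="center_crop",
    processes_count=16,
    thread_count=32,
)
```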

Did you also face this problem?

Or did you conduct any pre-processing of the LAION data?

```
Steps:   0%| | 283/400000 [35:52<813:24:06, 7.33s/it, kd_feat_loss=58.6, kd_output_loss=0.0447, lr=5e-5, sd_loss=0.185, step_loss=58.9]
Traceback (most recent call last):
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 1171, in <module>
    main()
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 961, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/accelerate/data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in __getitems__
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <listcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <dictcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: index 63 is out of bounds for dimension 0 with size 63
```

bokyeong1015 commented 1 year ago

@youngwanLEE, please find our response below:

> Did you also face this problem?

We haven't encountered such an error (`IndexError: index 63 is out of bounds for dimension 0 with size 63`).

> Did you conduct any pre-processing of the LAION data?

We removed some problematic image-text pairs (empty text files and images unreadable by PIL); however, those pairs produced error messages different from yours.
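For reference, a minimal sketch of this kind of filtering (the flat image/.txt caption layout and paths are assumptions, not our exact pipeline):

```python
# Sketch: flag image-text pairs whose caption file is empty or whose
# image PIL cannot read. The layout (img.jpg with img.txt beside it)
# and paths are assumptions.
from pathlib import Path
from PIL import Image

def is_valid_pair(img_path: Path) -> bool:
    caption_file = img_path.with_suffix(".txt")
    # Reject pairs with a missing or empty caption file.
    if not caption_file.exists() or not caption_file.read_text().strip():
        return False
    # Reject images that PIL cannot open or that fail verification.
    try:
        with Image.open(img_path) as im:
            im.verify()  # integrity check without a full decode
    except Exception:
        return False
    return True

data_dir = Path("laion_aes_subset")
bad_pairs = [p for p in data_dir.glob("*.jpg") if not is_valid_pair(p)]
print(f"{len(bad_pairs)} problematic pairs to remove")
```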


We've tried to reproduce this error (by changing batch sizes under a multi-GPU setting, adding empty lines to metadata.csv, and using very long or multi-line text prompts), but we could not: either no error or a different one occurred.

Could you provide more context about this error? We would greatly appreciate it if you could share any updates and/or your solution to this issue.

youngwanLEE commented 1 year ago

@bokyeong1015 thanks for your effort :)

I finally solved this problem.

The problem was caused by empty text files in the dataset.

When I filtered out the empty text pairs, the problem was solved.
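For reference, a minimal sketch of that filter over metadata.csv (assuming captions live in a `text` column; the path and column name are illustrative):

```python
# Sketch: drop metadata.csv rows whose caption is missing or empty,
# then rewrite the file. The "text" column name and path are assumed.
import pandas as pd

df = pd.read_csv("laion_aes_subset/metadata.csv")
keep = df["text"].notna() & df["text"].astype(str).str.strip().ne("")
df[keep].to_csv("laion_aes_subset/metadata.csv", index=False)
print(f"removed {(~keep).sum()} empty-caption rows")
```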

I have now started training models on larger datasets of over 10M image-text pairs.

Thanks again :)

It would be OK to close this issue.

bokyeong1015 commented 1 year ago

Great, thanks for sharing! Hope your training goes well :)