dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

Possible out-of-memory issue of dataloader #31

Closed by zhiqiangdon 2 years ago

zhiqiangdon commented 2 years ago

Hello,

I have read through your code but haven't run it yet. I have one question about the dataloader implementation. According to

https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L43

you load all the Arrow files into memory. The pre-training data total hundreds of gigabytes. Could this cause an out-of-memory (OOM) issue, or does the implementation assume a machine with very large memory?

Thanks,

dandelin commented 2 years ago

Hi @zhiqiangdon,

Apache Arrow's read_all() effectively loads lazily here, because the files are memory-mapped, so there will be no OOM issue. However, if you call the .to_pandas() method on the table, Arrow will load the dataset eagerly and you will run into the OOM issue.

zhiqiangdon commented 2 years ago

Thanks @dandelin,

I see that you call .to_pandas() on the text column (https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L58). I guess this operation doesn't load the image data, right?

dandelin commented 2 years ago

@zhiqiangdon

Yep, you are right. Arrow is a columnar format, so the data is loaded column by column.

zhiqiangdon commented 2 years ago

Thanks @dandelin!