Closed zhiqiangdon closed 2 years ago
Hi @zhiqiangdon,
Apache Arrow's read_all()
function actually performs lazy loading (the file is memory-mapped), so there will be no OOM issue.
However, if you call the .to_pandas()
method, then Arrow will load the dataset eagerly and you will face the OOM issue.
Thanks @dandelin,
I see that you call .to_pandas()
on the text column:
https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L58
I guess this operation doesn't load the image data, right?
@zhiqiangdon
Yep, you are right. Arrow is a columnar format, so the data is loaded in a column-wise manner.
Thanks @dandelin!
Hello,
I have read through your code but haven't run it yet. One question about the dataloader implementation. According to
https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L43
you load all the arrow files into memory. The pre-training data is hundreds of gigabytes. Could this cause an out-of-memory issue, or does this implementation assume a machine with a large amount of memory?
Thanks,