dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

utils/write_<>.py: Is there any way to write to disk on the fly instead of loading the entire dataFrame into memory? #65

Closed zdxdsw closed 2 years ago

zdxdsw commented 2 years ago

Hi!

I have been trying to re-format a dataset into the format accepted by this repo. However, the training set is too large: the process runs out of memory before all the binary images are loaded into the dataFrame. Is there any way to save a partial dataFrame to the .arrow file and keep appending to it?

Thanks a lot!

zdxdsw commented 2 years ago

OK, I think I figured out a way. I wrote my dataset out in chunks, similar to what you did in write_conceptual_caption.py.