VSainteuf / pastis-benchmark

MIT License

Loading dataset into RAM #13

Closed by Spiruel 2 years ago

Spiruel commented 2 years ago

I am seeing very low, close to zero, GPU usage when trying out PSETAE on the pixel-set dataset. To work around this, I am trying the cache=True option to load the dataset into memory. Unfortunately this is taking a long time: more than an hour!

Could I check a) how good GPU utilisation can be without preloading, and b) how long loading the dataset into memory can reasonably be expected to take?

Thanks

Sam

VSainteuf commented 2 years ago

The model is very light and is trained with a batch size of 16 by default, so low GPU memory usage is not too surprising. For me, training a PSE-LTAE with batch size 16 typically takes only 2 GB of VRAM. Is your problem that the GPU is idle for most of the training? If so, that does suggest a bottleneck in data loading. Are you loading from an SSD or an HDD? Also, did you make sure you have enough RAM to cache the whole dataset? (It usually takes ~40 GB of RAM for me.) Normally caching should not add to the data loading time: it happens batch by batch during the first epoch, and the following epochs are faster.
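To illustrate the "cached batch by batch during the first epoch" behaviour described above, here is a minimal sketch of a lazily caching dataset wrapper. It is not the repository's actual implementation: `CachedDataset`, `SlowDiskDataset` and `run_epoch` are hypothetical names, the wrapper only mimics the `__len__`/`__getitem__` interface of a map-style dataset, and disk reads are simulated with a short sleep.

```python
import time


class CachedDataset:
    """Hypothetical wrapper: keeps every sample in RAM after its first load."""

    def __init__(self, base, cache=True):
        self.base = base      # any object exposing __len__ and __getitem__
        self.cache = cache
        self._store = {}      # index -> sample, filled lazily

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        if self.cache and idx in self._store:
            return self._store[idx]      # fast path: served from memory
        sample = self.base[idx]          # slow path: read from disk
        if self.cache:
            self._store[idx] = sample
        return sample


class SlowDiskDataset:
    """Hypothetical stand-in for a dataset that reads one file per sample."""

    def __init__(self, n, delay=0.001):
        self.n = n
        self.delay = delay
        self.reads = 0        # counts simulated disk reads

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        time.sleep(self.delay)   # simulated disk latency (assumption)
        self.reads += 1
        return [idx] * 4         # placeholder for the real pixel-set arrays


def run_epoch(dataset):
    """Iterate once over the dataset and return the elapsed seconds."""
    start = time.perf_counter()
    for i in range(len(dataset)):
        dataset[i]
    return time.perf_counter() - start


if __name__ == "__main__":
    ds = CachedDataset(SlowDiskDataset(200))
    print("first epoch (fills cache):", run_epoch(ds))
    print("second epoch (from RAM):  ", run_epoch(ds))
```

With this pattern the first epoch costs the same as an uncached pass, since every sample still has to be read once, and only later epochs get faster. A first epoch that is projected to take an hour therefore points at slow reads rather than at the caching itself.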

Spiruel commented 2 years ago

Thank you for your helpful reply.

Yes, my GPU utilisation is tiny (< 1%), so I'm going to assume the scratch space I'm using to store the data is an HDD. The machine has enough RAM to cache the dataset, so I'll pursue that instead.

The progress bar for caching the dataset says it will take about an hour to load (I'm using num_workers=0). I'm assuming this is just an issue with my machine, and that caching is much faster than this for you? Please let me know if you have any recommendations on how to speed this up.

VSainteuf commented 2 years ago

In my setting, the first epoch (when caching is performed) takes ~320 seconds to complete, and later epochs take only around 150 seconds each. Increasing your number of workers should definitely help.
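One way to check whether data loading really is the bottleneck, and to compare different num_workers settings, is to time how long the training loop waits for each batch versus how long it spends computing. The sketch below is only an illustration: `profile_loader` and `step_fn` are hypothetical names, and the loader can be any iterable of batches. When the step runs on a GPU, `step_fn` would need to synchronise the device before returning, since CUDA work is launched asynchronously and the compute time would otherwise be under-reported.

```python
import time


def profile_loader(loader, step_fn, max_batches=50):
    """Split wall-clock time into 'waiting for data' and 'compute'.

    loader  -- any iterable yielding batches
    step_fn -- callable running one training step on a batch
               (hypothetical; on GPU it should synchronise before returning)
    """
    wait = 0.0
    compute = 0.0
    n = 0
    it = iter(loader)
    while n < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)         # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)               # time spent here is the model step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
        n += 1
    total = wait + compute
    return {
        "batches": n,
        "data_wait_s": wait,
        "compute_s": compute,
        "data_wait_fraction": wait / total if total > 0 else 0.0,
    }


if __name__ == "__main__":
    def slow_batches(n=10, delay=0.02):
        # Simulated slow storage: each batch takes `delay` seconds to arrive.
        for i in range(n):
            time.sleep(delay)
            yield i

    print(profile_loader(slow_batches(), step_fn=lambda batch: None))
```

A data_wait_fraction close to 1 means the step is almost always waiting for the next batch, which would be consistent with near-zero GPU utilisation; in that case more workers, faster storage, or caching are the levers to try.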

Spiruel commented 2 years ago

Unfortunately I have a memory leak problem with num_workers>0 at the moment, but I'll work on that. Thanks for your help!

VSainteuf commented 2 years ago

You're welcome!