Closed: davidgolub closed this issue 7 years ago.
Hi,
Thank you for your interest. I suppose you are training models with attention. Those models need to load features of size 14x14x2048 x batch_size, which can be challenging. That is why we trained our models on multiple SSDs in RAID 0 or on a PCIe SSD. You can try to locate your bottleneck using htop and atop; it could come from your worker threads or from I/O wait times.
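For a rough sense of why this is I/O-bound, here is a back-of-envelope sketch of the per-batch feature volume, assuming float32 features of shape 14x14x2048 per image (the shape comes from the reply above; the batch size of 128 is an assumed value for illustration):

```python
# Estimate the data volume loaded per batch for attention-style features.
# Assumptions: float32 (4 bytes), 14x14x2048 feature maps as stated in the
# thread; the batch size of 128 below is hypothetical, for illustration.
def feature_bytes(h=14, w=14, c=2048, dtype_bytes=4):
    """Bytes needed for one image's convolutional feature map."""
    return h * w * c * dtype_bytes

per_image = feature_bytes()        # 1,605,632 bytes = 1.53125 MiB per image
per_batch = 128 * per_image        # 196 MiB per batch of 128

print(f"per image: {per_image / 2**20:.2f} MiB")
print(f"per batch of 128: {per_batch / 2**20:.0f} MiB")
```

At roughly 196 MiB of mostly random reads per batch, a single disk saturates quickly, which matches the suggestion above to use SSDs in RAID 0 or a PCIe SSD.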
Please, let me know.
Great, thanks for your feedback!
Also, another question: I noticed that when training the model for VQA 2.0 (and, if I remember correctly, VQA 1.0 as well) with the default parameters on trainval, the accuracy peaks at around 60%. I assume you have a lot of experience tweaking the hyperparameters in this repo; do you have any intuition about why this may be happening? For instance, I increased nans from 2k to 3k, but could that be the cause? Or too much dropout? Any thoughts? The main issue seems to be with questions in the "other" category.
Sorry for the late answer. Are you talking about the training accuracy? Have you already solved your problem?
First of all, thank you for open-sourcing your code! It is very useful for my research.
I noticed that data loading is quite slow: out of the total time for a batch (10s, 13s, 5s), data loading takes about 80% of the time (8s, 11s, 4s) for most batches. Consequently, I can only get through about 4 epochs per day. I use 4 workers and the default options in the repository. Do you have any recommendations on how to speed that part up?
Thanks, David
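One way to confirm the 80% figure is to time the data wait and the compute separately inside the training loop. A minimal, self-contained sketch of that measurement (the sleeps are stand-ins for the real disk I/O and GPU work, not timings from the repo):

```python
import time

def load_batch():
    """Stand-in for one DataLoader fetch; the real cost here is disk I/O."""
    time.sleep(0.01)   # simulated I/O wait (hypothetical value)
    return [0] * 64    # dummy batch of 64 items

def train_step(batch):
    """Stand-in for the forward/backward pass."""
    time.sleep(0.002)  # simulated compute (hypothetical value)

end = time.time()
for _ in range(3):
    batch = load_batch()
    data_time = time.time() - end    # time blocked waiting on data
    train_step(batch)
    batch_time = time.time() - end   # total iteration time
    print(f"data {data_time:.3f}s / total {batch_time:.3f}s "
          f"({100 * data_time / batch_time:.0f}% loading)")
    end = time.time()
```

If the data fraction dominates like this, the usual PyTorch knobs are raising num_workers and enabling pin_memory on the DataLoader, though with features this large the disk itself is often the real limit, as noted in the reply above.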