amritasaha1812 / CSQA_Code


Training time #7

Closed sanyam5 closed 5 years ago

sanyam5 commented 6 years ago

Hi @amritasaha1812, I was wondering if you could help me with some training details.

1) For how many epochs must we train the model?

2) How much time does each epoch take? It's been about 12 hours since I started training and I am still waiting for the second epoch to complete. I am using two Nvidia 1080 Ti GPUs. I observe that GPU usage is low most of the time while CPU usage is at 100%. Is there a way to use multiple CPUs to increase the training speed?

Thanks,

vardaan123 commented 6 years ago

Responding in place of Amrita

  1. Since this was trained long ago, I don't remember exactly. We trained for approximately 3 weeks, and for us one epoch took > 24 hrs.
  2. You seem to have access to better hardware, and since your GPU usage is low (as you say), you could increase the batch size (within the constraints of GPU memory). CPU memory is mainly used for storing the wikidata dicts and should not grow with batch size. The model training itself takes place on the GPU, so I don't see any speedup from adding more CPUs.
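For what it's worth, if the CPU-side preprocessing (rather than the model itself) is the bottleneck, spreading it across worker processes can keep the GPU fed. A minimal sketch, assuming a hypothetical `preprocess` function stands in for whatever per-example CPU work (tokenization, wikidata lookups) dominates in this codebase:

```python
# Hypothetical sketch: parallelize CPU-bound preprocessing across processes
# so the GPU is not starved, then group examples into larger batches.
# `preprocess` is a placeholder, not part of CSQA_Code.
from multiprocessing import Pool


def preprocess(example):
    # Placeholder for the real per-example CPU work.
    return example * 2


def make_batches(data, batch_size):
    """Split data into fixed-size batches (the last batch may be smaller)."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


if __name__ == "__main__":
    data = list(range(10))
    # Run preprocessing on 4 worker processes instead of the main process.
    with Pool(processes=4) as pool:
        processed = pool.map(preprocess, data)
    # A larger batch_size means fewer, bigger batches per GPU step.
    batches = make_batches(processed, batch_size=4)
    print(batches)
```

This only helps if profiling confirms preprocessing is the slow part; if the bottleneck is feeding already-prepared tensors to the GPU, increasing the batch size alone is the simpler fix.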