TomVeniat / SANAS

Stochastic Adaptive Neural Architecture Search
66 stars 7 forks

Slow training #1

Closed VladimirMaksovic closed 5 years ago

VladimirMaksovic commented 5 years ago

I tried to reproduce your results, but the training is very slow on my side. Command I use: python main.py with adam speech_commands gru kwscnn static=True use_mongo=False use_visdom=False

In your paper you mentioned that it takes about one day on a single GPU computer, but on my side it takes 6 minutes per epoch, and there are 200000 of those.

In general an epoch looks like this:

INFO - main - ### Sarting epoch n°399 ###
No visdom
INFO - main - main.py with adam speech_commands gru kwscnn static=True use_mongo=False use_visdom=False
Train: 100%|██████████| 348/348 [05:16<00:00, 2.18it/s]
Validation: 100%|██████████| 49/49 [00:35<00:00, 4.71it/s]
Test: 100%|██████████| 49/49 [00:36<00:00, 1.36it/s]
INFO - main - Losses: 2.009(-1.107E+00)-2.046-2.051, Accuracies: 0.533-0.529-0.529, Avg cost: 1.246E+08-1.246E+08-1.246E+08
INFO - main - [198.0, 158.0, 282.0, 347.0, 393.0, 387.0, 397.0]
INFO - main - Best Val: 0.533 - Test: 0.528 (Epoch 397.0)

I believe I have PyTorch running with CUDA 10.0. A basic test shows this:

Python 3.6.7 (default, Oct 22 2018, 11:32:17) [GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> nums = torch.randn(2, 2)
>>> nums.to('cuda:0')
tensor([[-0.8770,  0.7333],
        [-1.0335, -0.7587]], device='cuda:0')
>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit

At the same time the CPU looks quite busy:

2055 vladimir 20 0 49.229g 1.843g 93092 R 116.3 5.9 0:05.27 python
2057 vladimir 20 0 49.228g 1.842g 93188 R 100.0 5.9 0:05.32 python
2059 vladimir 20 0 49.229g 1.843g 93188 R 100.0 5.9 0:04.96 python
2056 vladimir 20 0 49.226g 1.841g 93148 R 98.1 5.9 0:04.92 python
2058 vladimir 20 0 49.228g 1.842g 93188 R 94.2 5.9 0:05.04 python
2053 vladimir 20 0 49.227g 1.841g 92884 R 91.3 5.9 0:05.19 python
2054 vladimir 20 0 49.230g 1.844g 92652 R 90.4 5.9 0:05.10 python
2052 vladimir 20 0 49.229g 1.843g 92392 R 88.5 5.9 0:04.84 python
31251 vladimir 20 0 49.340g 2.134g 433064 R 14.4 6.9 34:27.64 python

Do you have any idea what I am missing / doing wrong?

TomVeniat commented 5 years ago

Hi Vladimir, thanks for your detailed issue! Your PyTorch install seems fine; I didn't run the code with CUDA 10.0, but I don't think the problem comes from there. The current implementation is indeed quite CPU-intensive because the MFCCs are extracted before each forward pass, and I think this bottleneck could be the source of the slowdown. How many cores do you have on your server?
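One generic way to avoid paying the MFCC cost on every forward pass is to compute the features once per sample and cache them. The sketch below is illustrative only, not the repo's actual code: `extract_mfcc` here is a stand-in for a real MFCC pipeline (e.g. `librosa.feature.mfcc` or `torchaudio.transforms.MFCC`), and `CachedFeatureDataset` is a hypothetical wrapper name.

```python
import numpy as np


def extract_mfcc(waveform: np.ndarray) -> np.ndarray:
    # Stand-in for a real, CPU-expensive MFCC extraction step;
    # we just take a log-magnitude spectrum and keep 40 coefficients.
    spectrum = np.abs(np.fft.rfft(waveform))
    return np.log1p(spectrum[:40])


class CachedFeatureDataset:
    """Compute features once per sample, then reuse them every epoch."""

    def __init__(self, waveforms):
        self.waveforms = waveforms
        self._cache = {}

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = extract_mfcc(self.waveforms[idx])
        return self._cache[idx]


rng = np.random.default_rng(0)
data = CachedFeatureDataset([rng.standard_normal(16000) for _ in range(4)])
first = data[0]
again = data[0]          # second access is served from the cache
print(first.shape)       # (40,)
print(first is again)    # True: the cached array object is reused
```

With this pattern the expensive extraction runs once per sample over the whole training run instead of once per epoch, which is exactly the kind of CPU bottleneck described above.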

> In your paper you mentioned that it takes about one day on a single GPU computer, but on my side it takes 6 minutes per epoch, and there are 200000 of those.

Where did you see this 200000 number? The model usually takes a few hundred epochs to converge.

VladimirMaksovic commented 5 years ago

Thanks for your reply.

Yeah... it's probably the slow CPU. I have a mobile Xeon platform; it is not bad, but it is still a laptop. I have another workstation with more power, so I will try on that one.

Sorry for my ignorance... I saw "nepochs = 200000" and was not thinking about the convergence process.

TomVeniat commented 5 years ago

Oh, you're right. The nepochs = 200000 is just an ugly way to make sure the experiment runs long enough to converge; the actual number here isn't important at all.
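The "huge epoch cap plus convergence" idea can be made explicit with a simple early-stopping loop. This is a generic sketch of the pattern, not the repo's implementation; `train_with_early_stopping`, `step`, and `patience` are illustrative names.

```python
def train_with_early_stopping(step, max_epochs=200_000, patience=20):
    """Call step(epoch) -> validation accuracy until it stops improving.

    max_epochs is only a safety cap (like nepochs = 200000 above);
    patience is how many epochs without improvement we tolerate.
    Returns the best accuracy and the epoch it was reached at.
    """
    best_acc, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        acc = step(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # converged: no improvement for `patience` epochs
    return best_acc, best_epoch


# Toy run: validation accuracy improves until epoch 300, then plateaus,
# so the loop stops long before the 200000-epoch cap.
best, at = train_with_early_stopping(lambda e: min(e, 300) / 300)
print(best, at)  # 1.0 300
```

With a stopping rule like this the cap is never reached in practice, which matches the intent described above.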