Unable to train on ubuntu corpus

AjayChoudary commented 7 years ago

I downloaded the Ubuntu corpus as described in Readme.md and ran the command below, but the process gets killed during training without any error message.

./main.py --corpus ubuntu --modelTag ubuntufull

Welcome to DeepQA v0.1 !

TensorFlow detected: v0.12.1
Training samples not found. Creating dataset...
Ubuntu dialogs subfolders: 100%|█████████████████████| 350/350 [1:49:50<00:00, 18.83s/it]
Extract conversations: 100%|█████████████████| 1852868/1852868 [1:28:19<00:00, 349.62it/s]
Saving dataset...
Loaded ubuntu: 891832 words, 6841047 QA
Model creation...
Initialize variables...
WARNING: No previous model found, starting from clean directory: /root/ajay/DeepQA-2/save/model-ubuntufull
Start training (press Ctrl+C to save and exit)...

----- Epoch 1/30 ; (lr=0.001) -----
Shuffling the dataset...
Killed

How can I debug this issue? No error or exception is raised before the process gets killed. There is no issue when training on cornell/scotus. OS: Ubuntu 16.04. Hardware: vSphere VM with an 8-core CPU and no GPU.

julien-c commented 7 years ago

Yes, the full dataset is pretty large, so you need a machine with a lot of memory. I'm going to push a flag that will let you specify a subset of the full dataset.

Conchylicultor commented 7 years ago

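The program is killed by the Linux kernel (the out-of-memory killer) when you run out of RAM, which is why you see "Killed" without any Python exception. Without a GPU, you also consume more RAM because the network itself is held in main memory.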
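To confirm the diagnosis, the kernel log (for example via `dmesg`) normally records an out-of-memory kill. From inside the program, a rough way to see how close you are to the limit is to print the peak memory of the process around the expensive steps. This is only a minimal sketch using the standard `resource` module; `peak_rss_mb` is a hypothetical helper, not something DeepQA provides, and the list below is just a stand-in for loading the dataset.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024.0 * 1024.0)
    return peak / 1024.0


# Hypothetical usage: measure before and after a memory-hungry step.
before = peak_rss_mb()
data = [list(range(100)) for _ in range(10000)]  # stand-in for the dataset
after = peak_rss_mb()
print("peak RSS: %.1f MB -> %.1f MB" % (before, after))
```

If the value printed just before "Shuffling the dataset..." is already close to the RAM of the VM, the next allocation is likely what triggers the kill.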

I had also planned to add a function that keeps only a fraction of the dataset, but I never implemented it.
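Such a function could look roughly like the sketch below. It is not code from the repository: `keep_fraction`, `ratio` and `seed` are hypothetical names, and it assumes the training samples are held in a plain Python list. To actually save memory it would have to be applied before the batches are built.

```python
import random


def keep_fraction(samples, ratio, seed=0):
    """Return a random subset containing roughly `ratio` of the samples."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not samples:
        return []
    # Keep at least one sample so training still has something to work on.
    n_keep = max(1, int(len(samples) * ratio))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(samples, n_keep)


# Hypothetical usage: keep 10% of a toy dataset of (question, answer) pairs.
pairs = [("q%d" % i, "a%d" % i) for i in range(1000)]
subset = keep_fraction(pairs, 0.1)
print(len(subset))
```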

Another problem with my code is that all batches are generated at once, which isn't really efficient in terms of memory (the dataset is basically duplicated in memory). Using a Python generator instead should make it possible to be much more memory-efficient, since only one batch needs to exist at a time.
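The generator idea could be sketched as follows. This is an illustration under assumptions, not the current DeepQA implementation: `iter_batches` is a hypothetical name, the samples are assumed to be in a list, and padding or encoding of the batch is left out. Only a list of indices is shuffled, so the samples themselves are never copied as a whole.

```python
import random


def iter_batches(samples, batch_size, shuffle=True, seed=None):
    """Yield batches lazily so that only one batch is materialized at a time."""
    indices = list(range(len(samples)))
    if shuffle:
        # Shuffle the indices instead of the samples to avoid a second copy.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


# Hypothetical usage with a toy dataset.
toy = list(range(10))
for batch in iter_batches(toy, 4, shuffle=False):
    print(batch)
```

The training loop would then iterate over `iter_batches(...)` once per epoch instead of holding a full list of batches.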

julien-c commented 7 years ago

@AjayChoudary @Conchylicultor Here's a quickfix that restricts the size of the dataset used (sorry, it's in an unrelated PR):

https://github.com/Conchylicultor/DeepQA/pull/57/commits/3ac60103dd323eac7f838dd949b7266f42cc508b

Unable to train on ubuntu corpus #56