AbrahamSanders / seq2seq-chatbot

A sequence2sequence chatbot implementation with TensorFlow.

Tensorflow-gpu 1.0.0 issue #18

Closed (sunn-e closed this issue 5 years ago)

sunn-e commented 5 years ago

Hi, I followed the Udemy tutorial, and it seems they used the TensorFlow 1.0.0 CPU version. I have a CUDA-compatible GPU. I tried installing the GPU version, but it seems I need to reinstall older versions of the CUDA toolkit (8) and cuDNN (5.1) to train this model on TensorFlow GPU. I did that and I'm still unable to train it. I also tried pip install tensorflow-gpu==1.0.0 as an alternative, but it gives me an error. Any idea what I should do? Should I port my TensorFlow 1.0 code to 2.0, or something else?

AbrahamSanders commented 5 years ago

Hi @sunn-e,

This project does not use TensorFlow 1.0.0 as the Udemy course does. It is compatible with 1.5.0 at minimum and any higher version along the 1.x line, such as 1.13.1. I am using CUDA 9, but with the latest TensorFlow you should be able to upgrade to CUDA 10.
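If it helps, a quick sanity check like the one below (assuming a 1.x GPU build is installed) will confirm the version and whether TensorFlow can actually see your GPU:

```python
# Sanity check, assuming a TensorFlow 1.x GPU build is installed.
import tensorflow as tf

print(tf.__version__)              # should be on the 1.x line, e.g. 1.13.1
print(tf.test.is_gpu_available())  # True if CUDA/cuDNN are set up correctly
```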

sunn-e commented 5 years ago

Okay. Thanks.

sunn-e commented 5 years ago

Even on the latest version of TensorFlow with CUDA 10, I'm getting a "Dst tensor is not initialized" error. I tried changing the rnn size and batch size, but that still did not solve the problem. My training stops after around the 13th or 14th batch of the 1st epoch. Should I downgrade to CUDA 9?

sunn-e commented 5 years ago

Also, I used python instead of the run command. Is it okay to do so?

AbrahamSanders commented 5 years ago

That sounds like an out-of-memory error. The batches get progressively larger (memory-wise) later in each epoch, since the training set is sorted by question sequence length in ascending order. By the 13th batch, TF is trying to allocate more space on your GPU than it can find free, so you get the allocation error (it should also say "OOM").
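As a toy illustration (made-up token counts, not the actual batching code), you can see how the padded size per batch grows once the data is sorted by length:

```python
# Toy illustration with made-up token counts: when samples are sorted by
# length, later batches pad to longer maximums, so memory per batch grows.
lengths = sorted([3, 5, 7, 12, 20, 35, 60, 90])  # hypothetical question lengths
batch_size = 2
for i in range(0, len(lengths), batch_size):
    batch = lengths[i:i + batch_size]
    print(batch, "->", max(batch) * len(batch), "padded tokens")  # 10, 24, 70, 180
```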

If you are training on a smaller GPU, then in addition to rnn_size and batch_size you can lower training_hparams/max_question_answer_words and training_hparams/conv_history_length. Try these (a rough sketch of the overrides follows below):

"max_question_answer_words": 30
"conv_history_length": 4
"rnn_size": 512
"batch_size": 32
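Here those overrides are sketched as a nested dict mirroring the training_hparams/... paths above. The exact section each key lives under in your hparams file may differ (I'm assuming rnn_size sits under a model-level section), so treat this as illustrative only:

```python
# Illustrative only -- the exact nesting in the hparams file may differ.
hparams_overrides = {
    "training_hparams": {
        "max_question_answer_words": 30,
        "conv_history_length": 4,
        "batch_size": 32,   # assuming batch_size also lives under training_hparams
    },
    "model_hparams": {      # assumed name for the model-level section
        "rnn_size": 512,
    },
}
```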

sunn-e commented 5 years ago

Thanks. It worked. How will this affect the performance of the model?

AbrahamSanders commented 5 years ago

max_question_answer_words specifies the maximum number of words included from any question or answer in the dataset during training. Anything longer is truncated after the last end-of-sentence punctuation (. ! ?) before the max length.
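Roughly, the truncation works like this (a simplified sketch, not the exact code in the repo):

```python
import re

def truncate_to_max_words(text, max_words=30):
    """Simplified sketch: clip to max_words, then cut back to the last
    end-of-sentence punctuation (. ! ?) inside that window, if any."""
    words = text.split()
    if len(words) <= max_words:
        return text
    clipped = " ".join(words[:max_words])
    sentence_ends = list(re.finditer(r"[.!?]", clipped))
    return clipped[:sentence_ends[-1].end()] if sentence_ends else clipped
```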

conv_history_length specifies the number of previous dialog turns that are prepended to each input question during training. More history means longer sequences, which take up more memory on the GPU. Making this shorter just means the network will see less dialog context for each response.
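Conceptually it is something like this (simplified, not the actual preprocessing code):

```python
def build_model_input(question, history, conv_history_length=4):
    """Simplified sketch: prepend the last N previous dialog turns to the
    question. Longer history -> longer input sequences -> more GPU memory."""
    context = history[-conv_history_length:] if conv_history_length > 0 else []
    return " ".join(context + [question])
```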

rnn_size is the number of units in the LSTM memory cell. Making this smaller theoretically means the model won't be able to learn as much, but making it too large can lead to overfitting. Unfortunately, there is no good automated metric (like accuracy) for cross-validating free-form dialog models, so hyperparameter tuning for layer size, number of layers, etc. is really a manual process.
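In TF 1.x terms, rnn_size roughly corresponds to num_units of each LSTM cell, e.g.:

```python
import tensorflow as tf

# Illustrative: rnn_size maps to the number of units in each LSTM cell (TF 1.x API).
cell = tf.nn.rnn_cell.LSTMCell(num_units=512)  # rnn_size = 512
```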

batch_size is the number of question-answer samples per batch. Smaller batches mean more frequent, more granular weight updates at the cost of training speed, but they can help if larger batches won't fit on the GPU. Typical values range from 32 to 128.
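For a rough sense of the trade-off, with a hypothetical 100,000-sample training set:

```python
import math

num_samples = 100_000  # hypothetical training-set size
for batch_size in (128, 64, 32):
    updates_per_epoch = math.ceil(num_samples / batch_size)
    print(batch_size, "->", updates_per_epoch)  # 128 -> 782, 64 -> 1563, 32 -> 3125
```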