bshao001 / ChatLearner

A chatbot implemented in TensorFlow based on the seq2seq model, with certain rules integrated.
Apache License 2.0
538 stars 212 forks source link

Out of memory ( Custom DataSet) #39

Closed kuberkaul closed 6 years ago

kuberkaul commented 6 years ago

@bshao001 great work! So I am using my own dataset to build vocab.txt and run training on it.

My dataset is 500 MB, which produces a vocab.txt of 15 MB. As I train on this on AWS SageMaker, I am getting out-of-memory errors.

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,1024]
         [[Node: dynamic_seq2seq/decoder/attention/attention_layer/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@dynamic_seq2seq/decoder/attention/attention_layer/kernel"], dtype=DT_FLOAT, seed=87654321, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dynamic_seq2seq/decoder/attention/attention_layer/kernel/Initializer/random_uniform/shape)]]
```

I am using:

| Instance | vCPU | GPU | Mem (GiB) | GPU Mem (GiB) | Network Performance |
|---|---|---|---|---|---|
| ml.p2.xlarge | 4 | 1×K80 | 61 | 12 | High |

You do mention that the vocab.txt size affects performance. What is the optimal dataset size/vocab size for this configuration, in your opinion?
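For context on why the vocab size dominates memory here, a rough back-of-envelope estimate helps. This is a sketch with assumed layer-size formulas (not the exact ChatLearner graph): the embedding and output-projection matrices both scale linearly with vocab size, and they are usually what blows up GPU memory in a seq2seq model.

```python
# Rough parameter-memory estimate for an NMT-style seq2seq model.
# Assumptions (not the exact ChatLearner graph): float32 weights,
# separate embedding and softmax projection, LSTM cells with 4 gates.
def seq2seq_param_bytes(vocab_size, num_units=1024, num_layers=2, bytes_per_param=4):
    embedding = vocab_size * num_units           # input embedding matrix
    projection = num_units * vocab_size          # output softmax projection
    # LSTM cost per layer: 4 gates x (input + recurrent) weights,
    # doubled for encoder + decoder.
    lstm = num_layers * 2 * 4 * (2 * num_units) * num_units
    return (embedding + projection + lstm) * bytes_per_param

if __name__ == "__main__":
    # e.g. a 500k-word vocab at 1024 units:
    gb = seq2seq_param_bytes(500_000) / 1024**3
    print(f"~{gb:.1f} GiB just for the weights")  # ~3.9 GiB, before activations/optimizer state
```

On a 12 GiB K80, the weights alone for a very large vocab can leave little room for activations, gradients, and optimizer slots, which is consistent with the OOM above.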

kuberkaul commented 6 years ago

Also, it seems to be utilizing only 1 GPU before failing. Do I need to do anything special for it to utilize all 4 GPUs?

bshao001 commented 6 years ago

@kuberkaul Try reducing the batch_size and see if you can manage to train the model.

Yes, when the vocab size is too big, the existing model (1024 units and 2 layers) may not be able to accommodate it, and you will end up with a poorly learned model even if training completes without an OOM issue. A direct symptom is that the perplexity cannot be reduced to a reasonable value. (A somewhat higher perplexity is expected when you have a bigger vocab and more training data; you have to try the trained model to get a feel for that.)

You can either increase the number of units, say, to 1200 or even more, but that will require more computing power. Or you can try to reduce the vocab size. Refer to this script to get an idea: https://github.com/bshao001/ChatLearner/blob/master/Data/Corpus/RedditData/secondcleaner.py.
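As an illustration of the vocab-reduction idea, here is a minimal sketch (not the actual logic of secondcleaner.py): count token frequencies in the corpus and keep only the N most frequent words, plus reserved special tokens. The special-token names (`<unk>`, `<s>`, `</s>`) are assumed placeholders.

```python
# Minimal vocab-trimming sketch: keep reserved tokens plus the
# max_vocab most frequent words seen in the training corpus.
from collections import Counter

def trim_vocab(corpus_lines, max_vocab=24000, specials=("<unk>", "<s>", "</s>")):
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split())  # naive whitespace tokenization
    # Reserved tokens first, then the most frequent words.
    kept = list(specials) + [w for w, _ in counts.most_common(max_vocab)
                             if w not in specials]
    return kept[:max_vocab + len(specials)]

if __name__ == "__main__":
    corpus = ["hello world hello", "world of chatbots"]
    print(trim_vocab(corpus, max_vocab=3))
```

Out-of-vocabulary words in the training data then map to `<unk>`, trading a little coverage for a much smaller embedding and softmax projection.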

The existing implementation utilizes the computing power of the first GPU, and the memory of all GPUs.

kuberkaul commented 6 years ago

Hmm, it still fails with out-of-memory even with 1500 units and after reducing the batch_size to 128 and then 64.

I can try to limit the data/vocab, but I was initially trying to tweak the hparams to make it work. Let me try that.

kuberkaul commented 6 years ago

@bshao001 - So I trimmed the dataset to 100 MB and the vocab to 4 MB, and with 1 GPU this seems to go well, but as it starts epoch 1 of training it fails with `tensorflow.python.framework.errors_impl.InvalidArgumentError: pos 0 out of range for string b'' at index 0`

Any idea why that would be ?

Full stacktrace:

```
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: pos 0 out of range for string b'' at index 0
         [[Node: Substr = Substr[T=DT_INT32](arg0, Substr/pos, Substr/len)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,?], [?,?], [?,?], [?], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
         [[Node: Size/_93 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_544_Size", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
```

bshao001 commented 6 years ago

@kuberkaul I cannot be sure, but it looks like it was caused by an empty string ''. Make sure you ran the preprocessing correctly, and that the vocab.txt file does not contain an empty string as a word.
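A quick sanity check along those lines can be run before training. This is a hypothetical helper (not part of the repo) that flags empty or whitespace-only lines in vocab.txt or the corpus files, which is what the `Substr` op appears to be choking on:

```python
# Hypothetical pre-training check: report 1-based line numbers of
# empty or whitespace-only lines, which break the Substr op above.
def find_empty_lines(lines):
    return [i for i, line in enumerate(lines, start=1) if not line.strip()]

if __name__ == "__main__":
    # Usage against a real file:
    #   with open("vocab.txt", encoding="utf-8") as f:
    #       bad = find_empty_lines(f)
    sample = ["hello", "", "world", "   "]
    print(find_empty_lines(sample))  # -> [2, 4]
```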

kuberkaul commented 6 years ago

Yup, that was it. Reducing the batch size and splitting the dataset into multiple files helped.

Thanks!