Also, it seems to be utilizing only 1 GPU before failing. Do I need to do anything special for it to utilize all 4 GPUs?
@kuberkaul Try to reduce the batch_size and see if you can manage to train the model.
Yes, when the vocab size is too big, the existing model (1024 units and 2 layers) may not be able to accommodate it, and you will end up with a poorly learned model even if training finishes without an OOM issue. A direct consequence is that the perplexity cannot be reduced to a reasonable value (it is expected to be somewhat higher when you have a bigger vocab size and more training data, but you have to try the trained model to get a feeling for that).
You can either increase the number of units, say to 1200 or even more, but that requires more computing power. Or you can reduce the vocab size. Refer to this script to get an idea: https://github.com/bshao001/ChatLearner/blob/master/Data/Corpus/RedditData/secondcleaner.py
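For reference, trimming the vocab usually comes down to keeping only the most frequent tokens and letting everything else fall back to the unknown token. A rough sketch, independent of the secondcleaner.py script linked above (the file names and the cutoff are just placeholders):

```python
from collections import Counter

# Count token frequencies over the whitespace-tokenized training corpus.
counts = Counter()
with open("train_corpus.txt", "r", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Keep only the N most frequent tokens; everything else maps to <unk> at training time.
TOP_N = 40000  # placeholder cutoff, tune for your memory budget
with open("vocab.txt", "w", encoding="utf-8") as out:
    for token, _ in counts.most_common(TOP_N):
        out.write(token + "\n")
```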
The existing implementation utilizes the computing power of the first GPU, and the memory of all GPUs.
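If you want to confirm where the heavy ops actually run, TensorFlow can log device placement when the session is created. A minimal sketch, assuming the TF 1.x session API used by this project:

```python
import tensorflow as tf

# Log where each op is placed so you can see whether computation
# really lands on /device:GPU:0 only, while memory is spread across GPUs.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
```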
Hmm, it still fails with out-of-memory even with 1500 units and after reducing the batch_size to 128 and then 64.
I can try to limit the data/vocab, but I was initially trying to tweak the hparams to make it work. Let me try that.
@bshao001 - So I trimmed the data set to 100 MB and the vocab to 4 MB, and with 1 GPU this seems to go well, but as it starts epoch 1 of training it fails with
tensorflow.python.framework.errors_impl.InvalidArgumentError: pos 0 out of range for string b'' at index 0
Any idea why that would be ?
Full stack trace:
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: pos 0 out of range for string b'' at index 0
  [[Node: Substr = Substr[T=DT_INT32](arg0, Substr/pos, Substr/len)]]
  [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,?], [?,?], [?,?], [?], [?]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
  [[Node: Size/_93 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_544_Size", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
@kuberkaul I cannot be sure, but it looks like it was caused by an empty string ''. Make sure you run the preprocessing correctly, and that the vocab.txt file does not contain an empty string as a word.
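A quick way to rule that out is to strip blank lines from both the corpus files and vocab.txt before training. A small sketch (the file names are illustrative, not part of the project):

```python
def drop_empty_lines(src_path, dst_path):
    """Copy src_path to dst_path, skipping blank or whitespace-only lines."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(line)

drop_empty_lines("vocab.txt", "vocab.cleaned.txt")
drop_empty_lines("train_corpus.txt", "train_corpus.cleaned.txt")
```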
Yup, that was it. Reducing the batch size and splitting the data set into multiple files also helped.
Thanks!
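In case it helps anyone else, splitting a large corpus into smaller files only takes a short script. A sketch along these lines (the chunk size and file names are just placeholders):

```python
CHUNK_LINES = 500000  # placeholder: lines per output file

with open("train_corpus.txt", "r", encoding="utf-8") as src:
    part, out = 0, None
    for i, line in enumerate(src):
        if i % CHUNK_LINES == 0:
            if out:
                out.close()
            part += 1
            out = open("train_corpus_part%02d.txt" % part, "w", encoding="utf-8")
        out.write(line)
    if out:
        out.close()
```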
@bshao001 great work! So I am using my own dataset to build vocab.txt and run training on it.
My dataset is 500 MB, which produces a vocab.txt of 15 MB. As I train on this on AWS SageMaker, I am getting out-of-memory errors.
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,1024] [[Node: dynamic_seq2seq/decoder/attention/attention_layer/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@dynamic_seq2seq/decoder/attention/attention_layer/kernel"], dtype=DT_FLOAT, seed=87654321, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dynamic_seq2seq/decoder/attention/attention_layer/kernel/Initializer/random_uniform/shape)]]
You do mention that vocab.txt will affect the performance. What is the optimal dataset size / vocab size for this configuration, in your opinion?