basveeling / wavenet

Keras WaveNet implementation
https://soundcloud.com/basveeling/wavenet-sample

OOM when allocating tensor #7

Open drobertduke opened 7 years ago

drobertduke commented 7 years ago

I have a 12GB GPU but attempting to train anything with the default settings produces an OOM on the first epoch. I had to dial the batch_size and the dilation_depth way down before it would even start. What settings are you using when you train?
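For a sense of what dialling dilation_depth down actually trades away, here is a minimal sketch assuming the usual WaveNet layout of kernel-size-2 convolutions with dilations 1, 2, 4, ..., 2**dilation_depth repeated nb_stacks times; the names are illustrative, not necessarily the repo's exact config keys:

```python
# Minimal sketch (not the repo's code): receptive field of a stack of kernel-size-2
# dilated convolutions with dilations 1, 2, 4, ..., 2**dilation_depth per stack.
def receptive_field(dilation_depth, nb_stacks=1):
    # each stack adds sum(dilations) = 2**(dilation_depth + 1) - 1 samples of context
    return nb_stacks * (2 ** (dilation_depth + 1) - 1) + 1

print(receptive_field(9))  # 1024 samples of context
print(receptive_field(7))  # 256 samples of context
```

Activation memory grows roughly linearly with the number of layers, but each layer removed halves the receptive field, so lowering dilation_depth is a real trade-off rather than a free fix.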

I tensorflow/core/common_runtime/bfc_allocator.cc:689]      Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 83 Chunks of size 256 totalling 20.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 512 totalling 512B
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 15 Chunks of size 1024 totalling 15.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 65536 totalling 64.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 59 Chunks of size 262144 totalling 14.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 520704 totalling 508.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 105 Chunks of size 524288 totalling 52.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 13 Chunks of size 67108864 totalling 832.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 67174400 totalling 128.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 67239936 totalling 64.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 67371008 totalling 64.25MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 67633152 totalling 64.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 68157440 totalling 65.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 134479872 totalling 128.25MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 269484032 totalling 257.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 541065216 totalling 516.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1090519040 totalling 1.02GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 2147483648 totalling 2.00GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 2214592512 totalling 2.06GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 3726535936 totalling 3.47GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 10.68GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit:                 11715375924
InUse:                 11472467200
MaxInUse:              11473515776
NumAllocs:                     563
MaxAllocSize:           3980291328

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ****************************************************************************************xxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 2.00GiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:968] Resource exhausted: OOM when allocating tensor with shape[65536,256,32,1]
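The allocator log shows the in-use chunks already at ~10.7 GiB, so allocator tuning alone cannot fix this; still, a common first step with logs like these is to stop TensorFlow from reserving the whole card up front, so that real usage stays visible in nvidia-smi while batch_size and friends are dialled down. A minimal sketch, assuming the TF 1.x Keras backend of that era:

```python
# Minimal sketch for the TF 1.x Keras backend: allocate GPU memory on demand instead of
# reserving (nearly) all of it at start-up. This does not shrink the model's footprint,
# so it will not cure the OOM above on its own.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# or cap the fraction of the card TF may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9
K.set_session(tf.Session(config=config))
```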
basveeling commented 7 years ago

I haven't trained with TensorFlow yet, but I'll look into it. In the meantime, try using Theano with CNMeM enabled (THEANO_FLAGS='lib.cnmem=1' KERAS_BACKEND=theano python wavenet.py).

ibab commented 7 years ago

The reason for this might be the same as for the TensorFlow implementation here: https://github.com/ibab/tensorflow-wavenet/issues/4#issuecomment-247474863 (I haven't looked at how Keras implements AtrousConvolution1D, though).
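For reference, a rough sketch (not the code from either repository) of how a kernel-size-2 dilated causal convolution can be written as two pointwise products on time-shifted inputs, i.e. without going through tf.nn.atrous_conv2d and the space_to_batch/batch_to_space reshuffling its documentation describes; names and shapes here are illustrative only:

```python
# Rough sketch: kernel-size-2 dilated causal 1D convolution as two pointwise matmuls on
# time-shifted inputs, avoiding tf.nn.atrous_conv2d entirely.
import tensorflow as tf

def dilated_causal_conv(x, w_curr, w_prev, dilation):
    """x: (batch, time, channels_in); w_*: (channels_in, channels_out); dilation >= 1."""
    # x_prev[t] == x[t - dilation], zero-padded at the front so the output stays causal
    x_prev = tf.pad(x, [[0, 0], [dilation, 0], [0, 0]])[:, :-dilation, :]
    # equivalent to a 2-tap filter whose taps sit `dilation` steps apart
    return (tf.tensordot(x, w_curr, axes=[[2], [0]]) +
            tf.tensordot(x_prev, w_prev, axes=[[2], [0]]))
```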

basveeling commented 7 years ago

I was wondering why Keras requires the dilation values to be equal in both dimensions when using TensorFlow; it uses tf.nn.atrous_conv2d under the hood. Thanks for the heads-up, and nice work on the fix :)!

basveeling commented 7 years ago

I'm closing this on the assumption that this has been fixed in TensorFlow by now. If not, please let me know.

Shoshin23 commented 7 years ago

Nope, this is not fixed in TensorFlow yet. I'm getting the exact same error as the OP. I'm trying the Theano backend now to see if it works.

raavianvesh commented 6 years ago

Did it work with the Theano backend?

meridion commented 5 years ago

I'd like to let you all know this is fixed in TensorFlow 1.10; it works like a charm. I'm using the current master essentially unmodified (well, technically I changed a single line in dataset to make the code work under Python 3.x).

basveeling commented 5 years ago

@meridion Thanks! Would you mind sending a pull request so other users can easily benefit from your fix?

HunterHantao commented 5 years ago

This is solved with Python 2.7 and tensorflow-gpu 1.8.0.