daniel-kukiela / nmt-chatbot

NMT Chatbot

bfc_allocator #95

Closed: battleg3ar closed this issue 5 years ago

battleg3ar commented 5 years ago

First, here is some info about the system and settings to make debugging easier:

- GPU: GTX 1070 Ti (compute capability 6.1)
- VRAM: 8 GB
- RAM: 16 GB
- OS: Windows 10
- Dataset: 2018-05 Reddit comments, about 10M paired rows

So I have been testing this model with various settings, but every time the model runs into an issue once it completes 5000 steps. It successfully creates the output model (output_dev) after 5000 steps, but after that it goes downhill. I tried a vocab_size of 100000 since I have 8 GB of memory, but just for the sake of debugging I also tried reducing the vocab size, with no success. Up to 5000 steps the NMT outputs are pretty coherent, although the BLEU score is 0 (I think that's because my dataset is big). And after 5000 steps it just finishes all 3 epochs. I will try debugging and post a solution if I find one; until then, any help will be very much appreciated. Also, thanks to Daniel and Harrison for all the effort and the awesome tutorial! Cheers!

Here's the cmd dump:

decoding to output model/output_dev
2018-09-10 23:43:30.298629: W T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.85GiB. Current allocation summary follows.
2018-09-10 23:43:30.304406: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (256): Total Chunks: 94, Chunks in use: 94. 23.5KiB allocated for chunks. 23.5KiB in use in bin. 373B client-requested in use in bin.
2018-09-10 23:43:30.309709: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-09-10 23:43:30.314315: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (1024): Total Chunks: 3, Chunks in use: 3. 3.8KiB allocated for chunks. 3.8KiB in use in bin. 3.0KiB client-requested in use in bin.
[...]
2018-09-10 23:43:32.063148: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:680] Stats:
Limit:        6683898676
InUse:        5902498560
MaxInUse:     6307343616
NumAllocs:    31046268
MaxAllocSize: 4520998144

2018-09-10 23:43:32.071614: W T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:279] *****___
2018-09-10 23:43:32.075987: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1275] OP_REQUIRES failed at tensor_array_ops.cc:673 : Resource exhausted: OOM when allocating tensor with shape[281,32,115003] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1278, in _do_call
    return fn(*args)
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[281,32,115003] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: dynamic_seq2seq/decoder/decoder/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[dtype=DT_FLOAT, element_shape=[?,115003], _device="/job:localhost/replica:0/task:0/device:GPU:0"](dynamic_seq2seq/decoder/decoder/TensorArray, dynamic_seq2seq/decoder/decoder/TensorArrayStack/range, dynamic_seq2seq/decoder/decoder/while/Exit_2)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
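For reference, the size of that failed allocation follows directly from the tensor shape in the message: a float32 tensor of shape [281, 32, 115003] works out to about 3.85 GiB. A quick back-of-envelope check (my own arithmetic, not anything from the repo):

```python
# Back-of-envelope check of the failed allocation from the log above:
# shape [281, 32, 115003], dtype float32 (4 bytes per element).
elements = 281 * 32 * 115003          # seq_len x batch x vocab_size
size_gib = elements * 4 / 1024 ** 3
print(f"{size_gib:.2f} GiB")          # -> 3.85, matching the OOM message
```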

daniel-kukiela commented 5 years ago

What did you change in hparams? For testing purposes you could set steps_per_external_eval to 1000 or even 100 in the hparams section (that will force an evaluation after that number of steps). You are hitting OOM, but it's interesting that it occurs somewhere after the first evaluation. What OS and version of Python are you using? Yes, the BLEU score will be 0. It is calculated from the overlap of words between the desired answer and the NMT answer, and it's not important here, as we are doing en-en translation and coherency is way more important.
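To make that concrete, here is a sketch of what the override could look like; the file name setup/settings.py and the surrounding dict layout are assumptions based on the tutorial, so adjust to your checkout:

```python
# Hypothetical excerpt -- assumed to live in the hparams dict of
# setup/settings.py, which nmt-chatbot passes through to tf-nmt.
hparams = {
    # ... other entries left as they are ...
    'steps_per_external_eval': 100,    # force an external eval every 100 steps
    'override_loaded_hparams': True,   # so a resumed run picks up the change
}
```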

battleg3ar commented 5 years ago

I tried 'batch_size': 64 with 'override_loaded_hparams': True, but with no success; I got the exact same output. What's more interesting is that I also tried the default settings, i.e. with a 15000 vocab_size, and it crashed at the very first iteration, hitting OOM. Very arbitrary indeed! I'll try setting steps_per_external_eval to 100 for testing purposes, but won't that put more strain on VRAM resources? Just speculation, I am no master. I'll try it and report back.

And I am using Windows 10 with Python 3.6.2 and TensorFlow 1.10.0 with the latest NMT. I don't think TensorFlow 1.10 is the problem here, because the latest NMT works just fine with it.

battleg3ar commented 5 years ago

Here's an update. I tried reducing beam_width to 10, along with 'num_translations_per_input': 10 and 'num_train_steps': 500000. It slowed down the process, and yet I faced the same issue: OOM after 5000 steps.

daniel-kukiela commented 5 years ago

What do you mean by the newest NMT? Do you mean my fork? You should be using my fork (it's a slightly modified nmt). Could you possibly try TF 1.5? I didn't test this with newer TF versions.

battleg3ar commented 5 years ago

Okay, so I made it work. The issue was the batch size. My dataset was 10M pairs. With batch_size left at the default (128), the memory required for each training iteration was very high, which led to the OOM. A larger batch size can give higher accuracy, but it requires much more memory. So I lowered batch_size to 16, and it worked fine with the 100000 vocab. I hope this helps with the issue. Also, Daniel, TF 1.10.0 is working great; performance is much better on it. I am using this NMT: https://github.com/tensorflow/nmt.
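The arithmetic backs this up: a seq_len x batch x vocab tensor grows linearly with batch size, so going from 128 down to 16 shrinks it 8x. Rough numbers (my own estimate, reusing the seq_len and vocab from the OOM message above; actual usage is higher since many other tensors exist too):

```python
# Rough size of one seq_len x batch x vocab float32 tensor, in GiB.
def logits_gib(batch, seq_len=281, vocab=115003):
    return seq_len * batch * vocab * 4 / 1024 ** 3

print(f"batch 128: {logits_gib(128):.1f} GiB")  # ~15.4 GiB -- no chance on 8 GB
print(f"batch  32: {logits_gib(32):.2f} GiB")   # ~3.85 GiB -- the failed alloc
print(f"batch  16: {logits_gib(16):.2f} GiB")   # ~1.93 GiB -- fits
```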

Also, I have another question regarding TensorBoard. I checked the code for embeddings metadata in your repo, and there are no encoder or decoder TSV files for the projector in TensorBoard. Am I missing something? This may be a stupid question, but I am not able to load the projector in TensorBoard.

daniel-kukiela commented 5 years ago

Like I said, you have to use my NMT fork; there are some fixes and changes in it as well. The pbtxt file for the projector is written during training data preparation, so if you removed the model folder, you removed that file as well. Also, you are probably using our tokenizer, so the projector won't help much (you'll see tokens, not words, there).
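If you still want per-row labels anyway, a minimal sketch of generating a metadata TSV from a vocab file; the paths model/vocab.from and model/metadata.tsv are assumptions for illustration, not the repo's actual layout:

```python
# Minimal sketch (not from the repo): write the one-label-per-row metadata
# TSV that TensorBoard's projector expects, using the training vocab file.
vocab_path = "model/vocab.from"       # assumed location of the vocab file
metadata_path = "model/metadata.tsv"  # file the projector config points at

with open(vocab_path, encoding="utf-8") as src, \
     open(metadata_path, "w", encoding="utf-8") as dst:
    for token in src:
        dst.write(token.strip() + "\n")   # row i labels embedding row i
```

The metadata_path entry in the generated pbtxt would then need to point at this file.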

battleg3ar commented 5 years ago

Oh yeah, that makes sense. Thanks. I am aware of the pbtxt file; I just wanted to use the projector with labels. Now that you mention it, yeah, only tokens can be seen, I suppose. I think Harrison implemented it with labels. I'll look through his source and see if I can manage to incorporate it into this build. Thanks for all the help!

daniel-kukiela commented 5 years ago

Sorry for the late reply, I was AFK for a couple of days. Harrison was using my code ;) You can still download it, and you can disable the advanced tokenization by setting use_bpe to False; it'll then split sentences by words.
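For reference, a guess at what that toggle looks like in the settings file; the key name comes from the comment above, but the file location and dict layout are assumptions:

```python
# Assumed excerpt from setup/settings.py -- per the comment above, turning
# BPE off makes preprocessing split sentences into plain words instead of
# subword tokens, which also gives readable projector labels.
preprocessing = {
    # ... other preprocessing options ...
    'use_bpe': False,
}
```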

battleg3ar commented 5 years ago

Hey Daniel, it's cool! I got that working. Though somehow I ended up creating an output_dev for every 5000 steps; it's not a big deal, since they're small in size. The bot is getting much more coherent now and is at 392000 steps. But many times, when asked something, it'll output random links. Got any input on this?

daniel-kukiela commented 5 years ago

Yes, that's the case for us as well. With the newest code you can add a regex to the scoring that will lower the score of links; that code also checks whether links are valid and scores down responses containing invalid (non-existent) links. The other thing is that if a model outputs a link, it usually does so in every one of the 20 (by default) responses. If you really want to avoid links, you should remove them from sentences before training, or simply not add sentences with URLs to the training dataset.
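For the "remove them before training" route, a small standalone sketch (my own, not the repo's scoring code) that drops any training pair where either side contains something URL-shaped; the train.from/train.to file names follow the tutorial's convention and may differ in your setup:

```python
import re

# Deliberately loose URL detector -- tune it to your data.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

def drop_url_pairs(src_in, tgt_in, src_out, tgt_out):
    """Copy parallel files, skipping pairs where either line contains a URL."""
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as gs, \
         open(tgt_out, "w", encoding="utf-8") as gt:
        for s, t in zip(fs, ft):
            if URL_RE.search(s) or URL_RE.search(t):
                continue   # skip pairs containing links
            gs.write(s)
            gt.write(t)

drop_url_pairs("train.from", "train.to", "train.nourl.from", "train.nourl.to")
```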