Hey @orgicus – Thanks for the detailed info. Did you follow along with the usage instructions? (specifically step 3 about processing the data)
Hi Matt, thank you so much for getting in touch and sorry to take up your time with this. It might be a case of RTFM on my side 😊 Thank you for pointing me in the right direction.

I started this yesterday: `python process_data.py -t ../Data/Ms_Pacman/Train/ ../Data/.Clips/`. Currently it's at `Processed 2799700 clips`. I haven't passed `--num-clips`, so now I'm eagerly awaiting the 5,000,000 counter :))
Eventually the data processing completed and I started the `avg_runner.py` script, but after a full night of number crunching my 2GB GPU ran out of memory:
```
I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 56 Chunks of size 256 totalling 14.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 31 Chunks of size 512 totalling 15.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 25 Chunks of size 1024 totalling 25.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 15 Chunks of size 2048 totalling 30.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3072 totalling 3.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 4096 totalling 16.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 6912 totalling 6.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 11776 totalling 11.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 13824 totalling 54.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 7 Chunks of size 38400 totalling 262.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 55296 totalling 162.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 61440 totalling 60.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 75264 totalling 367.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 131072 totalling 256.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 192000 totalling 937.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 245248 totalling 239.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 322560 totalling 315.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 360448 totalling 352.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 376320 totalling 1.08MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 524288 totalling 1.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 589824 totalling 576.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 662272 totalling 646.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 19 Chunks of size 1179648 totalling 21.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1409024 totalling 1.34MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 2686976 totalling 2.56MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3225600 totalling 3.08MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 9 Chunks of size 3276800 totalling 28.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3538944 totalling 3.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4194304 totalling 4.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 4718592 totalling 36.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5111808 totalling 4.88MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 6553600 totalling 18.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 12320768 totalling 11.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 13107200 totalling 100.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 19660800 totalling 18.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 29360128 totalling 28.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 429004800 totalling 409.13MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 827952128 totalling 789.60MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 1.45GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 1587499008
InUse: 1559270656
MaxInUse: 1586930688
NumAllocs: 9986112
MaxAllocSize: 1260182528
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************xxxxxxxx************************************xxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 262.50MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[8,256,210,160]
Traceback (most recent call last):
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 182, in main
runner.train()
File "avg_runner.py", line 90, in train
self.test()
File "avg_runner.py", line 98, in test
batch, self.global_step, num_rec_out=self.num_test_rec)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 389, in test_batch
feed_dict=feed_dict)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8,256,210,160]
[[Node: generator/scale_3/calculation/convolutions_1/Conv2D_2 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](generator/scale_3/calculation/convolutions_1/Relu_1, generator/scale_3/setup/Variable_4/read)]]
Caused by op u'generator/scale_3/calculation/convolutions_1/Conv2D_2', defined at:
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 178, in main
runner = AVGRunner(num_steps, load_path, num_test_rec)
File "avg_runner.py", line 50, in __init__
c.SCALE_KERNEL_SIZES_G)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 48, in __init__
self.define_graph()
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 179, in define_graph
last_scale_pred_test)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 127, in calculate
preds, ws[i], [1, 1, 1, 1], padding=c.PADDING_G)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
data_format=data_format, name=name)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[8,256,210,160]
[[Node: generator/scale_3/calculation/convolutions_1/Conv2D_2 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](generator/scale_3/calculation/convolutions_1/Relu_1, generator/scale_3/setup/Variable_4/read)]]
```
Is there a way to "resume" the process from just before it crashed? :D
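For reference, the failed 262.50MiB allocation is exactly the size of the tensor named in the error; a quick sketch of the arithmetic, assuming float32 values:

```python
# Size of the tensor TensorFlow could not allocate: shape [8, 256, 210, 160], float32 (4 bytes each)
num_values = 8 * 256 * 210 * 160          # 68,812,800 values
size_mib = num_values * 4 / 1024.0 ** 2   # bytes -> MiB
print(size_mib)                           # 262.5 -- matches "Ran out of memory trying to allocate 262.50MiB"
```

With only ~1.48GiB usable on the card (the `Limit` line above) and ~1.45GiB already in use, an extra 262.5MiB chunk simply can't fit.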
Hmm yeah, I trained this on 6GB GPUs, so you might need to change the batch size or some other hyperparams to get it to work on 2GB. You can load the last-saved version of your model by passing in its `.ckpt` file with the `-l` flag.
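For example, a minimal sketch of the kind of change meant here, assuming the batch-size hyperparameter lives in the constants module (`c`) that the traceback references; the exact constant names may differ:

```python
# constants.py -- illustrative only; check the repo's constants module for the real names.
# Taking the leading 8 in the failing shape as the batch dimension, halving the batch size
# roughly halves per-step activation memory: the failing [8, 256, 210, 160] float32 tensor
# (262.5 MiB) would shrink to ~131 MiB at batch 4.
BATCH_SIZE = 4   # was 8 when the OOM above occurred
```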
Thank you very much for the explanations, worked like a charm! ❤️
Hi,
I've cloned the repo and grabbed your trained models in an attempt to quickly see the demo running on my computer, but I'm getting an error and I'm not 100% sure I've understood how to place the data correctly:
I'm running the script by first `cd`ing into `Code`, then running `python avg_runner.py -l ../Models/Adversarial/model.ckpt-500000`. I've added a print statement before the error line to see what the variables hold, and it looks like the `.Clips` folder is empty: `get_train_batch c.TRAIN_DIR_CLIPS ../Data/.Clips/ c.NUM_CLIPS 0`. I've double-checked and that seems to be the case.

I feel I'm missing something: should I have downloaded the contents of the `.Clips` folder (if so, from where?), or should the `.Clips` contents be generated? How can I double-check and make sure I'm using the examples correctly?
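For reference, a minimal sanity check of the above (run from `Code/`, using the same relative path):

```python
# Count whatever has been written into the processed-clips directory so far
import os

clip_dir = '../Data/.Clips/'
print(len(os.listdir(clip_dir)))  # 0 here, so get_train_batch has nothing to sample from
```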
I am using TensorFlow version `0.12.0` with GPU support in a virtual environment, on OSX 10.11.5 with an nVidia GeForce GT 750M (2GB VRAM), CUDA 8.0 and CuDNN 5.1 installed.

The first 3 levels of the repo look like this:
Full output:
I appreciate any tips or advice you can share.
Thank you, George