Hey @orgicus – Thanks for the detailed info. Did you follow along with the usage instructions? (specifically step 3 about processing the data)
Hi Matt, thank you so much for getting in touch and sorry to take up your time with this. It might be a case of RTFM on my side 😊 Thank you for pointing me in the right direction.

I started this yesterday: `python process_data.py -t ../Data/Ms_Pacman/Train/ ../Data/.Clips/`. Currently it's at `Processed 2799700 clips`. I haven't passed `--num-clips`, so now I'm eagerly awaiting the 5,000,000 counter :))
Eventually the data processing completed and I started the `avg_runner.py` script, but after a full night of number crunching my 2GB GPU ran out of memory:
```
I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 56 Chunks of size 256 totalling 14.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 31 Chunks of size 512 totalling 15.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 25 Chunks of size 1024 totalling 25.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 15 Chunks of size 2048 totalling 30.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3072 totalling 3.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 4096 totalling 16.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 6912 totalling 6.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 11776 totalling 11.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 13824 totalling 54.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 7 Chunks of size 38400 totalling 262.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 55296 totalling 162.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 61440 totalling 60.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 75264 totalling 367.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 131072 totalling 256.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 192000 totalling 937.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 245248 totalling 239.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 322560 totalling 315.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 360448 totalling 352.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 376320 totalling 1.08MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 524288 totalling 1.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 589824 totalling 576.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 662272 totalling 646.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 19 Chunks of size 1179648 totalling 21.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1409024 totalling 1.34MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 2686976 totalling 2.56MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3225600 totalling 3.08MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 9 Chunks of size 3276800 totalling 28.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3538944 totalling 3.38MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4194304 totalling 4.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 4718592 totalling 36.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5111808 totalling 4.88MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 6553600 totalling 18.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 12320768 totalling 11.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 13107200 totalling 100.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 19660800 totalling 18.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 29360128 totalling 28.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 429004800 totalling 409.13MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 827952128 totalling 789.60MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 1.45GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 1587499008
InUse: 1559270656
MaxInUse: 1586930688
NumAllocs: 9986112
MaxAllocSize: 1260182528
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************xxxxxxxx************************************xxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 262.50MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[8,256,210,160]
Traceback (most recent call last):
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 182, in main
runner.train()
File "avg_runner.py", line 90, in train
self.test()
File "avg_runner.py", line 98, in test
batch, self.global_step, num_rec_out=self.num_test_rec)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 389, in test_batch
feed_dict=feed_dict)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8,256,210,160]
[[Node: generator/scale_3/calculation/convolutions_1/Conv2D_2 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](generator/scale_3/calculation/convolutions_1/Relu_1, generator/scale_3/setup/Variable_4/read)]]
Caused by op u'generator/scale_3/calculation/convolutions_1/Conv2D_2', defined at:
File "avg_runner.py", line 186, in <module>
main()
File "avg_runner.py", line 178, in main
runner = AVGRunner(num_steps, load_path, num_test_rec)
File "avg_runner.py", line 50, in __init__
c.SCALE_KERNEL_SIZES_G)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 48, in __init__
self.define_graph()
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 179, in define_graph
last_scale_pred_test)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/Adversarial_Video_Generation/Code/g_model.py", line 127, in calculate
preds, ws[i], [1, 1, 1, 1], padding=c.PADDING_G)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
data_format=data_format, name=name)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/Users/George/Downloads/Grouped/Projects/Resonate2017/workshops/ml4a/workshop/tf-venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[8,256,210,160]
[[Node: generator/scale_3/calculation/convolutions_1/Conv2D_2 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](generator/scale_3/calculation/convolutions_1/Relu_1, generator/scale_3/setup/Variable_4/read)]]
```
Is there a way to "resume" the process from just before it crashed? :D
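For reference, the failed 262.50MiB allocation is exactly the size of the tensor named in the error; a quick sketch of the arithmetic, assuming float32 values:

```python
# Size of the tensor TensorFlow could not allocate: shape [8, 256, 210, 160], float32 (4 bytes each)
num_values = 8 * 256 * 210 * 160          # 68,812,800 values
size_mib = num_values * 4 / 1024.0 ** 2   # bytes -> MiB
print(size_mib)                           # 262.5 -- matches "Ran out of memory trying to allocate 262.50MiB"
```

With only ~1.48GiB usable on the card (the `Limit` line above) and ~1.45GiB already in use, an extra 262.5MiB chunk simply can't fit.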
Hmm yeah, I trained this on 6GB GPUs, so you might need to change the batch size or some other hyperparams to get it to work on 2GB. You can load the last-saved version of your model by passing in its `.ckpt` file with the `-l` flag.
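For example, a minimal sketch of the kind of change meant here, assuming the batch-size hyperparameter lives in the constants module (`c`) that the traceback references; the exact constant names may differ:

```python
# constants.py -- illustrative only; check the repo's constants module for the real names.
# Taking the leading 8 in the failing shape as the batch dimension, halving the batch size
# roughly halves per-step activation memory: the failing [8, 256, 210, 160] float32 tensor
# (262.5 MiB) would shrink to ~131 MiB at batch 4.
BATCH_SIZE = 4   # was 8 when the OOM above occurred
```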
Thank you very much for the explanations, worked like a charm! ❤️
Hi,
I've cloned the repo and grabbed your trained models in an attempt to quickly see the demo running on my computer, but I'm getting an error and I'm not 100% sure I've understood how to place the data correctly:
I'm running the script by first `cd`ing into `Code`, then running `python avg_runner.py -l ../Models/Adversarial/model.ckpt-500000`. I've added a print statement before the error line to see what the variables hold, and it looks like the `.Clips` folder is empty: `get_train_batch c.TRAIN_DIR_CLIPS ../Data/.Clips/ c.NUM_CLIPS 0`. I've double-checked and that seems to be the case.

I feel I'm missing something: should I have downloaded the contents of the `.Clips` folder (if so, from where?), or should the `.Clips` contents be generated? How can I double-check and make sure I'm using the examples correctly?
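For reference, a minimal sanity check of the above (run from `Code/`, using the same relative path):

```python
# Count whatever has been written into the processed-clips directory so far
import os

clip_dir = '../Data/.Clips/'
print(len(os.listdir(clip_dir)))  # 0 here, so get_train_batch has nothing to sample from
```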
I am using TensorFlow version `0.12.0` with GPU support in a virtual environment, on OSX 10.11.5 with an nVidia GeForce GT 750M (2GB VRAM), CUDA 8.0 and CuDNN 5.1 installed.

The first 3 levels of the repo look like this:
Full output:
I appreciate any tips or advice you can share.
Thank you, George