commaai / research

dataset and code for 2016 paper "Learning a Driving Simulator"
BSD 3-Clause "New" or "Revised" License

Training the transition model is too resource-intensive and uses too much memory. Possible bug #27

Open kamal94 opened 8 years ago

kamal94 commented 8 years ago

After training the autoencoder, I tried to train the transition model as described in the same document.

using

./server.py --time 60 --batch 64

and

./train_generative_model.py transition --batch 64 --name transition

in two separate tmux sessions.

Within about a minute of running the training command, the process is killed because my RAM and swap (16 GB + 10 GB) are exhausted, and training is still on epoch one.

Here is a dump:

./train_generative_model.py transition --batch 64 --name transition
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7085
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.58GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
T.shape:  (64, 14, 512)
Transition variables:
transition/dreamyrnn_1_W:0
transition/dreamyrnn_1_U:0
transition/dreamyrnn_1_b:0
transition/dreamyrnn_1_V:0
transition/dreamyrnn_1_ext_b:0
Epoch 1/200
Killed
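
A quick way to see whether the generator or the model is what grows is to log resident memory inside the batch loop in train_generative_model.py. This is a minimal diagnostic sketch, not part of the repo, and it assumes psutil is installed (pip install psutil):

    import psutil

    proc = psutil.Process()

    def log_rss(tag):
        # print this process's resident set size in GB
        print("[%s] resident memory: %.2f GB" % (tag, proc.memory_info().rss / 1e9))

Calling log_rss("after next(generator)") and log_rss("after train step") around each batch brackets where the growth happens.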
EderSantana commented 8 years ago

Yes, it is very resource intensive. I have seen reports elsewhere that Keras leaks a lot of memory. I used to have a TensorFlow-only implementation that seemed lighter, but it was less convenient, which is why I opted for Keras in the release.
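
Two general workarounds from that era sometimes help; this is a sketch under assumptions, not code from this release:

    import gc
    import tensorflow as tf
    import keras.backend as K

    # let TensorFlow allocate GPU memory lazily instead of grabbing it all up front
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=config))

    # ... build and compile the model as usual ...

    # between epochs, force Python garbage collection to reclaim leaked host objects:
    # gc.collect()

Neither fixes a true leak, but allow_growth keeps TensorFlow from over-reserving the GPU, and periodic gc.collect() can slow host-side growth.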

sunny1986 commented 7 years ago

@kamal94: Were you able to resolve this issue? I am having the same problem; my training sometimes fails on epoch 1/200 or 2/200 and never gets beyond that. Any suggestions?

zhaohuaqing1993 commented 7 years ago

How did you train the autoencoder with train_generative_model.py successfully? I am running into some difficulty. Did you have to change something in the code?

pandamax commented 7 years ago

Have you solved this issue? I am having the same problem; my training sometimes fails on epoch 10/200 or 40/200 and never gets beyond that. Any suggestions?

Traceback (most recent call last):
  File "./train_generative_model.py", line 168, in <module>
    nb_epoch=args.epoch, verbose=1, saver=saver
  File "./train_generative_model.py", line 84, in train_model
    z, x = next(generator)
  File "./train_generative_model.py", line 31, in gen
    X = cleanup(tup)
  File "/home/deep-learning/research-master/models/transition.py", line 34, in cleanup
    X = X/127.5 - 1.
MemoryError
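
The failing line upcasts the frame array: dividing a uint8 array by a Python float makes NumPy allocate a float64 result, an 8x blow-up. Assuming 160x320x3 frames as in the comma.ai dataset, a batch of 64 sequences of 15 frames is roughly 147 MB as uint8 but about 1.2 GB as float64. A minimal sketch of a lower-memory cleanup (an assumption-laden rewrite, not the repo's code) normalizes in float32 with in-place ops:

    import numpy as np

    def cleanup_low_mem(X):
        # X is assumed to be a uint8 array of frames, as in models/transition.py
        X = X.astype(np.float32)  # explicit float32 instead of implicit float64
        X /= 127.5                # in-place ops avoid extra temporaries
        X -= 1.0
        return X

This roughly halves the peak footprint of the step that raised MemoryError; if that is still too much, a smaller --batch should shrink it proportionally.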