devsisters / DQN-tensorflow

Tensorflow implementation of Human-Level Control through Deep Reinforcement Learning
MIT License

Segmentation fault (core dumped) | MemoryError #1

Closed LecJackS closed 8 years ago

LecJackS commented 8 years ago

"Segmentation fault (core dumped)" while trying to run it.

I have no GPU configured with tensorflow, and I suspect that's the reason. Is there any way to make it work with just the CPU?

I tried a couple of flags, but they didn't work:
python main.py --env_name=Breakout-v0 --is_train=True --display=True --cpu=True

carpedm20 commented 8 years ago

I seriously don't recommend training this on a CPU, but python main.py --env_name=Breakout-v0 --is_train=True --cpu=True is the right command for CPU training. When does the segmentation fault happen?

LecJackS commented 8 years ago

It happens at the very beginning:

:~/Desktop/DQN-tensorflow$ python main.py --env_name=Breakout-v0 --is_train=True --cpu=True
Segmentation fault (core dumped)

It doesn't print anything else.

My GPU died, so now I can only experiment with my CPU (and a lot of patience).

carpedm20 commented 8 years ago

Strange. I've never seen this error before. I tested this code on an iMac, which doesn't have a CUDA-capable GPU, and no segmentation fault occurred.

YigitDemirag commented 8 years ago

Same here. I get this error on my GPU:

python main.py --env_name=Breakout-v0 --is_train=True
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:99] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: :/usr/local/cuda/lib64
I tensorflow/stream_executor/cuda/cuda_dnn.cc:1562] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
libdc1394 error: Failed to initialize libdc1394
 [*] GPU : 0.5000
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
[2016-05-21 13:06:01,227] Making new env: Breakout-v0
{'_save_step': 50000,
 '_test_step': 10000,
 'action_repeat': 4,
 'backend': 'tf',
 'batch_size': 32,
 'cnn_format': 'NCHW',
 'discount': 0.99,
 'display': False,
 'env_name': 'Breakout-v0',
 'env_type': 'simple',
 'ep_end': 0.1,
 'ep_end_t': 1000000,
 'ep_start': 1.0,
 'history_length': 4,
 'learn_start': 50000.0,
 'learning_rate': 0.00025,
 'max_delta': 1,
 'max_reward': 1.0,
 'max_step': 50000000,
 'memory_size': 1000000,
 'min_delta': -1,
 'min_reward': -1.0,
 'model': 'm2',
 'random_start': 30,
 'scale': 10000,
 'screen_height': 84,
 'screen_width': 84,
 'target_q_update_step': 10000,
 'train_frequency': 4}
 [*] Loading checkpoints...
 [*] Load SUCCESS: checkpoints/Breakout-v0/min_delta--1/max_delta-1/history_length-4/train_frequency-4/target_q_update_step-10000/memory_size-1000000/action_repeat-4/ep_end_t-1000000/backend-tf/random_start-30/scale-10000/env_type-simple/min_reward--1.0/ep_start-1.0/screen_width-84/learn_start-50000.0/cnn_format-NCHW/learning_rate-0.00025/batch_size-32/discount-0.99/max_reward-1.0/max_step-50000000/env_name-Breakout-v0/ep_end-0.1/model-m2/screen_height-84/-16350000
 49%|██████████████ | 16350000/33650000 [00:00<?, ?it/s]
F tensorflow/stream_executor/cuda/cuda_dnn.cc:204] could not find cudnnCreate in cudnn DSO; dlerror: /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cudnnCreate
Aborted (core dumped)

EDIT: @LecJackS Installing cuDNN v4.0 with CUDA 7.5 solved my problem.
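
The "Couldn't open CUDA library libcudnn.so" line in the log above means the dynamic loader could not find that library on its search path. A minimal sketch for checking this independently of TensorFlow, using only the standard-library ctypes module, is shown here. The helper name can_load_shared_library is hypothetical, and this only tests whether the loader can open the file; it is not how TensorFlow itself loads cuDNN, and it does not verify the library version.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # Library names taken from the log above.
    for lib in ("libcudnn.so", "libcublas.so"):
        status = "found" if can_load_shared_library(lib) else "NOT found"
        print("{}: {}".format(lib, status))
```

If libcudnn.so is reported as not found while the other CUDA libraries load, the cuDNN install location is probably missing from LD_LIBRARY_PATH.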

LecJackS commented 8 years ago

I installed CUDA and cuDNN as @YigitDemirag suggested, but it still didn't work.

I then tried to import tensorflow in another project, and that dumped core too.

So I formatted the disk and reinstalled Ubuntu. Now it "works", or is close to working:


$ python main.py --env_name=Breakout-v0 --is_train=True --display=True --cpu=True
 [*] GPU : 0.5000
[2016-05-23 23:34:14,279] Making new env: Breakout-v0
{'_save_step': 50000,
 '_test_step': 10000,
 'action_repeat': 4,
 'backend': 'tf',
 'batch_size': 32,
 'cnn_format': 'NHWC',
 'discount': 0.99,
 'display': True,
 'env_name': 'Breakout-v0',
 'env_type': 'simple',
 'ep_end': 0.1,
 'ep_end_t': 1000000,
 'ep_start': 1.0,
 'history_length': 4,
 'learn_start': 50000.0,
 'learning_rate': 0.00025,
 'max_delta': 1,
 'max_reward': 1.0,
 'max_step': 50000000,
 'memory_size': 1000000,
 'min_delta': -1,
 'min_reward': -1.0,
 'model': 'm2',
 'random_start': 30,
 'scale': 10000,
 'screen_height': 84,
 'screen_width': 84,
 'target_q_update_step': 10000,
 'train_frequency': 4}
Traceback (most recent call last):
  File "main.py", line 63, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "main.py", line 50, in main
    agent = Agent(config, env, sess)
  File "/home/jacks/Desktop/DQN-tensorflow/dqn/agent.py", line 22, in __init__
    self.memory = ReplayMemory(self.config, self.model_dir)
  File "/home/jacks/Desktop/DQN-tensorflow/dqn/replay_memory.py", line 18, in __init__
    self.screens = np.empty((self.memory_size, config.screen_height, config.screen_width), dtype = np.float16)
MemoryError

serialx commented 8 years ago

@LecJackS Please verify that your Tensorflow is properly installed. Run the command below and see if it works:

python /usr/local/lib/python2.7/dist-packages/tensorflow/models/image/mnist/convolutional.py

It should print something like this:

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 5.8 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 385.9 ms
Minibatch loss: 3.279, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 6.8%

LecJackS commented 8 years ago

@serialx, yep, it works:

$ python /usr/local/lib/python2.7/dist-packages/tensorflow/models/image/mnist/convolutional.py

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 7.0 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 394.0 ms
Minibatch loss: 3.289, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.0%
Step 200 (epoch 0.23), 382.6 ms
Minibatch loss: 3.427, learning rate: 0.010000
Minibatch error: 10.9%
Validation error: 3.6%
Step 3000 (epoch 3.49), 422.1 ms
Minibatch loss: 2.400, learning rate: 0.008574
Minibatch error: 1.6%
Validation error: 1.0%
Step 3100 (epoch 3.61), 401.1 ms
Minibatch loss: 2.396, learning rate: 0.008574
Minibatch error: 3.1%
Validation error: 0.9%

I have the CPU ONLY version of TF installed.

serialx commented 8 years ago

Then I think the only remaining explanation is that you ran out of memory. How much memory does your machine have? This program requires at least 3GB of RAM.

LecJackS commented 8 years ago

I have 8GB, but yes, that was the problem.

In config.py, I reduced the memory_size multiplier from 100 to 10, and it started to work.

Then I increased it gradually to find the maximum I could run (in my case, 60).

memory_size = 60 * scale

So that's it. Solved.
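
As a rough check on these numbers, the size of the screens buffer from the traceback can be estimated from its shape and dtype alone. The sketch below assumes only the np.empty call shown in the traceback (shape memory_size x 84 x 84, dtype float16, which takes 2 bytes per element); the function names are hypothetical, and the estimate ignores everything else the program allocates.

```python
BYTES_PER_FLOAT16 = 2  # np.float16 uses 2 bytes per element
GIB = 2 ** 30


def replay_buffer_bytes(memory_size, screen_height=84, screen_width=84,
                        itemsize=BYTES_PER_FLOAT16):
    """Bytes needed for an array of shape (memory_size, height, width)."""
    return memory_size * screen_height * screen_width * itemsize


def max_memory_size(budget_bytes, screen_height=84, screen_width=84,
                    itemsize=BYTES_PER_FLOAT16):
    """Largest memory_size whose screen buffer fits in budget_bytes."""
    per_frame = screen_height * screen_width * itemsize
    return budget_bytes // per_frame


if __name__ == "__main__":
    scale = 10000
    for multiplier in (100, 60, 10):
        size = replay_buffer_bytes(multiplier * scale)
        print("memory_size = {:3d} * scale: {:.2f} GiB".format(
            multiplier, size / float(GIB)))
    print("frames that fit in 8 GiB:", max_memory_size(8 * GIB))
```

With the default memory_size of 1000000 the buffer alone needs about 13.1 GiB, while 60 * scale needs about 7.9 GiB, which is consistent with 60 being roughly the limit on an 8GB machine. How close to that limit an allocation can actually get depends on the operating system and on what else is running.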

MarCnu commented 8 years ago

Same thing here: I have a laptop without a GPU and with 8GB of RAM, and I got the error too. Changing the memory_size to 60 solved the issue. Also, hiding the display reduces memory use, but then you cannot see the sweet progress.

carpedm20 commented 8 years ago

@LecJackS @MarCnu Thanks for sharing useful information about CPU-only machines. @serialx found this, but I highly recommend that you increase learning_rate from 0.00025 to 0.0025, or re-clone this project to use the new exponential learning rate decay. This will reduce the training time drastically.
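
For readers unfamiliar with the term, exponential learning rate decay can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the project's implementation: the function name, decay_rate, decay_steps, and the use of 0.00025 as a floor are all assumptions chosen for the example; only the two learning rates come from the comment above.

```python
def decayed_learning_rate(step, initial_rate=0.0025, minimum_rate=0.00025,
                          decay_rate=0.96, decay_steps=50000):
    """Exponentially decay the learning rate, never going below minimum_rate.

    decay_rate, decay_steps, and the floor are hypothetical example values.
    """
    decayed = initial_rate * decay_rate ** (step / float(decay_steps))
    return max(minimum_rate, decayed)


if __name__ == "__main__":
    for step in (0, 50000, 1000000, 5000000):
        print("step {:>8}: {:.6f}".format(step, decayed_learning_rate(step)))
```

The idea is to start with the larger rate so early training moves quickly, then shrink it smoothly toward the smaller rate as training proceeds.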

LecJackS commented 8 years ago

@carpedm20 Thanks, changed to 0.0025.

I'm closing this issue because it's solved.

absudabsu commented 8 years ago

Hello. I was just wondering how the issue was resolved. Are you simply finding the upper bound on memory?

I am not working with DQN-tensorflow, but I get a similar issue with plain tensorflow when I load too many models into memory (the batch size for training the network). The models occupy about 600MB of memory but cause tensorflow to crash (core dumped), so instead I run with a batch size occupying about 400MB. I have 32GB and the total usage never exceeds 4GB, so I am confused and think it's a tensorflow issue, but I could be wrong.