NELSONZHAO / zhihu

This repo contains the source code for my personal column (https://zhuanlan.zhihu.com/zhaoyeyu), implemented with Python 3.6. It covers Natural Language Processing and Computer Vision projects such as text generation, machine translation, and deep convolutional GANs, with hands-on example code.

Problems running with the TensorFlow GPU version #10

Open west410 opened 6 years ago

west410 commented 6 years ago

Here's my code:

    import numpy as np
    import tensorflow as tf

    from sklearn.preprocessing import LabelBinarizer

    n_class = 10  # 10 classes in total
    lb = LabelBinarizer().fit(np.array(range(n_class)))
    y_train = lb.transform(y_train)
    y_test = lb.transform(y_test)

    from sklearn.model_selection import train_test_split

    train_ratio = 0.8
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train,
                                                      train_size=train_ratio,
                                                      random_state=123)

    img_shape = x_train.shape
    keep_prob = 0.6
    epochs = 5
    batch_size = 64

    inputs_ = tf.placeholder(tf.float32, [None, 32, 32, 3], name='inputs_')
    targets_ = tf.placeholder(tf.float32, [None, n_class], name='targets_')

This is the warning produced when running this code:

    D:\Program Files\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2010: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
      FutureWarning)
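As the warning says, it goes away once both split sizes are given explicitly; a minimal sketch (the 0.2 test fraction is simply the complement of train_ratio, assumed here for illustration only):

    from sklearn.model_selection import train_test_split

    # Passing both train_size and test_size avoids the FutureWarning about the
    # default complement behaviour changing in scikit-learn 0.21.
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train,
                                                      train_size=0.8,
                                                      test_size=0.2,
                                                      random_state=123)

The split itself is unchanged; only the warning disappears.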

    # First convolutional layer + pooling
    # 32 x 32 x 3 to 32 x 32 x 64
    conv1 = tf.layers.conv2d(inputs_, 64, (2, 2), padding='same', activation=tf.nn.relu,
                             kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    # 32 x 32 x 64 to 16 x 16 x 64
    conv1 = tf.layers.max_pooling2d(conv1, (2, 2), (2, 2), padding='same')

    # Second convolutional layer + pooling
    # 16 x 16 x 64 to 16 x 16 x 128
    conv2 = tf.layers.conv2d(conv1, 128, (4, 4), padding='same', activation=tf.nn.relu,
                             kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    # 16 x 16 x 128 to 8 x 8 x 128
    conv2 = tf.layers.max_pooling2d(conv2, (2, 2), (2, 2), padding='same')

    # Reshape the output
    shape = np.prod(conv2.get_shape().as_list()[1:])
    conv2 = tf.reshape(conv2, [-1, shape])

    # First fully connected layer
    # 8 x 8 x 128 to 1 x 1024
    fc1 = tf.contrib.layers.fully_connected(conv2, 1024, activation_fn=tf.nn.relu)
    fc1 = tf.nn.dropout(fc1, keep_prob)

    # Second fully connected layer
    # 1 x 1024 to 1 x 512
    fc2 = tf.contrib.layers.fully_connected(fc1, 512, activation_fn=tf.nn.relu)

    # Logits layer
    # 1 x 512 to 1 x 10
    logits_ = tf.contrib.layers.fully_connected(fc2, 10, activation_fn=None)
    logits_ = tf.identity(logits_, name='logits_')

    # cost & optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits_, labels=targets_))
    optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)

    # accuracy
    correct_pred = tf.equal(tf.argmax(logits_, 1), tf.argmax(targets_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')

    import time

    save_model_path = './test_cifar'
    count = 0
    start = time.time()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(epochs):
            for batch_i in range(img_shape[0] // batch_size - 1):
                feature_batch = x_train[batch_i * batch_size: (batch_i + 1) * batch_size]
                label_batch = y_train[batch_i * batch_size: (batch_i + 1) * batch_size]
                train_loss, _ = sess.run([cost, optimizer],
                                         feed_dict={inputs_: feature_batch,
                                                    targets_: label_batch})

                val_acc = sess.run(accuracy,
                                   feed_dict={inputs_: x_val,
                                              targets_: y_val})

                if (count % 100 == 0):
                    print('Epoch {:>2}, Train Loss {:.4f}, Validation Accuracy {:4f} '.format(epoch + 1, train_loss, val_acc))
                count += 1

    end = time.time()
    elapsed = end - start
    print("Time taken: ", elapsed, "seconds.")
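One aside on the snippet above: save_model_path is defined but never used. If saving the trained network was the intent, a minimal TF 1.x sketch would look like the following (my own illustrative addition, not part of the original code):

    # Hypothetical follow-up: persist the trained graph variables with a Saver.
    saver = tf.train.Saver()
    # ... then, inside the `with tf.Session() as sess:` block once training is done:
    saver.save(sess, save_model_path)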

This is the error from running this code:

    ResourceExhaustedError                    Traceback (most recent call last)
    D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
       1322     try:
    -> 1323       return fn(*args)
       1324     except errors.OpError as e:

    D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
       1301                                    feed_dict, fetch_list, target_list,
    -> 1302                                    status, run_metadata)
       1303

    D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
        472             compat.as_text(c_api.TF_Message(self.status.status)),
    --> 473             c_api.TF_GetCode(self.status.status))
        474     # Delete the underlying status object from memory otherwise it stays alive

    ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,32,64]
        [[Node: conv2d_13/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_inputs__2_0_0/_15, conv2d_12/kernel/read)]]
        [[Node: accuracy_6/_17 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_81_accuracy_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

    ResourceExhaustedError                    Traceback (most recent call last)

    in <module>()
         54         val_acc = sess.run(accuracy,
         55                            feed_dict={inputs_: x_val,
    ---> 56                                       targets_: y_val})
         57
         58         if(count%100==0):

    D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
        887     try:
        888       result = self._run(None, fetches, feed_dict, options_ptr,
    --> 889                          run_metadata_ptr)
        890       if run_metadata:
        891         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

    D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
       1118     if final_fetches or final_targets or (handle and feed_dict_tensor):
       1119       results = self._do_run(handle, final_targets, final_fetches,
    -> 1120                              feed_dict_tensor, options, run_metadata)
       1121     else:
       1122       results = []

    D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
       1315     if handle is None:
       1316       return self._do_call(_run_fn, self._session, feeds, fetches, targets,
    -> 1317                            options, run_metadata)
       1318     else:
       1319       return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

    D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
       1334     except KeyError:
       1335       pass
    -> 1336       raise type(e)(node_def, op, message)
       1337
       1338   def _extend_graph(self):

    ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,32,64]
        [[Node: conv2d_13/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_inputs__2_0_0/_15, conv2d_12/kernel/read)]]
        [[Node: accuracy_6/_17 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_81_accuracy_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

    Caused by op 'conv2d_13/Conv2D', defined at:
      File "D:\Program Files\Anaconda3\lib\runpy.py", line 184, in _run_module_as_main
        "__main__", mod_spec)
      File "D:\Program Files\Anaconda3\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "D:\Program Files\Anaconda3\lib\site-packages\ipykernel\__main__.py", line 3, in <module>
        app.launch_new_instance()
      File "D:\Program Files\Anaconda3\lib\site-packages\traitlets\config\application.py", line 653, in launch_instance
        app.start()
      File "D:\Program Files\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 474, in start
        ioloop.IOLoop.instance().start()
      File "D:\Program Files\Anaconda3\lib\site-packages\zmq\eventloop\ioloop.py", line 162, in start
        super(ZMQIOLoop, self).start()
      File "D:\Program Files\Anaconda3\lib\site-packages\tornado\ioloop.py", line 887, in start
        handler_func(fd_obj, events)
      File "D:\Program Files\Anaconda3\lib\site-packages\tornado\stack_context.py", line 275, in null_wrapper
        return fn(*args, **kwargs)
      File "D:\Program Files\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
        self._handle_recv()
      File "D:\Program Files\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
        self._run_callback(callback, msg)
      File "D:\Program Files\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
        callback(*args, **kwargs)
      File "D:\Program Files\Anaconda3\lib\site-packages\tornado\stack_context.py", line 275, in null_wrapper
        return fn(*args, **kwargs)
      File "D:\Program Files\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 276, in dispatcher
        return self.dispatch_shell(stream, msg)
      File "D:\Program Files\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 228, in dispatch_shell
        handler(stream, idents, msg)
      File "D:\Program Files\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 390, in execute_request
        user_expressions, allow_stdin)
      File "D:\Program Files\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
        res = shell.run_cell(code, store_history=store_history, silent=silent)
      File "D:\Program Files\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 501, in run_cell
        return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
      File "D:\Program Files\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2717, in run_cell
        interactivity=interactivity, compiler=compiler, result=result)
      File "D:\Program Files\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2821, in run_ast_nodes
        if self.run_code(code, result):
      File "D:\Program Files\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "", line 4, in <module>
        kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\layers\convolutional.py", line 608, in conv2d
        return layer.apply(inputs)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\layers\base.py", line 671, in apply
        return self.__call__(inputs, *args, **kwargs)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\layers\base.py", line 575, in __call__
        outputs = self.call(inputs, *args, **kwargs)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\layers\convolutional.py", line 167, in call
        outputs = self._convolution_op(inputs, self.kernel)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 835, in __call__
        return self.conv_op(inp, filter)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 499, in __call__
        return self.call(inp, filter)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 187, in __call__
        name=self.name)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 630, in conv2d
        data_format=data_format, name=name)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
        op_def=op_def)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
        op_def=op_def)
      File "D:\Program Files\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
        self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

    ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,32,32,64]
        [[Node: conv2d_13/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_inputs__2_0_0/_15, conv2d_12/kernel/read)]]
        [[Node: accuracy_6/_17 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_81_accuracy_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

How should I correct this? Thanks for your help!