caicloud / tensorflow-tutorial

Example TensorFlow codes and Caicloud TensorFlow as a Service dev environment.

Chapter 5: error when mnist_train.py and mnist_eval.py are started at the same time #59

Closed saselovejulie closed 7 years ago

saselovejulie commented 7 years ago

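Chapter 5 covers saving the training results, with mnist_eval loading those saved results to run validation on the test set. If I start the two scripts at the same time, though, I get an error. Starting either one on its own works fine. Is it because my machine has only one GPU, so only one of them can run at a time? Thanks.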

saselovejulie commented 7 years ago

@perhapszzy Could you help answer this if you have time? Much appreciated!

ScorpioCPH commented 7 years ago

What error do you get? Do you have a detailed log?

ScorpioCPH commented 7 years ago

@saselovejulie

perhapszzy commented 7 years ago

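If you are using the GPU build of TensorFlow, this will raise an error. There is currently no particularly good way to avoid it; one workable approach is to use docker.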

saselovejulie commented 7 years ago

@ScorpioCPH Here is the error message:

Traceback (most recent call last):
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
    return fn(*args)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
    status, run_metadata)
  File "E:\tools\Python35\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(5000, 784), b.shape=(784, 500), m=5000, n=500, k=784
  [[Node: layer1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_x-input_0_0/_11, layer1/weights/read)]]
  [[Node: Mean/_13 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_35_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/wrokspace/self-project/TensorFlowDemo/mnist/mnist_optimize/mnist_eval.py", line 56, in <module>
    tf.app.run()
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/wrokspace/self-project/TensorFlowDemo/mnist/mnist_optimize/mnist_eval.py", line 53, in main
    evaluate(mnist)
  File "E:/wrokspace/self-project/TensorFlowDemo/mnist/mnist_optimize/mnist_eval.py", line 44, in evaluate
    accuracy_score = sess.run(accuracy, feed_dict=validate_feed)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
    run_metadata_ptr)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(5000, 784), b.shape=(784, 500), m=5000, n=500, k=784
  [[Node: layer1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_x-input_0_0/_11, layer1/weights/read)]]
  [[Node: Mean/_13 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_35_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'layer1/MatMul', defined at:
  File "E:/wrokspace/self-project/TensorFlowDemo/mnist/mnist_optimize/mnist_eval.py", line 56, in <module>
    tf.app.run()
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/wrokspace/self-project/TensorFlowDemo/mnist/mnist_optimize/mnist_eval.py", line 53, in main
    evaluate(mnist)
  File "E:/wrokspace/self-project/TensorFlowDemo/mnist/mnist_optimize/mnist_eval.py", line 23, in evaluate
    y = mnist_inference.inference(x, None)
  File "E:\wrokspace\self-project\TensorFlowDemo\mnist\mnist_optimize\mnist_inference.py", line 48, in inference
    layer1 = tf.nn.relu(tf.matmul(input_tensor, weights) + biases)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1816, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 1217, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "E:\tools\Python35\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(5000, 784), b.shape=(784, 500), m=5000, n=500, k=784
  [[Node: layer1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_x-input_0_0/_11, layer1/weights/read)]]
  [[Node: Mean/_13 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_35_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

saselovejulie commented 7 years ago

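@perhapszzy I looked into it and the problem is on my side: starting two processes at the same time on a single GPU causes trouble. Thanks for the suggestion; I'll try docker when I get a chance to switch to Linux.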
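
For readers who hit the same `Blas GEMM launch failed` error, here is a minimal sketch of one possible workaround that was not discussed in this thread: keep training on the GPU and start the evaluation script in a process that cannot see any GPU, so the two processes no longer compete for the same device. It relies on the `CUDA_VISIBLE_DEVICES` environment variable read by the CUDA runtime. The helper names `cpu_only_env` and `launch_eval_on_cpu` are hypothetical, and running `mnist_eval.py` from the current directory is an assumption.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden.

    Hypothetical helper, not part of the tutorial code.
    """
    env = dict(os.environ if base_env is None else base_env)
    # "-1" matches no CUDA device index, so a process started with this
    # environment sees no GPU and TensorFlow places its ops on the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def launch_eval_on_cpu(script="mnist_eval.py"):
    """Start the evaluation script as a separate process with no GPU visible.

    Assumes the script path is valid relative to the current directory.
    """
    return subprocess.Popen([sys.executable, script], env=cpu_only_env())


# Usage (assumes mnist_train.py is already running on the GPU elsewhere):
#   proc = launch_eval_on_cpu()
#   proc.wait()
```

The variable has to be in place before TensorFlow initializes CUDA, which is why the sketch sets it for a freshly started process instead of changing it inside a running one. Evaluation on the CPU will be slower than on the GPU, which may be acceptable for a model of this size.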