gaussic / text-classification-cnn-rnn

CNN-RNN Chinese text classification, based on TensorFlow

Training stops after computing results for only 1 epoch #121

Open EvanHan09 opened 5 years ago

EvanHan09 commented 5 years ago

Has the author ever run into this: after starting training with `python run_cnn.py train`, it only computes results for 1 epoch and then stops training? I checked the GPU memory usage and found no sign of a memory leak. I then tried two ways of allocating GPU memory: ① allocating 40% of the memory, ② automatic on-demand allocation. The result was the same as above; in both cases training stopped after one epoch.

```
Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:11
2019-06-03 11:40:30.224462: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.89GiB
2019-06-03 11:40:30.237900: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2019-06-03 11:40:30.996786: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-03 11:40:31.005045: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971]      0
2019-06-03 11:40:31.010727: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0:   N
2019-06-03 11:40:31.015885: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2457 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Training and evaluating...
Epoch: 1
Iter:      0, Train Loss:    2.3, Train Acc:  10.94%, Val Loss:    2.3, Val Acc:  10.02%, Time: 0:00:02 *
```

Could someone help explain this?
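For reference, the two allocation strategies described above look roughly like this in TF 1.x. This is a minimal sketch assuming a plain `tf.Session`, not the exact code in run_cnn.py:

```python
import tensorflow as tf

# Option ①: cap this process at roughly 40% of the GPU memory
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4

# Option ②: grow the allocation on demand instead of reserving it up front
# config.gpu_options.allow_growth = True

session = tf.Session(config=config)
```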

EvanHan09 commented 5 years ago

While debugging I found that execution does not continue past the first line of the code below. Is this "run optimization" step selecting the model's optimization method? I'm a beginner, so my understanding may be off.

```python
session.run(model.optim, feed_dict=feed_dict)  # run one optimization step
total_batch += 1

if total_batch - last_improved > require_improvement:
    # validation accuracy has not improved for a long time: stop training early
    print("No optimization for a long time, auto-stopping...")
    flag = True
    break  # exit the loop
```
gaussic commented 5 years ago

If you comment out that block, training won't stop early.
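Concretely, that means disabling the early-stop check at the bottom of the training loop. A rough sketch of the snippet above with that block commented out (the rest of run_cnn.py is untouched):

```python
session.run(model.optim, feed_dict=feed_dict)  # run one optimization step
total_batch += 1

# Early-stop check disabled: training now always runs for the configured
# number of epochs instead of stopping when validation accuracy stalls.
# if total_batch - last_improved > require_improvement:
#     print("No optimization for a long time, auto-stopping...")
#     flag = True
#     break
```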

EvanHan09 commented 5 years ago

> If you comment out that block, training won't stop early.

Yep, I solved it later. The cause turned out to be a configuration problem: after updating the CUDA driver to 10.0 and installing the matching tensorflow==1.12.0, training runs normally. There is still one small issue, though: at runtime it frequently fails to initialize with the following error:

```
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node Conv2D (defined at <ipython-input-1-1eec26e598ba>:22) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable_1/read)]]
```
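For what it's worth, on RTX-series cards with TensorFlow 1.12 this cuDNN initialization failure is commonly worked around by enabling on-demand GPU memory growth before the first convolution runs. A minimal sketch of that workaround (an assumption about the cause, not a guaranteed fix):

```python
import tensorflow as tf

# Commonly cited workaround for "Failed to get convolution algorithm /
# cuDNN failed to initialize" on RTX 20xx GPUs: allocate GPU memory
# on demand instead of reserving almost all of it at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
```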

gaussic commented 5 years ago

I haven't run into that problem myself.

fanruifeng commented 5 years ago

> > If you comment out that block, training won't stop early.
>
> Yep, I solved it later. The cause turned out to be a configuration problem: after updating the CUDA driver to 10.0 and installing the matching tensorflow==1.12.0, training runs normally. There is still one small issue, though: at runtime it frequently fails to initialize with `UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.` (node Conv2D, as quoted above).

I'm running into this problem now as well. Did you manage to solve it?