ResourceExhaustedError on nsml submit

sunnys-lab commented 5 years ago

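Testing with -t runs without any error, but submit fails with the error below.

Errors similar to the one below seem to have come up a lot. Is this the same cause? The Blog says a similar error was fixed on Friday, so I'm also wondering whether this is the same problem.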

nsml submit Sunny/ir_ph1_v2/71 250

Building docker image. It might take for a while

.Traceback (most recent call last):
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1127,32,224,224] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node batch_normalization_3/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_3/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](concatenate_2/concat, batch_normalization_3/gamma/read, batch_normalization_3/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node global_average_pooling2d_1/Mean/_785}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_832_global_average_pooling2d_1/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

..Error: Fail to get prediction result: Sunny/ir_ph1_v2/71/250
time="2019/01/06 12:20:55.793" level=fatal msg="Internal server error"

Hackoperation commented 5 years ago

Hello.

Please take a look at https://github.com/AiHackathon2018/AI-Vision/issues/105 for reference.

@sunnys-lab, your full error message is as follows:

  File "main.py", line 73, in infer
    reference_vecs = get_feature_layer([reference_img, 0])[0]
  File "/opt/conda/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/opt/conda/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1399, in __call__
    run_metadata_ptr)
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 526, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1127,32,224,224] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node batch_normalization_3/FusedBatchNorm}} = FusedBatchNorm[T=DT_FLOAT, _class=["loc:@batch_normalization_3/cond/Switch_1"], data_format="NCHW", epsilon=0.001, is_training=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](concatenate_2/concat, batch_normalization_3/gamma/read, batch_normalization_3/beta/read, batch_normalization_1/Const_4, batch_normalization_1/Const_4)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node global_average_pooling2d_1/Mean/_785}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_832_global_average_pooling2d_1/Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
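
The shape in the error, [1127,32,224,224], suggests that all 1127 reference images went through the network in a single call at line 73 of main.py. As a rough sketch (not taken from this thread), the snippet below estimates how much memory that one float tensor needs and shows one common way to lower the peak: calling the feature function on smaller chunks. `tensor_bytes` and `extract_in_batches` are hypothetical helper names, the batch size of 64 is an arbitrary assumption, and `feature_fn` is assumed to behave like the `get_feature_layer` call shown in the traceback.

```python
import numpy as np


def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor (float32 uses 4 bytes per element)."""
    return int(np.prod(shape, dtype=np.int64)) * bytes_per_element


def extract_in_batches(feature_fn, images, batch_size=64):
    """Call feature_fn on chunks of images and concatenate the features.

    feature_fn is assumed to accept [batch, learning_phase] and return a
    list whose first item is the feature array, as in the traceback's
    get_feature_layer([reference_img, 0])[0].
    """
    chunks = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # learning_phase = 0 requests inference mode, as in the original call
        chunks.append(feature_fn([batch, 0])[0])
    return np.concatenate(chunks, axis=0)


# Size of the single activation tensor named in the OOM message
oom_bytes = tensor_bytes((1127, 32, 224, 224))
print(oom_bytes / 2**30)  # roughly 6.7 GiB for this one tensor
```

Under these assumptions, line 73 would become something like `reference_vecs = extract_in_batches(get_feature_layer, reference_img)`. This only gives the same result if the features of one image do not depend on the other images in the batch.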

sunnys-lab commented 5 years ago

@Hackoperation Thank you. ^^

Naver-AI-Hackathon / AI-Vision

ResourceExhaustedError on nsml submit #107