Naver-AI-Hackathon / AI-Vision


OOM error.. #181

Closed SeoGyuSik closed 5 years ago

SeoGyuSik commented 5 years ago

Information

CLI

WEB

What is your NSML login ID? SeoGyuSik

Which session had the problem? (bug message or screenshot)
Building docker image. It might take for a while ..........
Traceback (most recent call last):
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1127,64,224,224] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node block1_conv1/convolution}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](block1_conv1/convolution-0-TransposeNHWCToNCHW-LayoutOptimizer, block1_conv1/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

..Error: Fail to get prediction result: gazua/ir_ph1_v2/61/19 time="2019/01/15 21:40:15.713" level=fatal msg="Internal server error"
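The shape in the error message, [1127,64,224,224], matches the 1,127 reference images, which suggests all of them went through the first convolution in a single call. As a rough check, the sketch below estimates the size of that one float32 activation tensor and compares it with a batch of 128. The helper name `activation_bytes` is made up for illustration, and the figure covers only this single tensor, not the model's total memory use.

```python
def activation_bytes(batch, channels=64, height=224, width=224, itemsize=4):
    """Bytes for one activation tensor of shape [batch, channels, height, width].

    itemsize=4 assumes float32, as reported in the error message.
    """
    return batch * channels * height * width * itemsize


GIB = 1024 ** 3

full = activation_bytes(1127)    # all reference images in one call
batched = activation_bytes(128)  # one batch of 128 images

print(f"all 1,127 images: {full / GIB:.2f} GiB")     # 13.48 GiB
print(f"batch of 128:     {batched / GIB:.2f} GiB")  # 1.53 GiB
```

Under these assumptions the unbatched call needs about 13.5 GiB for this one tensor alone, which is consistent with the allocator running out of memory.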

How can the problem be reproduced? `def infer(queries, db):

    # Query Number: 195
    # Reference(DB) Number: 1,127
    # Total (query + reference): 1,322

    queries, query_img, references, reference_img = preprocess(queries, db)

    print('test data load queries {} query_img {} references {} reference_img {}'.
          format(len(queries), len(query_img), len(references), len(reference_img)))

    queries = np.asarray(queries)
    query_img = np.asarray(query_img)
    references = np.asarray(references)
    reference_img = np.asarray(reference_img)

    query_img = query_img.astype('float32')
    query_img /= 255
    reference_img = reference_img.astype('float32')
    reference_img /= 255

    get_feature_layer = K.function([model.layers[0].input] + [K.learning_phase()], [model.layers[-2].output])

    print('inference start')

    # inference
    query_vecs = get_feature_layer([query_img, 0])[0]

    # caching db output, db inference
    db_output = './db_infer.pkl'
    if os.path.exists(db_output):
        with open(db_output, 'rb') as f:
            reference_vecs = pickle.load(f)
    else:
        batch_size = 128
        reference_vecs = get_feature_layer([reference_img, 0])[0]
        # 512 = size of the model's output feature vector
        reference_vecs = np.zeros((reference_img.shape[0], 512), dtype=np.float32)
        endbatch = reference_img.shape[0] // batch_size
        if reference_img.shape[0] % batch_size != 0:
            endbatch += 1
        for batidx in range(endbatch):
            st = batidx * batch_size
            en = min((batidx + 1) * batch_size, reference_img.shape[0])
            xbatch = reference_img[st: en, :, :, :]
            reference_vecs[st: en, :] = get_feature_layer([xbatch, 0])[0]  
        with open(db_output, 'wb') as f:
            pickle.dump(reference_vecs, f)

    # else:
    #     reference_vecs = get_feature_layer([reference_img, 0])[0]
    #     with open(db_output, 'wb') as f:
    #         pickle.dump(reference_vecs, f)

    # l2 normalization
    query_vecs = l2_normalize(query_vecs)
    reference_vecs = l2_normalize(reference_vecs)
    print(query_vecs)
    print(reference_vecs)
    # Calculate cosine similarity
    sim_matrix = np.dot(query_vecs, reference_vecs.T)
    print(sim_matrix)
    retrieval_results = {}

    for (i, query) in enumerate(queries):
        query = query.split('/')[-1].split('.')[0]
        sim_list = zip(references, sim_matrix[i].tolist())
        sorted_sim_list = sorted(sim_list, key=lambda x: x[1], reverse=True)

        ranked_list = [k.split('/')[-1].split('.')[0] for (k, v) in sorted_sim_list]  # ranked list

        retrieval_results[query] = ranked_list
    print('done')

    return list(zip(range(len(retrieval_results)), retrieval_results.items()))`

What behavior did you expect? I thought splitting the input into batches would resolve the OOM error.

Do you have a solution to suggest? I modified the part under else:, but the competition is almost over, so it looks like I will finish without being able to submit T_T

yanghoJI commented 5 years ago

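You need to delete the line reference_vecs = get_feature_layer([reference_img, 0])[0] that sits right below batch_size in the else block; only then does the inference actually run in batches. That line was only left in so you could compare it with the part below.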
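A minimal sketch of the fix described above, with the full-array call removed so that only batch-sized slices ever reach the feature function. The names `extract_in_batches` and `fake_feature_fn` are hypothetical; the stand-in function only mimics the calling convention of `get_feature_layer` from the question (input `[batch, learning_phase]`, output a list whose first element is the feature array), so the example runs without a model or GPU.

```python
import numpy as np


def extract_in_batches(feature_fn, images, feature_dim=512, batch_size=128):
    """Run feature_fn over images in slices of at most batch_size rows."""
    n = images.shape[0]
    # np.zeros expects the shape as a single tuple.
    vecs = np.zeros((n, feature_dim), dtype=np.float32)
    for st in range(0, n, batch_size):
        en = min(st + batch_size, n)
        # Only images[st:en] is passed on, never the whole array.
        vecs[st:en] = feature_fn([images[st:en], 0])[0]
    return vecs


# Hypothetical stand-in for the Keras function; records the batch sizes it sees.
seen_batch_sizes = []


def fake_feature_fn(inputs):
    batch, _learning_phase = inputs
    seen_batch_sizes.append(len(batch))
    return [np.ones((len(batch), 512), dtype=np.float32)]


reference_img = np.zeros((300, 8, 8, 3), dtype=np.float32)  # small dummy images
reference_vecs = extract_in_batches(fake_feature_fn, reference_img)
print(reference_vecs.shape, seen_batch_sizes)  # (300, 512) [128, 128, 44]
```

In the original function, `get_feature_layer` would take the place of `fake_feature_fn`. The query images (195 in this task) could be passed through the same helper if they also turn out to be too large for a single call.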