capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

MonoBERT MSMARCO error #173

Closed: d1shs0ap closed this issue 3 years ago

d1shs0ap commented 3 years ago

I got the following error when running sbatch docs/setup/scripts/sample_slurm_script.sh, the same script as https://github.com/capreolus-ir/capreolus/blob/feature/msmarco_psg/docs/setup/scripts/sample_slurm_script.sh:

Limit:                     32541507584
InUse:                     32515384576
MaxInUse:                  32515417344
NumAllocs:                        5196
MaxAllocSize:                219477248
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2021-08-12 10:00:19.444772: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ****************************************************************************************************
2021-08-12 10:00:19.449786: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at softmax_op_gpu.cu.cc:217 : Resource exhausted: OOM when allocating tensor with shape[16,12,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.7.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.7.9/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/d1shs0ap/capreolus/capreolus/run.py", line 108, in <module>
    task_entry_function()
  File "/scratch/d1shs0ap/capreolus/capreolus/task/rerank.py", line 48, in train
    return self.rerank_run(best_search_run, self.get_results_path())
  File "/scratch/d1shs0ap/capreolus/capreolus/task/rerank.py", line 95, in rerank_run
    self.benchmark.relevance_level,
  File "/scratch/d1shs0ap/capreolus/capreolus/trainer/tensorflow.py", line 262, in train
    total_loss += distributed_train_step(x)
  File "/home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3024, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1961, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[16,12,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node keras_triplet_model/tf_bert_for_sequence_classification_1/bert/encoder/layer_._11/attention/self/Softmax_1 (defined at home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py:268) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_train_step_156578155]

Errors may have originated from an input operation.
Input Source operations connected to node keras_triplet_model/tf_bert_for_sequence_classification_1/bert/encoder/layer_._11/attention/self/Softmax_1:
 keras_triplet_model/tf_bert_for_sequence_classification_1/bert/encoder/layer_._11/attention/self/Add_1 (defined at home/d1shs0ap/capreolus-env/lib/python3.7/site-packages/transformers/models/bert/modeling_tf_bert.py:265)

Function call stack:
distributed_train_step

Is this because not enough memory was allocated to the job?
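
For reference, the tensor that fails to allocate is a single BERT self-attention score map, and its size can be checked against the allocator stats above (a rough back-of-the-envelope check in shell arithmetic, not output from the run):

echo $((16 * 12 * 512 * 512 * 4)) bytes           # [batch, heads, seq, seq] x 4 bytes/float32 = 201326592 (~192 MiB)
echo $((32515384576 / 1000 / 1000 / 1000)) GB     # InUse from the log: the ~32 GB card is already essentially full

So the single failing allocation is modest (~192 MiB); the OOM seems to come from the GPU already being nearly full with the batch's other activations, not from one oversized tensor.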

crystina-z commented 3 years ago

Oh, I was using the basic config in the sample Slurm script, sorry for the confusion. You can allocate more GPUs by changing the number in this line.
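
For example (a sketch only; the exact SBATCH resource string and valid values depend on the cluster, so these numbers are placeholders rather than the script's actual contents):

# Request two GPUs instead of one; adjust the count/type to what the cluster offers.
#SBATCH --gres=gpu:2
# (If adding GPUs is not an option, lowering the trainer's batch size in the
#  capreolus config is another way to avoid the OOM.)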