VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
13.97k stars · 707 forks

INFERENCE_RAM setting not preventing CUDA OOM #189

Closed zero1zero closed 3 weeks ago

zero1zero commented 3 weeks ago

I'm attempting to use marker as part of a Spark job and am having trouble preventing CUDA OOM errors. Here is my code:

```python
import os

# Imports per the marker API used below (module paths assumed from the
# marker version in the traceback)
from marker.convert import convert_single_pdf
from marker.models import load_all_models
from marker.output import save_markdown
from marker.settings import settings

model_list = None

def convert_markdown(fname: str, output: str):
    global model_list
    if model_list is None:
        settings.INFERENCE_RAM = 15
        model_list = load_all_models()

        for model in model_list:
            if model is None:
                continue
            model.share_memory()

    full_text, images, out_meta = convert_single_pdf(
        fname, model_list, max_pages=None, langs=None,
        batch_multiplier=1, start_page=None
    )

    fname = os.path.basename(fname)
    return save_markdown(output, fname, full_text, images, out_meta)
```

`convert_markdown` runs multiple times in a single thread, with `load_all_models()` executing only once. Should marker respect `INFERENCE_RAM`, or are there other settings I need to adjust to keep it within my VRAM limits?

For context, I'm executing in Google Colab using a T4 GPU.
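As a sketch of what I mean by pinning the limit: marker's settings appear to be overridable via environment variables set before import. `INFERENCE_RAM` is the setting name from marker's settings module; whether the environment route is honored this way is an assumption about this marker version.

```python
import os

# Assumed sketch: set marker's VRAM budget via the environment *before*
# importing marker, so the settings object picks it up at import time.
# INFERENCE_RAM is the setting name from marker/settings.py; treating it
# as an env override is an assumption about this marker version.
os.environ["INFERENCE_RAM"] = "15"  # T4 has ~16 GB; leave some headroom

# from marker.models import load_all_models  # import only after the env is set
```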

Exception is below:

```
  File "/usr/local/lib/python3.10/dist-packages/marker/convert.py", line 90, in convert_single_pdf
    pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier)
  File "/usr/local/lib/python3.10/dist-packages/marker/ocr/recognition.py", line 51, in run_ocr
    new_pages = surya_recognition(doc, ocr_idxs, langs, rec_model, pages, batch_multiplier=batch_multiplier)
  File "/usr/local/lib/python3.10/dist-packages/marker/ocr/recognition.py", line 76, in surya_recognition
    results = run_recognition(images, surya_langs, rec_model, processor, polygons=polygons, batch_size=int(get_batch_size() * batch_multiplier))
  File "/usr/local/lib/python3.10/dist-packages/surya/ocr.py", line 30, in run_recognition
    rec_predictions, _ = batch_recognition(all_slices, all_langs, rec_model, rec_processor, batch_size=batch_size)
  File "/usr/local/lib/python3.10/dist-packages/surya/recognition.py", line 138, in batch_recognition
    return_dict = model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py", line 587, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/surya/model/recognition/encoder.py", line 439, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/surya/model/recognition/encoder.py", line 350, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/surya/model/recognition/encoder.py", line 270, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/surya/model/recognition/encoder.py", line 201, in forward
    attention_outputs = self.attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/donut/modeling_donut_swin.py", line 477, in forward
    self_outputs = self.self(hidden_states, attention_mask, head_mask, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/donut/modeling_donut_swin.py", line 388, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 
```
aniketinamdar commented 3 weeks ago

What if you manually override `batch_multiplier` and reduce it so that it stays within your VRAM limits? This might slow things down, but it should keep you under the limit.
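From the traceback, the recognition batch size is derived as a base batch size times `batch_multiplier`, so the relationship can be sketched like this (an illustrative helper, not marker's actual code):

```python
def effective_batch_size(base_batch_size: int, batch_multiplier: int) -> int:
    # Mirrors the call visible in the traceback:
    #   batch_size=int(get_batch_size() * batch_multiplier)
    # batch_multiplier=1 therefore already runs at the base batch size;
    # going lower means shrinking the base batch size itself.
    return int(base_batch_size * batch_multiplier)
```

For example, with a base of 32, a multiplier of 1 keeps the batch at 32, while 2 doubles it to 64.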

zero1zero commented 3 weeks ago

> What if you manually override `batch_multiplier` and reduce it so that it stays within your VRAM limits? This might slow things down, but it should keep you under the limit.

I set it explicitly to 1 as a parameter to `convert_single_pdf` above; should that have the same effect?

I also tried setting the surya batch size as low as I could, but that didn't seem to have any effect: https://github.com/VikParuchuri/marker/blob/master/marker/settings.py#L41
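For reference, the way I tried to force the surya batch size down was via the environment before importing anything. This is a sketch; `RECOGNITION_BATCH_SIZE` is my reading of the settings file linked above and may not be the exact name in every version.

```python
import os

# Assumed sketch: cap the surya recognition batch size before marker/surya
# are imported. The variable name is taken from my reading of
# marker/settings.py and may differ between versions.
os.environ["RECOGNITION_BATCH_SIZE"] = "16"
```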

zero1zero commented 3 weeks ago

Alright, this turned out to be a side effect of the Spark UDF plumbing, which was errantly calling `load_all_models` twice. With that fixed, the code above works as expected with `INFERENCE_RAM`.
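For anyone hitting the same thing, a once-per-process guard around model loading can be sketched like this. `load_all_models` is stubbed out here so the pattern is visible without a GPU; in real use the stub would be replaced by marker's loader.

```python
from functools import lru_cache

calls = {"n": 0}

def load_all_models_stub():
    # Stand-in for marker's load_all_models(); it just counts how often
    # loading actually happens so the caching behavior is observable.
    calls["n"] += 1
    return ("detection", "recognition", "layout")

@lru_cache(maxsize=1)
def get_models():
    # Repeated calls from the same worker process (e.g. repeated Spark UDF
    # invocations) return the cached result, so the real loader would only
    # allocate VRAM once.
    return load_all_models_stub()

models_a = get_models()
models_b = get_models()
```

`lru_cache(maxsize=1)` on a zero-argument function is a simple way to get a per-process singleton without a `global` flag.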