NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Finetuning nvidia/parakeet-tdt-1.1b runs out of memory even with a lower batch size. #10085

Open sankar-mukherjee opened 1 month ago

sankar-mukherjee commented 1 month ago

I am trying to finetune the nvidia/parakeet-tdt-1.1b model, following the instructions below, on a g5.12xlarge with 4 GPUs with 24 GB memory each.

https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/configs.html#fine-tuning-configurations

First I built a Docker container, and inside the container I run finetune.sh. I get an OutOfMemoryError before training even starts. I have tried reducing batch_size to 16, 8, 4, and 2, as well as reducing max_duration of the audio files to 20, 10, and 5 seconds. None of these succeeds. Can anyone help me?

Dockerfile

# Use the specified base image
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:24.01-py3
FROM ${FROM_IMAGE_NAME}

# Set the working directory
WORKDIR /ASR

# Expose port 8000 for external communication
EXPOSE 8000

# Install system dependencies
RUN apt-get update && apt-get install -y screen libsndfile1 ffmpeg libsox-dev gfortran

# Install Cython (needed for NeMo)
RUN pip install Cython

# Clone the specified branch of the pytorch-lightning repository and install it
RUN git clone -b bug_fix https://github.com/athitten/pytorch-lightning.git && \
    cd pytorch-lightning && \
    PACKAGE_NAME=pytorch pip install -e .
RUN git clone https://github.com/NVIDIA/TransformerEngine.git && \
    cd TransformerEngine && \
    git fetch origin 8c9abbb80dba196f086b8b602a7cf1bce0040a6a && \
    git checkout FETCH_HEAD && \
    git submodule init && git submodule update && \
    NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
# Copy and install Python dependencies from requirements.txt if necessary
COPY requirements.txt .
RUN pip install -r requirements.txt
# RUN pip uninstall -y huggingface_hub && \
#     pip install huggingface_hub==0.22.0

# Set environment variables for NVIDIA
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NEMO_CACHE_DIR /efs/smukherjee/ASR/cached_models

# Copy the rest of your application code
COPY . .

requirements.txt

nemo_toolkit[all]
transformers
huggingface-hub==0.23.2
seaborn

finetune.sh

#!/usr/bin/env bash

export HF_HOME='/efs/smukherjee/ASR/cached_models/'
export HYDRA_FULL_ERROR=1

python /efs/smukherjee/NeMo/examples/asr/speech_to_text_finetune.py \
    --config-path=/efs/smukherjee/NeMo/examples/asr/conf/asr_finetune \
    --config-name=speech_to_text_finetune \
    model.train_ds.manifest_filepath="/efs/smukherjee/ASR/data/train_finetune_dataset_raw.json" \
    model.validation_ds.manifest_filepath="/efs/smukherjee/ASR/data/val_finetune_dataset_raw.json" \
    model.train_ds.max_duration=5 \
    model.train_ds.batch_size=2 \
    model.validation_ds.batch_size=2 \
    model.tokenizer.update_tokenizer=False \
    trainer.devices=-1 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50 \
    exp_manager.exp_dir="/efs/smukherjee/ASR/output/finetune" \
    +model.joint.fused_batch_size=1 \
    +init_from_pretrained_model="nvidia/parakeet-tdt-1.1b"
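
For reference, the container is built and started roughly like this (the image tag, container flags, and mount here are illustrative, not the exact commands I used):

# Build the image from the Dockerfile above (tag is illustrative)
docker build -t asr-finetune:latest .

# Start it with all GPUs visible; mount the EFS path used by finetune.sh
# and publish port 8000, which the Dockerfile exposes
docker run --gpus all -it \
    -p 8000:8000 \
    -v /efs/smukherjee:/efs/smukherjee \
    asr-finetune:latest bash

# Inside the container, run the fine-tuning script
bash finetune.sh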

Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 21.98 GiB of which 18.50 MiB is free. Process 6411 has 21.05 GiB memory in use. Process 7096 has 302.00 MiB memory in use. Process 7094 has 302.00 MiB memory in use. Process 7095 has 302.00 MiB memory in use. Of the allocated memory 20.38 GiB is allocated by PyTorch, and 219.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Epoch 0:   0%|          | 0/16093 [00:13<?, ?it/s]
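
The error message itself points at PYTORCH_CUDA_ALLOC_CONF; one thing I have not tried yet would be adding something like this at the top of finetune.sh (the value 128 is a guess, not a verified fix):

# Reduce allocator fragmentation, as suggested by the OOM message
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128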
nithinraok commented 3 weeks ago

Have you tried loading the 1.1b using

from nemo.collections.asr.models import ASRModel
model = ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b')

and checking the memory usage? You would need roughly twice this size initially, since you are fine-tuning from an existing model. The memory usage at this point should tell you whether you have any memory left to train.
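
For example, something along these lines (a minimal sketch, assuming a single visible GPU) shows how much memory the checkpoint alone takes before any training state is created:

import torch
from nemo.collections.asr.models import ASRModel

# Load the pretrained checkpoint onto one GPU
model = ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b').cuda()

# Memory taken by the weights alone, before optimizer states and activations
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")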