NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Finetuning nvidia/parakeet-tdt-1.1b runs out of memory even with a lower batch size. #10085

Open sankar-mukherjee opened 1 month ago

sankar-mukherjee commented 1 month ago

I am trying to finetune the nvidia/parakeet-tdt-1.1b model, following the instructions below, on a g5.12xlarge with 4 GPUs with 24 GB memory each.

https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/configs.html#fine-tuning-configurations

First I built a Docker container, and inside the container I run finetune.sh. I get an OutOfMemoryError before training even starts. I have tried reducing batch_size to 16, 8, 4, and 2, as well as reducing max_duration of the audio files to 20, 10, and 5 seconds. None of these succeeds. Can anyone help me?

Dockerfile

# Use the specified base image
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:24.01-py3
FROM ${FROM_IMAGE_NAME}

# Set the working directory
WORKDIR /ASR

# Expose port 8000 for external communication
EXPOSE 8000

# Install system dependencies
RUN apt-get update && apt-get install -y screen libsndfile1 ffmpeg libsox-dev gfortran

# Install Cython (needed for NeMo)
RUN pip install Cython

# Clone the specified branch of the pytorch-lightning repository and install it
RUN git clone -b bug_fix https://github.com/athitten/pytorch-lightning.git && \
    cd pytorch-lightning && \
    PACKAGE_NAME=pytorch pip install -e .
RUN git clone https://github.com/NVIDIA/TransformerEngine.git && \
    cd TransformerEngine && \
    git fetch origin 8c9abbb80dba196f086b8b602a7cf1bce0040a6a && \
    git checkout FETCH_HEAD && \
    git submodule init && git submodule update && \
    NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
# Copy and install Python dependencies from requirements.txt if necessary
COPY requirements.txt .
RUN pip install -r requirements.txt
# RUN pip uninstall -y huggingface_hub && \
#     pip install huggingface_hub==0.22.0

# Set environment variables for NVIDIA
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NEMO_CACHE_DIR /efs/smukherjee/ASR/cached_models

# Copy the rest of your application code
COPY . .

requirements.txt

nemo_toolkit[all]
transformers
huggingface-hub==0.23.2
seaborn

finetune.sh

#!/usr/bin/env bash

export HF_HOME='/efs/smukherjee/ASR/cached_models/'
export HYDRA_FULL_ERROR=1

python /efs/smukherjee/NeMo/examples/asr/speech_to_text_finetune.py \
    --config-path=/efs/smukherjee/NeMo/examples/asr/conf/asr_finetune \
    --config-name=speech_to_text_finetune \
    model.train_ds.manifest_filepath="/efs/smukherjee/ASR/data/train_finetune_dataset_raw.json" \
    model.validation_ds.manifest_filepath="/efs/smukherjee/ASR/data/val_finetune_dataset_raw.json" \
    model.train_ds.max_duration=5 \
    model.train_ds.batch_size=2 \
    model.validation_ds.batch_size=2 \
    model.tokenizer.update_tokenizer=False \
    trainer.devices=-1 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50 \
    exp_manager.exp_dir="/efs/smukherjee/ASR/output/finetune" \
    +model.joint.fused_batch_size=1 \
    +init_from_pretrained_model="nvidia/parakeet-tdt-1.1b"
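
For reference, the container is built and started roughly like this (the image tag, container flags, and mount here are illustrative, not the exact commands I used):

# Build the image from the Dockerfile above (tag is illustrative)
docker build -t asr-finetune:latest .

# Start it with all GPUs visible; mount the EFS path used by finetune.sh
# and publish port 8000, which the Dockerfile exposes
docker run --gpus all -it \
    -p 8000:8000 \
    -v /efs/smukherjee:/efs/smukherjee \
    asr-finetune:latest bash

# Inside the container, run the fine-tuning script
bash finetune.sh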

Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 21.98 GiB of which 18.50 MiB is free. Process 6411 has 21.05 GiB memory in use. Process 7096 has 302.00 MiB memory in use. Process 7094 has 302.00 MiB memory in use. Process 7095 has 302.00 MiB memory in use. Of the allocated memory 20.38 GiB is allocated by PyTorch, and 219.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Epoch 0:   0%|          | 0/16093 [00:13<?, ?it/s]
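
The error message itself points at PYTORCH_CUDA_ALLOC_CONF; one thing I have not tried yet would be adding something like this at the top of finetune.sh (the value 128 is a guess, not a verified fix):

# Reduce allocator fragmentation, as suggested by the OOM message
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128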
nithinraok commented 3 weeks ago

Have you tried loading the 1.1b using

from nemo.collections.asr.models import ASRModel
model = ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b')

and checking the memory usage? You would need roughly twice this size initially, since you are fine-tuning from an existing model. The memory usage at this point should tell you whether you have any memory left to train.
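
For example, something along these lines (a minimal sketch, assuming a single visible GPU) shows how much memory the checkpoint alone takes before any training state is created:

import torch
from nemo.collections.asr.models import ASRModel

# Load the pretrained checkpoint onto one GPU
model = ASRModel.from_pretrained('nvidia/parakeet-tdt-1.1b').cuda()

# Memory taken by the weights alone, before optimizer states and activations
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")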