LazyNeMoIterator improperly duration handling for non-zero offset in nemo manifest file

FredSRichardson commented 1 month ago

Describe the bug These lines of code do not handle a cut's duration correctly when the offset is not zero: From nemo/collections/common/data/lhotse/nemo_adapters.py for the class LazyNeMoIterator:

     def __iter__(self) -> Generator[Cut, None, None]:
        for data in self.source:
            audio_path = get_full_path(str(data.pop("audio_filepath")), str(self.path))
            duration = data.pop("duration")
            offset = data.pop("offset", None)
            recording = self._create_recording(audio_path, duration, data.pop("sampling_rate", None))
            cut = recording.to_cut()
            if offset is not None:
                cut = cut.truncate(offset=offset, duration=duration, preserve_id=True)
                cut.id = f"{cut.id}-{round(offset * 1e2):06d}-{round(duration * 1e2):06d}"

The issue is that the function _create_recording() in the same file uses the duration parameter to compute the number of samples for the whole recording in the returned lhotse Recording instance. The duration in the nemo manifest is for the cut starting at the offset value within the recording. The duration of the whole recording has to be obtained from somewhere else (possibly the recording itself).

Steps/Code to reproduce bug

Run scripts/speech_recognition/estimate_duration_bins.py -b 30 manifest.json using a manifest file containing non-zero offsets for example:

{"duration": 0.5, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 123.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 124.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 125.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 126.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 127.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 128.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 129.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 130.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 131.4}
{"duration": 1.0, "audio_filepath": "/path/to/some/file.wav", "text": "hello world", "offset": 132.4}
{"duration": 5.0, "audio_filepath": "/path/to/some/other-file.wav", "text": "hello world", "offset": 4.0}
{"duration": 4.0, "audio_filepath": "/path/to/some/other-file.wav", "text": "hello world", "offset": 5.0}
{"duration": 4.0, "audio_filepath": "/path/to/some/other-file.wav", "text": "hello world", "offset": 4.0}

You'll notice that the mean duration in the output is a negative number.

Expected behavior

The duration of the whole audio file should be using when creating the Recording object in the method _create_recording() in nemo/collections/common/data/lhotse/nemo_adapters.py. The reported mean duration should not be negative.

Environment overview (please complete the following information)

Environment location: Bare-metal

Method of NeMo install:

git clone https://github.com/NVIDIA/NeMo.git
git checkout r2.0.0rc0
conda install -y -c conda-forge numba==0.57.1
pip install --editable ".[all]"
pip install huggingface-hub==0.23.2       # this fixes a bug with newer versions...

Environment details

If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:

OS version: Linux Ubuntu 20.04
PyTorch version: pytorch-lightning-2.3.3
Python version: 3.10.12

Additional context

Add any other context about the problem here. Example: GPU model

FredSRichardson commented 1 month ago

My fix for this was to use soundfile.info() (as you did in other parts of the same code) to extract all the meta data needed to create a Recording object (after confirming the audio file exists, the above example won't work). I cached the meta data by path string to ease up on file I/O (so it should be called only once per a unique audio file).

neehhaa06 commented 1 month ago

Hello @FredSRichardson :)

fix: #10060

I think, Fetch the full audio duration using a method like get_full_audio_duration(audio_path) and use it to create the Recording object.

Update the LazyNeMoIterator to use the full duration for the Recordingdel

FredSRichardson commented 1 month ago

@neehhaa06 The current code in nemo/collections/common/data/lhotse/nemo_adapters.py uses soundfile to avoid I/O overhead of loading the whole file. The related comment is here for the LazyNeMoTarredIterator class:

                    raw_audio = tar.extractfile(tar_info).read()
                    # Note: Lhotse has a Recording.from_bytes() utility that we won't use here because
                    #       the profiling indicated significant overhead in torchaudio ffmpeg integration
                    #       that parses full audio instead of just reading the header for WAV files.
                    # recording = lhotse.Recording.from_bytes(raw_audio, recording_id=tar_info.path)
                    meta = soundfile.info(BytesIO(raw_audio))

The meta variable above returned by soundfile.info() is very useful. It includes the sample rate, number of samples, number of channels, etc. I think this is the best thing to use to address this bug report in addition to caching the meta data using the file path string using a dict.

pzelasko commented 3 weeks ago

Thanks, it looks like there's indeed an issue. Do you mind submitting a PR with your fix?

neehhaa06 commented 3 weeks ago

Hello @pzelasko :) Are you asking me?

pzelasko commented 3 weeks ago

I was asking @FredSRichardson since he mentioned he managed to fix it locally. That said you're also welcome to contribute @neehhaa06 :)

FredSRichardson commented 3 weeks ago

I can't contribute directly unfortunately, but I did leave a fairly detailed spec of the change and it's fairly minor. Let me know if you have any questions.

pzelasko commented 3 weeks ago

Thanks, appreciate it. I added tests and a fix here https://github.com/NVIDIA/NeMo/pull/10198

NVIDIA / NeMo

LazyNeMoIterator improperly duration handling for non-zero offset in nemo manifest file #10060

I think, Fetch the full audio duration using a method like get_full_audio_duration(audio_path) and use it to create the Recording object.