How to process audio already loaded in numpy and/or torch?

Jeronymous commented 9 months ago

Thank you for this good repo!

I find that auditok generally works better than popular VADs such as silero (which can behave unpredictably on some types of audio). I'd like to use it in my project, but I struggle to do so because, at the point where I call the VAD, I don't have access to a wav file. The only way I found to pass the torch tensor of raw audio is this awkward conversion:

    byte_io = io.BytesIO(bytes())
    scipy.io.wavfile.write(byte_io, SAMPLE_RATE, (audio.numpy() * 32767).astype(np.int16)) # audio is a torch tensor
    bytes_wav = byte_io.read()

    segments = auditok.split(
        bytes_wav,
        sampling_rate=SAMPLE_RATE,        # sampling frequency in Hz
        channels=1,                       # number of channels
        sample_width=2,                   # number of bytes per sample
        min_dur=min_speech_duration,      # minimum duration of a valid audio event in seconds
        max_dur=len(audio)/SAMPLE_RATE,   # maximum duration of an event
        max_silence=min_silence_duration, # maximum duration of tolerated continuous silence within an event
        energy_threshold=50,
        drop_trailing_silence=True,
    )

Is there a better way to do that?

If you want to see more, or directly comment on the related PR, it's here: https://github.com/linto-ai/whisper-timestamped/pull/78/files#diff-4d4adecf50ce8affc04f13ab7274717945dd716eb910225ff154f717e81c3b64R1791

amsehili commented 9 months ago

Hi, thanks for using this repo :)

You only need to get a byte view of the numpy array and pass it to split:

    data = audio.numpy().astype(np.int16).view(np.uint8) # multiply audio.numpy() by 32767 if needed
    segments = auditok.split(data, other_params...)

For more generic code, you can call astype with the right parameter, depending on the sample width, and flatten stereo data:

    sample_width_to_numpy = {1: np.int8, 2: np.int16, 4: np.int32}
    fmt = sample_width_to_numpy[your_sample_width]
    data = a.T.reshape(-1).astype(fmt).view(np.int8)

A numpy array with stereo data is expected to have shape (n_channels, n_samples). If your array has shape (n_samples, n_channels), then just use a.reshape(-1).astype(fmt).view(np.int8) (without the transpose).
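To make the effect of the transpose concrete, here is a minimal sketch of the flattening step. The helper name `to_interleaved_pcm` and its `channels_first` flag are hypothetical, not part of auditok, and the sketch calls `tobytes()` at the end so the result is a plain `bytes` object rather than an array view.

```python
import numpy as np

# Hypothetical helper (not part of auditok): flatten an integer sample
# array into interleaved raw PCM bytes.
def to_interleaved_pcm(a, sample_width=2, channels_first=True):
    sample_width_to_numpy = {1: np.int8, 2: np.int16, 4: np.int32}
    fmt = sample_width_to_numpy[sample_width]
    a = np.asarray(a)
    if a.ndim == 2 and channels_first:
        # (n_channels, n_samples) -> (n_samples, n_channels)
        a = a.T
    # Row-major flattening of (n_samples, n_channels) interleaves channels:
    # L0, R0, L1, R1, ...
    return a.reshape(-1).astype(fmt).tobytes()

# Toy stereo signal: left channel 1, 2, 3 and right channel 10, 20, 30.
stereo = np.array([[1, 2, 3], [10, 20, 30]])
pcm = to_interleaved_pcm(stereo, sample_width=2)

# Decode the bytes again to inspect the sample order.
samples = np.frombuffer(pcm, dtype=np.int16)
print(samples.tolist())
```

Decoding the bytes back with `np.frombuffer` shows the left and right samples alternating, which is the layout raw multi-channel PCM is expected to have.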

If you're not sure the data is converted correctly, you can save it then listen to it:

    audio = auditok.AudioRegion(data, sampling_rate=SAMPLE_RATE, sample_width=2, channels=1)
    audio.save("audio.wav")

If you have pyaudio installed, just play it:

    audio.play()

Use Ctrl+C to stop playing long audio.

Jeronymous commented 9 months ago

Thank you for your answer :)

Neither view(np.uint8) nor view(np.int8) (both mentioned in your message) worked, because view() returns a <class 'numpy.ndarray'>, not bytes.

The simplest solution, after looking a bit at the numpy docs, was:

    data = (audio.numpy() * 32767).astype(np.int16).tobytes()
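For anyone landing here later, the conversion can be wrapped in a small helper. This is only a sketch: the name `float_to_pcm16_bytes` is hypothetical, it assumes mono float audio roughly in [-1, 1], and the clipping step is an addition not discussed above, meant to avoid integer wrap-around on out-of-range samples. For a torch tensor, pass `audio.numpy()` as in the snippets above.

```python
import numpy as np

# Hypothetical helper (not part of auditok): convert float audio in
# [-1, 1] to raw 16-bit PCM bytes.
def float_to_pcm16_bytes(audio):
    audio = np.asarray(audio, dtype=np.float32)
    # Assumption: samples outside [-1, 1] are clipped rather than wrapped.
    audio = np.clip(audio, -1.0, 1.0)
    return (audio * 32767).astype(np.int16).tobytes()

audio = np.array([0.0, 0.5, -1.0, 1.0], dtype=np.float32)
data = float_to_pcm16_bytes(audio)
print(type(data), len(data))

# The bytes can then be passed to auditok as raw audio, e.g.:
# segments = auditok.split(data, sampling_rate=SAMPLE_RATE, sample_width=2, channels=1)
```

The resulting object is `bytes`, with two bytes per sample, which matches `sample_width=2` and `channels=1` in the split call from the original question.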

amsehili / auditok

How to process audio already loaded in numpy and/or torch? #47