[Bug]: Slow batch inference

BornSaint commented 3 weeks ago

Project Version

3.2.3

Platform and OS Version

Windows 10 64 bits

Affected Devices

PC

Existing Issues

No response

What happened?

RTX 3060 12 GB (latest drivers), 64 GB RAM, Ryzen 5 2400G

I just compared Applio's inference with https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/tree/main/tools. Since Applio is based on that project, I tried, but I can't figure out why the inference times are so different.

RVC WebUI took 0.47s per audio on the first run and 0.15s per audio on the next runs.
Total audios: 12
Average length of each audio: 5s
Total time on the first run: 12 x 0.47 = 5.64s
Total time on the next runs: 12 x 0.15 = 1.8s

The RVC WebUI tools also include a batch inference script, which is pretty close to Applio's script. However, the WebUI script took 17s in total for the same task, while Applio took at least 37s (about 2s per audio file on average).
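Because the first run is much slower than later runs, it helps to report cold and warm timings separately when comparing the two scripts. The following is a minimal sketch, not code from either project: `convert` is a hypothetical callable standing in for whichever per-file conversion function is being measured.

```python
import time
from statistics import mean


def time_runs(convert, files, runs=3):
    """Return the total wall-clock seconds for each full pass over `files`."""
    totals = []
    for _ in range(runs):
        start = time.perf_counter()
        for path in files:
            convert(path)  # hypothetical per-file conversion call
        totals.append(time.perf_counter() - start)
    return totals


def summarize(totals, n_files):
    """Split the first (cold) pass from the average of the later (warm) passes."""
    cold = totals[0] / n_files
    warm = mean(totals[1:]) / n_files if len(totals) > 1 else None
    return {"cold_per_file": cold, "warm_per_file": warm}
```

Feeding it the totals reported above (5.64s cold, 1.8s warm, 12 files) gives back the per-file figures of 0.47s and 0.15s.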

Steps to reproduce

Run the scripts mentioned above with rmvpe as the f0 method
Compare the time each takes to finish
...

Expected behavior

Should be at least close to 17s total time

Attachments

No response

Screenshots or Videos

No response

Additional Information

No response

BornSaint commented 3 weeks ago

I suspect that rmvpe is being loaded again for each sample. I tested again, even with vc.vc_multi, and it's the same thing (16s), but most of the time is wasted on:

2024-10-26 07:30:19 | INFO | infer.modules.vc.pipeline | Loading rmvpe model,assets/rmvpe/rmvpe.pt

After this finishes, inference is almost instant. This is the log (final cumulative output of the run; every file reports Success with index logs\added_teste_v2.index):

0.flac: npy: 0.20s, f0: 1.57s, infer: 1.26s
1.flac: npy: 0.09s, f0: 0.05s, infer: 0.40s
2.flac: npy: 0.09s, f0: 0.31s, infer: 0.43s
22.flac: npy: 0.11s, f0: 0.33s, infer: 0.43s
23.flac: npy: 0.12s, f0: 0.31s, infer: 0.41s
24.flac: npy: 0.12s, f0: 0.30s, infer: 0.41s
25.flac: npy: 0.11s, f0: 0.30s, infer: 0.40s
3.flac: npy: 0.10s, f0: 0.06s, infer: 0.41s
4.flac: npy: 0.09s, f0: 0.32s, infer: 0.40s
5.flac: npy: 0.09s, f0: 0.30s, infer: 0.39s
7.flac: npy: 0.10s, f0: 0.07s, infer: 0.41s
8.flac: npy: 0.10s, f0: 0.05s, infer: 0.39s

BornSaint commented 3 weeks ago

Log from Applio (I added time.time() calls before and after the pipeline call):

Converting audio batch '.\infer_data\female'...
Detected 12 audio files for inference.
Converting audio '.\infer_data\female\0.flac'...
pipeline time: 3.274010419845581
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\0_output.wav' in 4.69 seconds.
Converting audio '.\infer_data\female\1.flac'...
pipeline time: 2.129990816116333
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\1_output.wav' in 2.21 seconds.
Converting audio '.\infer_data\female\10.flac'...
pipeline time: 2.5399980545043945
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\10_output.wav' in 2.62 seconds.
Converting audio '.\infer_data\female\11.flac'...
pipeline time: 1.717000961303711
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\11_output.wav' in 1.80 seconds.
Converting audio '.\infer_data\female\2.flac'...
pipeline time: 1.6319880485534668
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\2_output.wav' in 1.70 seconds.
Converting audio '.\infer_data\female\3.flac'...
pipeline time: 1.7249972820281982
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\3_output.wav' in 1.81 seconds.
Converting audio '.\infer_data\female\4.flac'...
pipeline time: 1.9669923782348633
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\4_output.wav' in 2.08 seconds.
Converting audio '.\infer_data\female\5.flac'...
pipeline time: 1.6509978771209717
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\5_output.wav' in 1.75 seconds.
Converting audio '.\infer_data\female\6.flac'...
pipeline time: 1.8819937705993652
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\6_output.wav' in 1.96 seconds.
Converting audio '.\infer_data\female\7.flac'...
pipeline time: 1.818995475769043
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\7_output.wav' in 1.90 seconds.
Converting audio '.\infer_data\female\8.flac'...
pipeline time: 1.8459997177124023
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\8_output.wav' in 1.94 seconds.
Converting audio '.\infer_data\female\9.flac'...
pipeline time: 1.6060001850128174
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\9_output.wav' in 1.69 seconds.
Conversion completed at '.\infer_data\female'.
Batch conversion completed in 35.10 seconds.

BornSaint commented 3 weeks ago

Converting audio batch '.\infer_data\female'...
Detected 12 audio files for inference.
load model time: 0.902989387512207
Converting audio '.\infer_data\female\0.flac'...
pipeline time: 2.7659730911254883
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\0_output.wav' in 4.13 seconds.
load model time: 0.6799962520599365
Converting audio '.\infer_data\female\1.flac'...
pipeline time: 2.0580081939697266
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\1_output.wav' in 2.16 seconds.
load model time: 0.824995756149292
Converting audio '.\infer_data\female\10.flac'...
pipeline time: 2.4289965629577637
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\10_output.wav' in 2.51 seconds.
load model time: 0.7729971408843994
Converting audio '.\infer_data\female\11.flac'...

The conversion model is being loaded every time because get_vc is not called in convert_audio_batch() but in convert_audio(), so it is reloaded on each iteration of the loop in convert_audio_batch().
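One way to avoid the repeated load is to remember which model is already loaded and skip the load when the same path is requested again. The sketch below only illustrates that pattern; it is not Applio's code, and the `Converter` class, `get_model`, `loader`, and `load_count` names are all hypothetical.

```python
class Converter:
    """Toy stand-in for a voice converter; every name here is hypothetical."""

    def __init__(self, loader):
        self._loader = loader      # callable that performs the expensive model load
        self._loaded_path = None   # path of the model currently held in memory
        self._model = None
        self.load_count = 0        # counts real loads, for illustration only

    def get_model(self, model_path):
        # Reload only when a different model is requested.
        if model_path != self._loaded_path:
            self._model = self._loader(model_path)
            self._loaded_path = model_path
            self.load_count += 1
        return self._model

    def convert_audio(self, audio_path, model_path):
        model = self.get_model(model_path)
        return model(audio_path)

    def convert_audio_batch(self, audio_paths, model_path):
        # Each call goes through get_model, but only the first one loads.
        return [self.convert_audio(p, model_path) for p in audio_paths]
```

With this shape, a batch of 12 files that share one model triggers a single load, while single-file calls keep working unchanged.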

BornSaint commented 3 weeks ago

These parts are taking too long:

        print('inside pipeline time (chunk 1):', time.time() - start)
        start1 = time.time()
        print(pitch_guidance)

        if pitch_guidance:
            pitch, pitchf = self.get_f0(
                "input_audio_path",  # questionable purpose of making a key for an array
                audio_pad,
                p_len,
                pitch,
                f0_method,
                filter_radius,
                hop_length,
                f0_autotune,
                f0_autotune_strength,
                inp_f0,
            )
            pitch = pitch[:p_len]
            pitchf = pitchf[:p_len]
            if self.device == "mps":
                pitchf = pitchf.astype(np.float32)
            pitch = torch.tensor(pitch, device=self.device).unsqueeze(0).long()
            pitchf = torch.tensor(pitchf, device=self.device).unsqueeze(0).float()
        print('inside pipeline time (chunk 2):', time.time() - start1)
        start2 = time.time()
        for t in opt_ts:
            t = t // self.window * self.window
            if pitch_guidance:
                audio_opt.append(
                    self.voice_conversion(
                        model,
                        net_g,
                        sid,
                        audio_pad[s : t + self.t_pad2 + self.window],
                        pitch[:, s // self.window : (t + self.t_pad2) // self.window],
                        pitchf[:, s // self.window : (t + self.t_pad2) // self.window],
                        index,
                        big_npy,
                        index_rate,
                        version,
                        protect,
                    )[self.t_pad_tgt : -self.t_pad_tgt]
                )
            else:
                audio_opt.append(
                    self.voice_conversion(
                        model,
                        net_g,
                        sid,
                        audio_pad[s : t + self.t_pad2 + self.window],
                        None,
                        None,
                        index,
                        big_npy,
                        index_rate,
                        version,
                        protect,
                    )[self.t_pad_tgt : -self.t_pad_tgt]
                )
            s = t
        if pitch_guidance:
            audio_opt.append(
                self.voice_conversion(
                    model,
                    net_g,
                    sid,
                    audio_pad[t:],
                    pitch[:, t // self.window :] if t is not None else pitch,
                    pitchf[:, t // self.window :] if t is not None else pitchf,
                    index,
                    big_npy,
                    index_rate,
                    version,
                    protect,
                )[self.t_pad_tgt : -self.t_pad_tgt]
            )
        else:
            audio_opt.append(
                self.voice_conversion(
                    model,
                    net_g,
                    sid,
                    audio_pad[t:],
                    None,
                    None,
                    index,
                    big_npy,
                    index_rate,
                    version,
                    protect,
                )[self.t_pad_tgt : -self.t_pad_tgt]
            )
        audio_opt = np.concatenate(audio_opt)
        if volume_envelope != 1:
            audio_opt = AudioProcessor.change_rms(
                audio, self.sample_rate, audio_opt, self.sample_rate, volume_envelope
            )
        print('inside pipeline time (chunk 3):', time.time() - start2)
        # start2 = time.time()
        # if resample_sr >= self.sample_rate and tgt_sr != resample_sr:
        #    audio_opt = librosa.resample(
        #        audio_opt, orig_sr=tgt_sr, target_sr=resample_sr
        #    )
        # audio_max = np.abs(audio_opt).max() / 0.99
        # max_int16 = 32768
        # if audio_max > 1:
        #    max_int16 /= audio_max
        # audio_opt = (audio_opt * 32768).astype(np.int16)
        audio_max = np.abs(audio_opt).max() / 0.99
        if audio_max > 1:
            audio_opt /= audio_max
        if pitch_guidance:
            del pitch, pitchf
        del sid
        # if torch.cuda.is_available():
        #     torch.cuda.empty_cache()
        return audio_opt
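A caveat on the timing prints above: CUDA work is launched asynchronously, so plain time.time() deltas around GPU code may attribute time to a later statement that happens to wait for the GPU. Below is a minimal sketch of a timer that accepts an optional synchronization hook. The `stage_timer` helper and its parameters are hypothetical; passing `torch.cuda.synchronize` as `sync` is an assumption about how it would be used with PyTorch on a CUDA device.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(label, results, sync=None):
    """Accumulate wall-clock seconds for the enclosed block under `label`.

    `sync` is an optional zero-argument callable run before each clock read,
    for example torch.cuda.synchronize when the block launches CUDA work.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        results[label] = results.get(label, 0.0) + (time.perf_counter() - start)
```

Wrapping the f0 section and the voice_conversion loop in separate `with stage_timer(...)` blocks would give per-stage totals that can be compared directly against the print-based numbers.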

Converting audio batch '.\infer_data\female'...
Detected 12 audio files for inference.
Converting audio '.\infer_data\female\0.flac'...
inside pipeline time (chunk 1): 0.20999526977539062
True
inside pipeline time (chunk 2): 1.49900484085083
inside pipeline time (chunk 3): 0.9780011177062988
pipeline time: 2.733003616333008
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\0_output.wav' in 4.20 seconds.
Converting audio '.\infer_data\female\1.flac'...
inside pipeline time (chunk 1): 0.2650001049041748
True
inside pipeline time (chunk 2): 1.28800630569458
inside pipeline time (chunk 3): 0.6159894466400146
pipeline time: 2.2129993438720703
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\1_output.wav' in 2.31 seconds.
Converting audio '.\infer_data\female\10.flac'...
inside pipeline time (chunk 1): 0.23299407958984375
True
inside pipeline time (chunk 2): 0.8480005264282227
inside pipeline time (chunk 3): 1.028001308441162
pipeline time: 2.1519947052001953
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\10_output.wav' in 2.23 seconds.
Converting audio '.\infer_data\female\11.flac'...
inside pipeline time (chunk 1): 0.23200535774230957
True
inside pipeline time (chunk 2): 1.1539943218231201
inside pipeline time (chunk 3): 0.4559969902038574
pipeline time: 1.8839983940124512
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\11_output.wav' in 1.96 seconds.
Converting audio '.\infer_data\female\2.flac'...
inside pipeline time (chunk 1): 0.24100852012634277
True
inside pipeline time (chunk 2): 0.8179953098297119
inside pipeline time (chunk 3): 0.4420332908630371
pipeline time: 1.549999713897705
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\2_output.wav' in 1.63 seconds.
Converting audio '.\infer_data\female\3.flac'...
inside pipeline time (chunk 1): 0.23300814628601074
True
inside pipeline time (chunk 2): 0.9829943180084229
inside pipeline time (chunk 3): 0.4149973392486572
pipeline time: 1.67500638961792
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\3_output.wav' in 1.76 seconds.
Converting audio '.\infer_data\female\4.flac'...
inside pipeline time (chunk 1): 0.22200894355773926
True
inside pipeline time (chunk 2): 0.8349912166595459
inside pipeline time (chunk 3): 0.5250036716461182
pipeline time: 1.629004716873169
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\4_output.wav' in 1.73 seconds.
Converting audio '.\infer_data\female\5.flac'...
inside pipeline time (chunk 1): 0.24098443984985352
True
inside pipeline time (chunk 2): 0.8300046920776367
inside pipeline time (chunk 3): 0.4479968547821045
pipeline time: 1.5619840621948242
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\5_output.wav' in 1.66 seconds.
Converting audio '.\infer_data\female\6.flac'...
inside pipeline time (chunk 1): 0.28400182723999023
True
inside pipeline time (chunk 2): 1.0190000534057617
inside pipeline time (chunk 3): 0.45599913597106934
pipeline time: 1.8040008544921875
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\6_output.wav' in 1.88 seconds.
Converting audio '.\infer_data\female\7.flac'...
inside pipeline time (chunk 1): 0.2310011386871338
True
inside pipeline time (chunk 2): 0.8650031089782715
inside pipeline time (chunk 3): 0.45100951194763184
pipeline time: 1.600010633468628
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\7_output.wav' in 1.69 seconds.
Converting audio '.\infer_data\female\8.flac'...
inside pipeline time (chunk 1): 0.2579934597015381
True
inside pipeline time (chunk 2): 0.8949980735778809
inside pipeline time (chunk 3): 0.4499995708465576
pipeline time: 1.6529960632324219
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\8_output.wav' in 1.75 seconds.
Converting audio '.\infer_data\female\9.flac'...
inside pipeline time (chunk 1): 0.2400038242340088
True
inside pipeline time (chunk 2): 0.8899903297424316
inside pipeline time (chunk 3): 0.4620020389556885
pipeline time: 1.6359944343566895
Conversion completed at '.\infer_data\output\female\teste_150e_14100s\9_output.wav' in 1.73 seconds.
Conversion completed at '.\infer_data\female'.
Batch conversion completed in 24.53 seconds.
inference_real time: 33.89046263694763
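To see which section dominates across the whole batch, the per-section lines in a log like the one above can be totaled with a short script. This is a sketch that assumes the exact "inside pipeline time (chunk N): seconds" wording produced by the added print statements; `stage_totals` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# "chunk N" in these log lines refers to the N-th timed code section,
# not to an audio chunk.
STAGE_RE = re.compile(r"inside pipeline time \(chunk (\d+)\): ([0-9.]+)")


def stage_totals(log_text):
    """Return {section: (total_seconds, mean_seconds)} for each timed section."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for match in STAGE_RE.finditer(log_text):
        section = int(match.group(1))
        totals[section] += float(match.group(2))
        counts[section] += 1
    return {s: (totals[s], totals[s] / counts[s]) for s in sorted(totals)}
```

Passing the full log text as a string returns one entry per section, which makes it easy to compare the f0 section (chunk 2) against the voice_conversion section (chunk 3).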

BornSaint commented 3 weeks ago

long audio behavior:

Converting audio batch '.\infer_data\female'...
Detected 1 audio files for inference.
Converting audio '.\infer_data\female\instalacao_python.wav'...
Audio split into 338 chunks for processing.
inside pipeline time (chunk 1): 0.24299860000610352
True
inside pipeline time (chunk 2): 1.746009349822998
inside pipeline time (chunk 3): 1.2089946269989014
Converted audio chunk 1
pipeline each iteration time #infer.py (for c in chunks): 3.239001750946045
inside pipeline time (chunk 1): 0.25099658966064453
True
inside pipeline time (chunk 2): 1.2890114784240723
inside pipeline time (chunk 3): 0.7340102195739746
Converted audio chunk 2
pipeline each iteration time #infer.py (for c in chunks): 2.316999912261963
inside pipeline time (chunk 1): 0.2559957504272461
True
inside pipeline time (chunk 2): 1.3240010738372803
inside pipeline time (chunk 3): 1.0670044422149658
Converted audio chunk 3
pipeline each iteration time #infer.py (for c in chunks): 2.6869957447052
inside pipeline time (chunk 1): 0.2520010471343994
True
inside pipeline time (chunk 2): 1.0729985237121582
inside pipeline time (chunk 3): 0.1880021095275879
Converted audio chunk 4
pipeline each iteration time #infer.py (for c in chunks): 1.5529990196228027
inside pipeline time (chunk 1): 0.28700804710388184
True
inside pipeline time (chunk 2): 1.0239930152893066
inside pipeline time (chunk 3): 0.45699620246887207
Converted audio chunk 5
pipeline each iteration time #infer.py (for c in chunks): 1.810995101928711
inside pipeline time (chunk 1): 0.2460174560546875
True
inside pipeline time (chunk 2): 1.2179875373840332
inside pipeline time (chunk 3): 0.46599888801574707
Converted audio chunk 6
pipeline each iteration time #infer.py (for c in chunks): 1.9710032939910889
inside pipeline time (chunk 1): 0.24901056289672852
True
inside pipeline time (chunk 2): 1.0299901962280273
inside pipeline time (chunk 3): 0.4459953308105469
Converted audio chunk 7
pipeline each iteration time #infer.py (for c in chunks): 1.7679967880249023
inside pipeline time (chunk 1): 0.2579948902130127
True
inside pipeline time (chunk 2): 1.0690088272094727
inside pipeline time (chunk 3): 0.4759941101074219
Converted audio chunk 8
pipeline each iteration time #infer.py (for c in chunks): 1.844001054763794
inside pipeline time (chunk 1): 0.2509934902191162
True
inside pipeline time (chunk 2): 1.112001895904541
inside pipeline time (chunk 3): 0.5060062408447266
Converted audio chunk 9
pipeline each iteration time #infer.py (for c in chunks): 1.9189980030059814
inside pipeline time (chunk 1): 0.25499582290649414
True
inside pipeline time (chunk 2): 0.8560085296630859
inside pipeline time (chunk 3): 0.2059948444366455
Converted audio chunk 10
pipeline each iteration time #infer.py (for c in chunks): 1.366994857788086
inside pipeline time (chunk 1): 0.24800729751586914
True
inside pipeline time (chunk 2): 0.8599958419799805
inside pipeline time (chunk 3): 0.44099903106689453
Converted audio chunk 11
pipeline each iteration time #infer.py (for c in chunks): 1.5940062999725342
inside pipeline time (chunk 1): 0.24698877334594727
True
inside pipeline time (chunk 2): 1.0510039329528809
inside pipeline time (chunk 3): 0.45401477813720703
Converted audio chunk 12
pipeline each iteration time #infer.py (for c in chunks): 1.7959966659545898
inside pipeline time (chunk 1): 0.24700641632080078
True
inside pipeline time (chunk 2): 0.87599778175354
inside pipeline time (chunk 3): 0.4320056438446045
Converted audio chunk 13
pipeline each iteration time #infer.py (for c in chunks): 1.595991611480713
inside pipeline time (chunk 1): 0.2500138282775879
True
inside pipeline time (chunk 2): 0.8189914226531982
inside pipeline time (chunk 3): 0.4640042781829834
Converted audio chunk 14
pipeline each iteration time #infer.py (for c in chunks): 1.5820012092590332
inside pipeline time (chunk 1): 0.24498915672302246
True
inside pipeline time (chunk 2): 0.8619990348815918
inside pipeline time (chunk 3): 0.25100064277648926
Converted audio chunk 15
pipeline each iteration time #infer.py (for c in chunks): 1.4049880504608154
inside pipeline time (chunk 1): 0.25600409507751465
True
inside pipeline time (chunk 2): 0.9839873313903809
inside pipeline time (chunk 3): 0.20800209045410156
Converted audio chunk 16
pipeline each iteration time #infer.py (for c in chunks): 1.491002082824707
inside pipeline time (chunk 1): 0.24200701713562012
True
inside pipeline time (chunk 2): 1.0789964199066162
inside pipeline time (chunk 3): 0.4450047016143799
Converted audio chunk 17
pipeline each iteration time #infer.py (for c in chunks): 1.8109970092773438
inside pipeline time (chunk 1): 0.24901175498962402
True
inside pipeline time (chunk 2): 1.144986867904663
inside pipeline time (chunk 3): 0.602996826171875
Converted audio chunk 18
pipeline each iteration time #infer.py (for c in chunks): 2.0439939498901367
inside pipeline time (chunk 1): 0.26201605796813965
True
inside pipeline time (chunk 2): 1.1069881916046143
inside pipeline time (chunk 3): 0.5080296993255615
Converted audio chunk 19
pipeline each iteration time #infer.py (for c in chunks): 1.9290003776550293
inside pipeline time (chunk 1): 0.265000581741333
True
inside pipeline time (chunk 2): 0.8909873962402344
inside pipeline time (chunk 3): 0.19302916526794434
Converted audio chunk 20
pipeline each iteration time #infer.py (for c in chunks): 1.3849997520446777
inside pipeline time (chunk 1): 0.2100069522857666
True
inside pipeline time (chunk 2): 0.8689939975738525
inside pipeline time (chunk 3): 0.47200942039489746
Converted audio chunk 21
pipeline each iteration time #infer.py (for c in chunks): 1.5839977264404297
inside pipeline time (chunk 1): 0.21300816535949707
True
inside pipeline time (chunk 2): 1.0699915885925293
inside pipeline time (chunk 3): 0.4510040283203125
Converted audio chunk 22
pipeline each iteration time #infer.py (for c in chunks): 1.7850091457366943
inside pipeline time (chunk 1): 0.2500028610229492
True
inside pipeline time (chunk 2): 0.8629951477050781
inside pipeline time (chunk 3): 0.2640082836151123
Converted audio chunk 23
pipeline each iteration time #infer.py (for c in chunks): 1.410994529724121
inside pipeline time (chunk 1): 0.21599912643432617
True
inside pipeline time (chunk 2): 1.0450003147125244
inside pipeline time (chunk 3): 0.43401288986206055
Converted audio chunk 24
pipeline each iteration time #infer.py (for c in chunks): 1.7240002155303955
inside pipeline time (chunk 1): 0.20799875259399414
True
inside pipeline time (chunk 2): 1.0520002841949463
inside pipeline time (chunk 3): 0.4290001392364502
Converted audio chunk 25
pipeline each iteration time #infer.py (for c in chunks): 1.7239980697631836
inside pipeline time (chunk 1): 0.21600103378295898
True
inside pipeline time (chunk 2): 1.0550005435943604
inside pipeline time (chunk 3): 0.47301554679870605
Converted audio chunk 26
pipeline each iteration time #infer.py (for c in chunks): 1.7899997234344482
inside pipeline time (chunk 1): 0.22001338005065918
True
inside pipeline time (chunk 2): 1.2089955806732178
inside pipeline time (chunk 3): 0.4539954662322998
Converted audio chunk 27
pipeline each iteration time #infer.py (for c in chunks): 1.9170033931732178
inside pipeline time (chunk 1): 0.19199180603027344
True
inside pipeline time (chunk 2): 0.8360021114349365
inside pipeline time (chunk 3): 0.4630098342895508
Converted audio chunk 28
pipeline each iteration time #infer.py (for c in chunks): 1.523993968963623
inside pipeline time (chunk 1): 0.21499848365783691
True
inside pipeline time (chunk 2): 1.0800158977508545
inside pipeline time (chunk 3): 0.4749889373779297
Converted audio chunk 29
pipeline each iteration time #infer.py (for c in chunks): 1.8020060062408447
inside pipeline time (chunk 1): 0.20099234580993652
True
inside pipeline time (chunk 2): 0.8659980297088623
inside pipeline time (chunk 3): 0.48700475692749023
Converted audio chunk 30
pipeline each iteration time #infer.py (for c in chunks): 1.585996389389038
inside pipeline time (chunk 1): 0.20700812339782715
True
inside pipeline time (chunk 2): 1.050994873046875
inside pipeline time (chunk 3): 0.47499585151672363
Converted audio chunk 31
pipeline each iteration time #infer.py (for c in chunks): 1.766996145248413
inside pipeline time (chunk 1): 0.19699740409851074
True
inside pipeline time (chunk 2): 0.887000322341919
inside pipeline time (chunk 3): 0.20304584503173828
Converted audio chunk 32
pipeline each iteration time #infer.py (for c in chunks): 1.314002513885498
inside pipeline time (chunk 1): 0.190995454788208
True
inside pipeline time (chunk 2): 1.0600101947784424
inside pipeline time (chunk 3): 0.4879903793334961
Converted audio chunk 33
pipeline each iteration time #infer.py (for c in chunks): 1.77799391746521
inside pipeline time (chunk 1): 0.23101520538330078
True
inside pipeline time (chunk 2): 1.0909898281097412
inside pipeline time (chunk 3): 0.5270068645477295
Converted audio chunk 34
pipeline each iteration time #infer.py (for c in chunks): 1.8810007572174072
inside pipeline time (chunk 1): 0.197007417678833
True
inside pipeline time (chunk 2): 0.9069921970367432
inside pipeline time (chunk 3): 0.4550011157989502
Converted audio chunk 35
pipeline each iteration time #infer.py (for c in chunks): 1.5850012302398682
inside pipeline time (chunk 1): 0.19701528549194336
True
inside pipeline time (chunk 2): 1.0109844207763672
inside pipeline time (chunk 3): 0.481001615524292
Converted audio chunk 36
pipeline each iteration time #infer.py (for c in chunks): 1.7220029830932617
inside pipeline time (chunk 1): 0.1970055103302002
True
inside pipeline time (chunk 2): 0.820990800857544
inside pipeline time (chunk 3): 0.20700645446777344
Converted audio chunk 37
pipeline each iteration time #infer.py (for c in chunks): 1.2549960613250732
inside pipeline time (chunk 1): 0.19200968742370605

AznamirWoW commented 3 weeks ago

fixed some model loads, should be faster now

Anthonyxd22 commented 3 weeks ago

me too

blaisewf commented 3 weeks ago

A 5-minute audio converts in 10.84 seconds on a 3060 Ti, so speed is just fine

blaisewf commented 3 weeks ago

If you feel that batch inference can be improved, feel free to open a PR; the same goes for MPS

IAHispano / Applio