Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
How to run as cpu? #16

kdrkdrkdr commented 1 year ago

    RuntimeError Traceback (most recent call last) in
          5 x_tst = stn_tst.unsqueeze(0)
          6 x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
    ----> 7 audio = net_g.infer(x_tst, x_tst_lengths, noise_scale=.667, noise_scale_w=0.8, length_scale=1/speed)[0][0,0].data.cpu().float().numpy()
          8 ipd.display(ipd.Audio(audio,, normalize=False))

    3 frames
    /content/drive/MyDrive/MB-iSTFT-VITS-multilingual/ in __init__(self, device, subbands, taps, cutoff_ratio, beta)
         76
         77 # convert to tensor
    ---> 78 analysis_filter = torch.from_numpy(h_analysis).float().unsqueeze(1).cuda(device)
         79 synthesis_filter = torch.from_numpy(h_synthesis).float().unsqueeze(0).cuda(device)
         80

    RuntimeError: Invalid device, must be cuda device

I ran this code in Colab, but this error occurred...

    %cd /content/drive/MyDrive/MB-iSTFT-VITS-multilingual

    import matplotlib.pyplot as plt
    import IPython.display as ipd

    import os
    import json
    import math
    import torch
    from torch import nn
    from torch.nn import functional as F
    from import DataLoader

    import commons
    import utils
    from data_utils import TextAudioLoader, TextAudioCollate, TextAudioSpeakerLoader, TextAudioSpeakerCollate
    from models import SynthesizerTrn
    from text.symbols import symbols
    from text import text_to_sequence

    from import write

    def get_text(text, hps):
        text_norm = text_to_sequence(text,
        if
            text_norm = commons.intersperse(text_norm, 0)
        text_norm = torch.LongTensor(text_norm)
        return text_norm

    hps = utils.get_hparams_from_file("./configs/arona.json")

    net_g = SynthesizerTrn(
        len(symbols),
        // 2 + 1,
        hps.train.segment_size //,
        **hps.model)
    = net_g.eval()

    _ = utils.load_checkpoint("./logs/arona/G_8000.pth", net_g, None)

    text = 'こんにちは'
    speed = 1
    stn_tst = get_text(text, hps)
    with torch.no_grad():
        x_tst = stn_tst.unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
        audio = net_g.infer(x_tst, x_tst_lengths, noise_scale=.667, noise_scale_w=0.8, length_scale=1/speed)[0][0,0].data.cpu().float().numpy()
    ipd.display(ipd.Audio(audio,, normalize=False))

kdrkdrkdr commented 1 year ago

and my config.json is here.

Aliraheem commented 1 year ago

Have you solved it?

leminhnguyen commented 1 year ago

@kdrkdrkdr Change .cuda(device) to .to(device).
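The suggestion above can be illustrated with a minimal sketch, assuming only a stock PyTorch install. `Tensor.to` accepts any device, while `Tensor.cuda` is expected to reject a CPU device, which matches the traceback in this thread. The variable names and tensor sizes are illustrative assumptions and are not taken from the repository.

```python
import torch

# Illustrative assumption: force CPU, as on a Colab runtime without a GPU.
device = torch.device("cpu")
h = torch.ones(3)  # hypothetical stand-in for a filter tensor

# .to() works for CPU and CUDA devices alike.
moved = h.to(device)

# .cuda() with a CPU device is expected to raise instead of moving the tensor.
try:
    h.cuda(device)
    cuda_rejected_cpu = False
except Exception as exc:
    cuda_rejected_cpu = True
    print(type(exc).__name__, exc)
```

Because `.to(device)` behaves the same on both device types, the same script can run on a GPU runtime or a CPU-only runtime without edits.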

kdrkdrkdr commented 1 year ago

@Aliraheem It's solved.

  1. In, change

    y_mb_hat = F.conv_transpose1d(y_mb_hat, self.updown_filter.cuda(x.device) * self.subbands, stride=self.subbands)

  to

    y_mb_hat = F.conv_transpose1d(y_mb_hat, * self.subbands, stride=self.subbands)

  2. In

    analysis_filter = torch.from_numpy(h_analysis).float().unsqueeze(1).to(device)  # was .cuda(device)
    synthesis_filter = torch.from_numpy(h_synthesis).float().unsqueeze(0).to(device)  # was .cuda(device)

    # register coefficients as buffer
    self.register_buffer("analysis_filter", analysis_filter)
    self.register_buffer("synthesis_filter", synthesis_filter)
    # filter for downsampling & upsampling
    updown_filter = torch.zeros((subbands, subbands, subbands)).float().to(device)  # was .cuda(device)
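The two changes above follow one pattern: build the filters without hard-coding CUDA, register them as buffers, and move them to the input's device at call time. Here is a minimal sketch of that pattern. The class name `UpDown`, the method name `upsample`, the filter values, and the tensor shapes are hypothetical and chosen only for illustration; this is not the repository's implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class UpDown(nn.Module):
    """Hypothetical stand-in for the filter-bank module discussed above."""

    def __init__(self, subbands=4):
        super().__init__()
        self.subbands = subbands
        # Assumed filter contents: one unit tap per subband.
        updown_filter = torch.zeros((subbands, subbands, subbands)).float()
        for k in range(subbands):
            updown_filter[k, k, 0] = 1.0
        # Buffers follow the module when .to(device) is called on it.
        self.register_buffer("updown_filter", updown_filter)

    def upsample(self, y_mb_hat):
        # .to(...) accepts CPU and CUDA devices, unlike .cuda(...).
        return F.conv_transpose1d(
            y_mb_hat,
            self.updown_filter.to(y_mb_hat.device) * self.subbands,
            stride=self.subbands,
        )


# Pick whatever device is available; falls back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
module = UpDown(subbands=4).to(device)
y_mb_hat = torch.randn(1, 4, 10, device=device)
out = module.upsample(y_mb_hat)
print(out.shape, out.device)
```

Since the filter is a registered buffer, `module.to(device)` already moves it, so the extra `.to(y_mb_hat.device)` in `upsample` is only a safeguard for the case where the input lives on a different device than the module.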