Closed: segmentationFaults closed this issue 8 months ago
It feels like there may be a problem with the training.
Hi, I can't see anything wrong with the training curve so far; my loss curve has roughly the same shape. Could you show the indices the model outputs (i.e. the contents of feats.ark)? Is there any obviously abnormal pattern?
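One quick way to eyeball the generated indices for abnormal patterns (a minimal sketch, not from the repo; the scp path is a placeholder, and "fraction of repeated consecutive frames" is just one heuristic for spotting near-constant output):

```python
import numpy as np

def repeat_ratio(indices):
    """Fraction of consecutive frames that repeat the previous label.
    Values near 1.0 mean the model is emitting an almost constant index."""
    indices = np.asarray(indices).reshape(-1)
    if indices.size < 2:
        return 0.0
    return float(np.mean(indices[1:] == indices[:-1]))

def scan_feats(scp_path):
    """Print shape and repeat ratio for every utterance in a Kaldi scp file.
    `scp_path` is a placeholder; point it at the feats.scp produced by
    text2vec inference."""
    from kaldiio import ReadHelper  # lazy import: repeat_ratio works without kaldiio
    with ReadHelper(f"scp:{scp_path}") as reader:
        for key, arr in reader:
            print(key, arr.shape, f"repeat ratio: {repeat_ratio(arr):.2f}")
```

If most utterances show a repeat ratio close to 1.0, the model is collapsing to a few labels, which would explain noise after vocoding.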
```python
import argparse
import logging
import os
import time

import kaldiio
import librosa
import numpy as np
import soundfile as sf
import torch
import yaml
from kaldiio import ReadHelper
from tqdm import tqdm

from ctx_text2vec.utils.io import load_yaml_config
from ctx_text2vec.modeling.build import build_model
from utils.espnet_transform.spectrogram import logmelspectrogram
from utils.espnet_transform.cmvn import CMVN
from ctx_vec2wav.datasets import MelSCPDataset
from ctx_vec2wav.utils import load_model

# Load the text2vec model
device = "cuda"
config = load_yaml_config('/home/work/code/unicats/UniCATS-CTX-text2vec/OUTPUT/Libritts/configs/config.yaml')
model = build_model(config).to(device)
ckpt = torch.load("/home/work/code/unicats/UniCATS-CTX-text2vec/OUTPUT/Libritts/checkpoint/000022e_337777iter.pth")
model.load_state_dict(ckpt["model"])
model.eval()

# Phone token -> token-id lexicon
lexicon = {}
with open("/home/work/code/unicats/UniCATS-CTX-text2vec/data/libritts/lang_1phn/train_all_units.txt", 'r') as f:
    for line in f.readlines():
        txt_token, token_id = line.strip().split()
        lexicon[txt_token] = int(token_id)

# Label -> 2-dim VQ index table
vqid_table = []
with open("/home/work/code/unicats/UniCATS-CTX-text2vec/feats/libritts/vqidx/label2vqidx", 'r') as f:
    for line in f.readlines():
        line = line.strip().split()
        label = int(line[0])
        vqid_table.append(torch.tensor(list(map(int, line[1:]))))
vqid_table = torch.stack(vqid_table, dim=0).to(device)

# Generate VQ labels from a phone sequence
text = "SIL1 IH0 T W UH1 D B IY0 AH0 G L UW1 M IY0 SIL1 S IY1 K R IH0 T N AY1 T SIL2"
text = torch.LongTensor([lexicon[w] for w in text.split()]).unsqueeze(0).to(device)
out = model.generate('top0.85r', text)['content_token'][0]

vqidx = vqid_table[out]
'''
print(vqidx, vqidx.size())
with ReadHelper(f'scp:/home/work/code/unicats/UniCATS-CTX-text2vec/data/libritts/train_all/feats.scp') as reader:
    for key, numpy_array in reader:
        print(key, "\n", torch.LongTensor(numpy_array), numpy_array.shape)
        vqidx = vqid_table[torch.LongTensor(numpy_array)].squeeze(1)
        break
'''

if torch.cuda.is_available():
    device = torch.device("cuda")
    logging.info("Using GPU.")
else:
    device = torch.device("cpu")
    logging.info("Using CPU.")

# CMVN stats for the mel-spectrogram prompt
stats = kaldiio.load_mat('/home/work/code/unicats/UniCATS-CTX-vec2wav/feats/libritts/fbank/train_all/cmvn.ark')
stats_dict = {None: stats}
cmvn = CMVN(stats=stats_dict, norm_means=True, norm_vars=True)

# Load the vec2wav vocoder
vec2wav_config = yaml.load(open("/home/work/code/unicats/UniCATS-CTX-vec2wav/exp/pretrained/config.yml"), Loader=yaml.Loader)
model_vec2wav = load_model("/home/work/code/unicats/UniCATS-CTX-vec2wav/exp/pretrained/ctx_v2w.pkl", vec2wav_config)
model_vec2wav.backend.remove_weight_norm()
model_vec2wav = model_vec2wav.eval().to(device)

# VQ codebooks, one embedding table per group
feat_codebook = torch.tensor(np.load(vec2wav_config["vq_codebook"], allow_pickle=True)).to(device)  # (2, 320, 256)
feat_codebook_numgroups = feat_codebook.shape[0]
feat_codebook = torch.nn.ModuleList(
    [torch.nn.Embedding.from_pretrained(feat_codebook[i], freeze=True) for i in range(feat_codebook_numgroups)])

# Look up VQ vectors for the generated indices
start = time.time()
vqvec = torch.cat([feat_codebook[i](vqidx[:, i]) for i in range(feat_codebook_numgroups)],
                  dim=-1).unsqueeze(0).to(device)  # (1, L, 512)
print(vqvec.size(), vqvec)

# Acoustic prompt: log-mel spectrogram of a reference wav, CMVN-normalized
prompt_wav, sr = librosa.load("/data/workspace/nfs_2/work/data/tts/pidgin/occ2/select_audios_16k/231027130009695209_133353_136384.wav", sr=16000)
prompt = logmelspectrogram(
    x=prompt_wav.T, fs=16000, n_mels=80, n_fft=1024, n_shift=160,
    win_length=465, window="hann", fmin=80, fmax=7600).squeeze()[None, :, :]
prompt = torch.FloatTensor(prompt)
prompt = cmvn(prompt).float().to(device)

# Vocode and write the waveform
with torch.no_grad():
    y = model_vec2wav.inference(vqvec, prompt, normalize_before=False)[-1].view(-1)
sf.write(os.path.join("test", "test.wav"), y.cpu().numpy(), 16000, "PCM_16")
```
This is my script that chains text2vec and vec2wav together. The VQ indices produced by text2vec synthesize into noise, while the feats from the eval set synthesize into intelligible speech.
Maybe my inference is wrong?
It's hard to locate the cause this way. Could you try continuation first? In principle continuation should be more stable; if it is still pure noise, there is probably something wrong in the inference pipeline.
OK, my inference code was copied from inference.py. I'll try continuation.
Tried continuation, and it does produce speech 👍
May I ask how you did the data preparation?
Glad it produces speech. Inference also works basically fine on my side, but in typical use continuation is the main mode anyway.
Data preparation went roughly like this: first I prepared the wav.scp, utt2spk, and text files myself (at this point, text is still a word sequence). Then I ran forced alignment with Kaldi, which converts the word sequences into phone sequences and also yields the duration files. Next I extracted 2-dimensional VQ index features with the vq-wav2vec model released by fairseq. Finally, I assigned an integer label to each distinct 2-dimensional index pair, which gives a bit over twenty thousand VQ labels in the dataset. At that point training can start.
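The last step above (mapping each 2-dim vq-wav2vec index pair to an integer label) can be sketched like this. This is a minimal illustration, not the repo's actual code: the function names and the first-seen enumeration order are assumptions; any stable ordering works as long as training and inference share the same label2vqidx table.

```python
def build_label_table(vq_pairs):
    """Assign one integer label per distinct (idx0, idx1) pair.
    `vq_pairs` is an iterable of 2-tuples extracted by vq-wav2vec.
    Labels are assigned in first-seen order (an assumption)."""
    pair2label = {}
    for pair in vq_pairs:
        if pair not in pair2label:
            pair2label[pair] = len(pair2label)  # next free integer label
    return pair2label

def write_label2vqidx(pair2label, path):
    """One line per label: "<label> <idx0> <idx1>", matching the format
    parsed by the inference script earlier in this thread."""
    with open(path, "w") as f:
        for (i0, i1), label in sorted(pair2label.items(), key=lambda kv: kv[1]):
            f.write(f"{label} {i0} {i1}\n")
```

With two codebook groups of 320 entries each, up to 320×320 pairs are possible, but only the twenty-odd thousand pairs that actually occur in the corpus receive labels.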
This data preparation was done quite early, and the process is fairly involved, so I didn't spell it out in the README. (There is a similar issue here.)
It may just be undertrained: with exactly the same inference code, some texts come out as speech while others are all noise.
I agree with that guess. Inference without any context demands more of the model, so it probably needs longer training to get there.
@segmentationFaults Did you end up with a script that works?
@danablend Ah yes, please check out the recent commits from several days ago. The README now explains how to perform inference. His question can be summarized as: randomly sampling from scratch is not that stable, while performing continuation (given the context at the front) yields better results. This might be related to insufficient training.
Until 1000K steps, it still generates noise for some texts.
@segmentationFaults Wondering if you experienced the same thing: I have trained for 400k steps so far (7 epochs). With continuation, the output sounded better and better up to 300k steps, but now at 400k it sounds worse than at 250k-300k (pronunciations are getting less accurate and the output is noisier). Did you see the same thing?
I am just wondering whether this is fine and I should simply keep training, and it will become very good like in the paper and demo page after 1000k+ steps?
May I ask what the loss is supposed to look like in the normal case?