X-LANCE / UniCATS-CTX-txt2vec

[AAAI 2024] CTX-txt2vec, the acoustic model in UniCATS
https://cpdu.github.io/unicats

Trained for 400k steps with the data setup from the README, but the output is all noise. What could be the problem? #2

Closed. segmentationFaults closed this issue 8 months ago.

segmentationFaults commented 8 months ago

I'd like to ask what the loss should look like in a normal run.

segmentationFaults commented 8 months ago
[screenshot: training loss curves]
segmentationFaults commented 8 months ago

It feels like the training might have a problem.

cantabile-kwok commented 8 months ago

Hi, I can't spot a problem in the training curves for now; my loss has roughly the same shape. Could you take a look at the indices the model outputs (i.e. the contents of feats.ark)? Is there any obviously abnormal pattern?
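
(One concrete way to do this check: a minimal sketch with kaldiio that scans the generated label sequences for degenerate patterns. The scp path here is only an example, not a path from this repo.)

    import collections
    from kaldiio import ReadHelper

    # Scan generated label sequences for degenerate patterns,
    # e.g. a single label occupying most of the frames.
    with ReadHelper('scp:feats.scp') as reader:  # example path
        for key, arr in reader:
            counts = collections.Counter(arr.reshape(-1).tolist())
            label, n = counts.most_common(1)[0]
            print(key, "frames:", arr.shape[0],
                  "top label:", label, "share: %.2f" % (n / arr.size))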

segmentationFaults commented 8 months ago

    from tqdm import tqdm
    import argparse
    import logging
    import os
    import time

    import kaldiio
    from kaldiio import ReadHelper
    import librosa
    import numpy as np
    import soundfile as sf
    import torch
    import yaml

    from ctx_text2vec.utils.io import load_yaml_config
    from ctx_text2vec.modeling.build import build_model
    from ctx_vec2wav.datasets import MelSCPDataset
    from ctx_vec2wav.utils import load_model
    from utils.espnet_transform.spectrogram import logmelspectrogram
    from utils.espnet_transform.cmvn import CMVN

    # Load the trained CTX-txt2vec acoustic model.
    device = "cuda"
    config = load_yaml_config('/home/work/code/unicats/UniCATS-CTX-text2vec/OUTPUT/Libritts/configs/config.yaml')
    model = build_model(config).to(device)
    ckpt = torch.load("/home/work/code/unicats/UniCATS-CTX-text2vec/OUTPUT/Libritts/checkpoint/000022e_337777iter.pth")
    model.load_state_dict(ckpt["model"])
    model.eval()

    # Phone symbol -> token id.
    lexicon = {}
    with open("/home/work/code/unicats/UniCATS-CTX-text2vec/data/libritts/lang_1phn/train_all_units.txt", 'r') as f:
        for line in f.readlines():
            txt_token, token_id = line.strip().split()
            lexicon[txt_token] = int(token_id)

    # Integer label -> 2-dim vq-wav2vec index pair.
    vqid_table = []
    with open("/home/work/code/unicats/UniCATS-CTX-text2vec/feats/libritts/vqidx/label2vqidx", 'r') as f:
        for line in f.readlines():
            line = line.strip().split()
            label = int(line[0])
            vqid_table.append(torch.tensor(list(map(int, line[1:]))))
    vqid_table = torch.stack(vqid_table, dim=0).to(device)

    # Sample VQ labels from the phone sequence.
    text = "SIL1 IH0 T W UH1 D B IY0 AH0 G L UW1 M IY0 SIL1 S IY1 K R IH0 T N AY1 T SIL2"
    text = torch.LongTensor([lexicon[w] for w in text.split()]).unsqueeze(0).to(device)
    out = model.generate('top0.85r', text)['content_token'][0]

    vqidx = vqid_table[out]

    '''
    print(vqidx, vqidx.size())

    with ReadHelper(f'scp:/home/work/code/unicats/UniCATS-CTX-text2vec/data/libritts/train_all/feats.scp') as reader:
        for key, numpy_array in reader:
            if key == "1089_134686_000002_000000":
                print(key, "\n", torch.LongTensor(numpy_array), numpy_array.shape)
                vqidx = vqid_table[torch.LongTensor(numpy_array)].squeeze(1)
                break

    print(vqidx, vqidx.size())
    '''

    if torch.cuda.is_available():
        device = torch.device("cuda")
        logging.info("Using GPU.")
    else:
        device = torch.device("cpu")
        logging.info("Using CPU.")

    # CMVN statistics for normalizing the mel prompt.
    stats = kaldiio.load_mat('/home/work/code/unicats/UniCATS-CTX-vec2wav/feats/libritts/fbank/train_all/cmvn.ark')
    stats_dict = {None: stats}
    cmvn = CMVN(stats=stats_dict, norm_means=True, norm_vars=True)

    # Load the pretrained CTX-vec2wav vocoder.
    vec2wav_config = yaml.load(open("/home/work/code/unicats/UniCATS-CTX-vec2wav/exp/pretrained/config.yml"), Loader=yaml.Loader)
    model_vec2wav = load_model("/home/work/code/unicats/UniCATS-CTX-vec2wav/exp/pretrained/ctx_v2w.pkl", vec2wav_config)
    model_vec2wav.backend.remove_weight_norm()
    model_vec2wav = model_vec2wav.eval().to(device)

    # vq-wav2vec codebooks, shape (2, 320, 256): 2 groups of 320 codes each.
    feat_codebook = torch.tensor(np.load(vec2wav_config["vq_codebook"], allow_pickle=True)).to(device)
    feat_codebook_numgroups = feat_codebook.shape[0]
    feat_codebook = torch.nn.ModuleList([torch.nn.Embedding.from_pretrained(feat_codebook[i], freeze=True)
                                         for i in range(feat_codebook_numgroups)])

    # Look up the codebook vectors for the sampled indices -> (1, L, 512).
    start = time.time()
    vqvec = torch.cat([feat_codebook[i](vqidx[:, i]) for i in range(feat_codebook_numgroups)], dim=-1).unsqueeze(0).to(device)
    print(vqvec.size(), vqvec)

    # Mel-spectrogram prompt that carries the speaker identity.
    prompt_wav, sr = librosa.load("/data/workspace/nfs_2/work/data/tts/pidgin/occ2/select_audios_16k/231027130009695209_133353_136384.wav", sr=16000)
    prompt = logmelspectrogram(
        x=prompt_wav.T, fs=16000, n_mels=80, n_fft=1024, n_shift=160,
        win_length=465, window="hann", fmin=80, fmax=7600).squeeze()[None, :, :]
    prompt = torch.FloatTensor(prompt)
    prompt = cmvn(prompt).float().to(device)

    # Vocode and save.
    with torch.no_grad():
        y = model_vec2wav.inference(vqvec, prompt, normalize_before=False)[-1].view(-1)

    sf.write(os.path.join("test", "test.wav"), y.cpu().numpy(), 16000, "PCM_16")
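
(One extra check that may help when adapting this script: the codebook lookup silently accepts any integer tensor, so it is worth asserting the shape and range of vqidx before vocoding. This snippet is an illustrative addition, not part of the original script.)

    # Illustrative sanity check (not in the original script): the codebook
    # lookup expects integer indices of shape (L, 2), one column per group,
    # each index lying in [0, 320) for the (2, 320, 256) codebook above.
    assert vqidx.dim() == 2 and vqidx.size(1) == feat_codebook_numgroups
    assert vqidx.min() >= 0 and vqidx.max() < 320
    print("index range per group:", vqidx.min(dim=0).values, vqidx.max(dim=0).values)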

segmentationFaults commented 8 months ago

This is my script that chains text2vec and vec2wav together. The VQ indices produced by text2vec come out as noise, but using the feats from the eval set I can synthesize speech.

segmentationFaults commented 8 months ago

Maybe my inference is the problem?

cantabile-kwok commented 8 months ago

It's a bit hard to track down the cause this way. Could you try continuation first? In principle continuation should be more stable; if that still gives pure noise, the problem is probably in the inference procedure.
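
(To make the suggestion concrete: continuation means conditioning the sampler on the ground-truth tokens of a prompt segment instead of sampling everything from scratch. The sketch below only illustrates that idea; every name marked hypothetical is a placeholder, not this repo's actual API, so check the repo's inference code for the real entry point.)

    # Illustration only; names marked (hypothetical) are placeholders and
    # not this repo's real API. Continuation conditions generation on the
    # ground-truth VQ labels of a prompt segment:
    prompt_labels = gt_labels[:n_prompt_frames]            # (hypothetical) labels of the prompt audio
    phones = torch.cat([prompt_phones, target_phones], 1)  # (hypothetical) prompt + target phones
    out = model.generate_continuation('top0.85r', phones, prompt_labels)  # (hypothetical)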

segmentationFaults commented 8 months ago

Yeah, my inference code was copied from inference.py. I'll try continuation.

segmentationFaults commented 8 months ago

Tried continuation, and it does produce speech 👍

segmentationFaults commented 8 months ago

Could I ask how you did the data preparation?

cantabile-kwok commented 8 months ago

Good to hear it produces speech. Inference on my side is mostly fine as well, but in practice we mainly rely on continuation.

For data preparation, we first organized the wav.scp, utt2spk, and text files ourselves (at this stage text is still a word sequence). Then we ran forced alignment with Kaldi, which turns the word sequences into phone sequences and also yields the duration files. Next we extracted 2-dimensional VQ index features with the vq-wav2vec model released by fairseq. Finally, each 2-dimensional index pair was assigned an integer label, which gives a bit over twenty thousand VQ labels across the dataset. With that, training can start. (A rough sketch of the VQ extraction step is given below.)

This data preparation was done quite early on and the process is fairly involved, so we didn't spell it out in the README. (There is a similar issue here.)
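
(For anyone reproducing this: the sketch below follows the vq-wav2vec example from the fairseq README, plus the pair-to-label mapping described above. The checkpoint path and the exact mapping convention are assumptions, not necessarily what was used here.)

    import torch
    from fairseq.models.wav2vec import Wav2VecModel

    # Load a pretrained vq-wav2vec checkpoint (path is an example).
    cp = torch.load('vq-wav2vec_kmeans.pt')
    model = Wav2VecModel.build_model(cp['args'], task=None)
    model.load_state_dict(cp['model'])
    model.eval()

    # Extract the 2-group VQ indices for one 16 kHz waveform.
    wav = torch.randn(1, 16000)                       # stand-in for a real utterance
    z = model.feature_extractor(wav)
    _, idxs = model.vector_quantizer.forward_idx(z)   # shape (1, T, 2)

    # Assign an integer label to each distinct 2-dim index pair.
    pair2label, labels = {}, []
    for p in idxs[0].tolist():
        key = tuple(p)
        if key not in pair2label:
            pair2label[key] = len(pair2label)
        labels.append(pair2label[key])
    print(len(pair2label), "distinct labels in this utterance")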

segmentationFaults commented 8 months ago

Yeah, my inference code was copied from inference.py. I'll try continuation.

It's probably still under-trained. With exactly the same inference code, some texts come out as speech while others are pure noise.

cantabile-kwok commented 8 months ago

I agree with that guess. Inference without any context demands more from the model, and reaching that level may simply require longer training.

danablend commented 8 months ago

@segmentationFaults Did you end up with a script that works?

cantabile-kwok commented 8 months ago

@danablend Ah yes, please check out the recent commits from a few days ago. The README now explains how to perform inference. His question can be summarized as: randomly sampling from scratch is not very stable, while performing continuation (given context at the front) yields better results. This might be related to insufficient training.

segmentationFaults commented 8 months ago

@segmentationFaults Did you end up with a script that works?

Up until 1000k steps, it still generates noise for some texts.

danablend commented 8 months ago

@segmentationFaults Wondering if you experienced the same thing: I have trained for 400k steps so far (7 epochs). With continuation, it sounded better and better up to 300k steps, but now at 400k the output sounds worse than at 250k-300k (pronunciations are less accurate and the audio is noisier). Did you experience the same thing?

I am just wondering, because maybe this is fine and I should just let it keep training, and it will become very good, like in the paper and on the demo page, after 1000k+ steps?