Jiawch / paper-reading


SPEECH

TEXT TO SPEECH

No. Title Thinking
1. s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis Related to long-form synthesis.
2. TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis I am not convinced by this one; it is unclear which component actually works. 1) The baseline setting is unfair. 2) For the vibration issue the authors use a frequency-domain discriminator (freq-D), but apart from a small MOS gain nothing shows that freq-D is what fixes the vibration. 3) In my experience, discriminating directly on the vocoder's spectrum does not give striking improvements; it is too indirect.
3. Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators Improves the Parallel WaveGAN (PWG) discriminator. 1) A large receptive field covers long-term variations of the harmonic component and penalizes any unwanted aperiodic noise components. 2) Small convolution kernels focus on the detailed high-frequency content, because the characteristics of the noise component vary rapidly. 3) In my view, what really works is the conditioning and the large kernels.
4. END-TO-END ADVERSARIAL TEXT-TO-SPEECH 1) DTW, 2) GAN-TTS, 3) projection-embedding D (discriminator).
5. High Fidelity Speech Synthesis with Adversarial Networks GAN-TTS
6. A Spectral Energy Distance for Parallel Speech Synthesis They add different noise; could we add different phase instead?
7. A Spectral Energy Distance for Parallel Speech Synthesis Adds a repulsive term on top of the STFT loss to fix the robotic ("electronic") sound.
8. DiffWave: A Versatile Diffusion Model for Audio Synthesis A new kind of generative model.
9. EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture This paper is also interesting: it works on alignment and might enrich the DTW approach used in EATS.
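
The repulsive term mentioned in item 7 can be sketched as follows. This is only an illustration of the idea, not the authors' code: a single-resolution L1 magnitude distance stands in for whatever spectrogram distance the paper uses, and `magnitude_spectrogram`, `spec_dist`, `energy_distance_loss`, and the frame settings are hypothetical names and values.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=256, hop=64):
    """Magnitude STFT with a Hann window, shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spec_dist(a, b):
    """Mean L1 distance between two magnitude spectrograms (assumed distance)."""
    return float(np.mean(np.abs(magnitude_spectrogram(a) - magnitude_spectrogram(b))))

def energy_distance_loss(real, fake_1, fake_2):
    """fake_1 and fake_2 are two generator samples for the same input,
    drawn with different noise."""
    # Attractive terms: pull both samples toward the real audio.
    attract = spec_dist(real, fake_1) + spec_dist(real, fake_2)
    # Repulsive term: push the two samples away from each other.
    repel = spec_dist(fake_1, fake_2)
    return attract - repel
```

With the repulsive term removed, this reduces to a plain STFT loss on two samples, which is the baseline the note contrasts against.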

SINGING VOICE SYNTHESIS SYSTEM

No. Title Thinking
1. Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher TTS -> SVS. 1) Models F0 and duration. 2) Is modeling both with a GMM better than MSE, just as WaveNet can fit the waveform with a GMM? 3) Domain adaptation helps on long (sustained) notes.
2. Speech-to-Singing Conversion in an Encoder-Decoder Framework code. Preserves some of its characteristics (e.g., speaker identity, linguistic content) while modifying certain others (melody, phoneme durations). 1) Which vocoder is used for multi-speaker? Griffin-Lim. 2) How are speech and singing aligned? By direct time stretching, whether or not they actually line up. 3) A silent-frame removal module.
3. Unsupervised Singing Voice Conversion
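4. WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN code. 1) Input and output have the same length, so how is the input expanded to that length in the first place? It uses frame-wise phoneme annotations, as in NPSS; it turns out WGANSing treats duration and F0 as known conditions and expands the input first.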
5. A Combination of Model-based and Feature-based Strategy for Speech-to-Singing Alignment alignment approaches
6. A Dual Alignment Scheme for Improved Speech-to-Singing Voice Conversion alignment approaches
7. A Universal Music Translation Network code. 1) The pitch of the input audio clip was changed locally.
8. Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices Model-based STS, non-nn model
9. Learning Singing From Speech Male/female conversion; average F0.
10. I2R Speech2Singing Perfects Everyone’s Singing Rhythm correction by DTW
11. DATA EFFICIENT VOICE CLONING FOR NEURAL SINGING SYNTHESIS Learned speaker embedding; strategy for new speakers; fine-tuning strategy.
12. HMM-based singing voice synthesis system using pitch-shifted pseudo training data singing alignment, Pitch-shifted Pseudo Training
13. A Strategy for Improved Phone-Level Lyrics-to-Audio Alignment for Speech-to-Singing Synthesis, Applying Spectral Normalisation and Efficient Envelope Estimation and Statistical Transformation for the Voice Conversion Challenge 2016 Post-processing maps the pitch scale from singing to speech; could we apply the inverse transform? Phase correction model: could it help the vocoder?
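
Several entries above (items 5, 6 and 10) rely on aligning a spoken sequence to a sung one. A minimal sketch of classic DTW on 1-D feature sequences, assuming an absolute-difference local cost; `dtw_path` is a hypothetical helper, not code from any of the listed papers, which operate on richer features.

```python
def dtw_path(seq_a, seq_b):
    """Return (total_cost, path) aligning two 1-D sequences with classic DTW."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

# Toy example: each "speech" frame is held twice as long in the "singing" version.
speech = [1, 2, 3]
singing = [1, 1, 2, 2, 3, 3]
total_cost, path = dtw_path(speech, singing)
```

Each `(i, j)` pair in `path` says that frame `i` of the speech should be stretched to cover frame `j` of the singing, which is the kind of rhythm correction item 10 describes.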

SEPARATION

No. Title Thinking
1. Probabilistic Permutation Invariant Training for Speech Separation An improvement on PIT; worth looking at the PIT code.
2. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation, Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks PIT, utterance-level PIT
3. IMPLICIT FILTER-AND-SUM NETWORK FOR MULTI-CHANNEL SPEECH SEPARATION multi-channel
4. RETHINKING THE SEPARATION LAYERS IN SPEECH SEPARATION NETWORKS Cascade; no teacher forcing; mentions, but does not test, that the speaker number can be …
5. Listening and Grouping: An Online Autoregressive Approach for Monaural Speech Separation Possibly related to cascade.
6. Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation Possibly related to cascade.
7. A comprehensive study of speech separation: spectrogram vs waveform separation Survey of BSS (blind source separation).
8. Research Advances and Perspectives on the Cocktail Party Problem and Related Auditory Models Survey of the cocktail party problem.
9. Recursive speech separation for unknown number of speakers unknown speaker number
10. ONE SHOT LEARNING FOR SPEECH SEPARATION Does this suggest that meta-learning works better than style transform when the source domain and target domain are close?
11. Serialized Output Training for End-to-End Overlapped Speech Recognition, STREAMING MULTI-SPEAKER ASR WITH RNN-T Interesting: in ASR, SOT (first-in, first-out) works better than PIT. A 2020 method; no one has done this yet.
12. Applying Spectral Normalisation and Efficient Envelope Estimation and Statistical Transformation for the Voice Conversion Challenge 2016 An envelope model better than linear regression?
13. Mixup-breakdown: a consistency training method for improving generalization of speech separation models How would mixup behave when applied to separation?
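
For reference while reading items 1, 2 and 11, utterance-level PIT can be sketched in a few lines. This is a minimal illustration with an MSE criterion on plain Python lists; `mse` and `upit_loss` are hypothetical names, and real implementations work on batched tensors.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def upit_loss(estimates, references):
    """Utterance-level PIT: choose one speaker permutation for the whole utterance.

    Returns (best_loss, best_perm), where best_perm[s] is the index of the
    estimate assigned to reference speaker s.
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[p], references[s]) for s, p in enumerate(perm)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# The network emitted the two speakers in swapped order.
references = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
estimates = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
loss, perm = upit_loss(estimates, references)
```

The search enumerates every permutation, so its cost grows factorially with the number of speakers.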

leverages additional speaker information to help separation

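wavesplit: not solely optimized for identification but also for the reconstruction of separated speech

wavesplit: the loss is related to a repulsive loss; simplifying k-means so that it can also be trained end-to-end is clever. It looks a lot like VQ-VAE; how are the two related?

LPC (linear predictive coding): related to text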

No. Title Thinking

TRANSFORM

No. Title Thinking
1. Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural Networks F0 modeling; CQT transform.
2. UPSAMPLING ARTIFACTS IN NEURAL AUDIO SYNTHESIS This paper is devoted to analyzing upsampling problems.

LONG-FORM

No. Title Thinking

Others related to speech

No. Title Thinking
1. VOCAL MELODY EXTRACTION USING PATCH-BASED CNN melody/F0 extractor
2. FastPitch: Parallel Text-to-speech with Pitch Prediction Pitch modeling. Unlike FastSpeech 2 (fs2), it predicts the average pitch of every character before length expansion; I am curious how the target for this average pitch is obtained.
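
One plausible way to build the per-character pitch target asked about in item 2, assuming frame-level F0 and per-symbol durations (for example from an external aligner) are already available. `average_pitch_per_symbol` is a hypothetical helper, and skipping unvoiced frames (marked as 0.0) is an assumption rather than something confirmed from the paper.

```python
def average_pitch_per_symbol(frame_f0, durations):
    """Average voiced-frame F0 over the frames assigned to each input symbol.

    frame_f0: per-frame F0 in Hz, with 0.0 marking unvoiced frames (assumption).
    durations: number of frames per symbol; must sum to len(frame_f0).
    """
    if sum(durations) != len(frame_f0):
        raise ValueError("durations must cover every frame exactly once")
    averages, start = [], 0
    for dur in durations:
        voiced = [f for f in frame_f0[start:start + dur] if f > 0.0]
        # A symbol with no voiced frames gets 0.0 as its target.
        averages.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += dur
    return averages

frame_f0 = [100.0, 110.0, 0.0, 0.0, 200.0, 220.0, 240.0]
durations = [2, 2, 3]
targets = average_pitch_per_symbol(frame_f0, durations)
```

The result has one value per input symbol, so it can be predicted before the sequence is expanded to frame length.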

Computer Vision

No. Title Thinking
1. Wavelet Integrated CNNs for Noise-Robust Image Classification

Flow

No. Title Thinking
1.

Phase

No. Title Thinking
1. GANSYNTH: ADVERSARIAL NEURAL AUDIO SYNTHESIS, LEARNING AUDIO REPRESENTATIONS VIA PHASE PREDICTION Demo. These two papers predict the phase directly.
2. Phase reconstruction based on recurrent phase unwrapping with deep neural networks Estimates phase from instantaneous frequency (IF) and group delay (GD).
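
The two quantities named in item 2 are phase derivatives, which a short NumPy sketch makes concrete. This shows only the definitions on a phase matrix of shape (frames, bins), not the paper's recurrent unwrapping method; the function names are hypothetical, and units are radians per frame and radians per bin.

```python
import numpy as np

def instantaneous_frequency(phase):
    """Time derivative of the unwrapped phase, shape (frames - 1, bins)."""
    return np.diff(np.unwrap(phase, axis=0), axis=0)

def group_delay(phase):
    """Negative frequency derivative of the unwrapped phase, shape (frames, bins - 1)."""
    return -np.diff(np.unwrap(phase, axis=1), axis=1)

def phase_from_if(inst_freq, first_frame_phase):
    """Integrate IF over time to rebuild the unwrapped phase from its first frame."""
    later = first_frame_phase[None, :] + np.cumsum(inst_freq, axis=0)
    return np.concatenate([first_frame_phase[None, :], later], axis=0)

# Synthetic phase that advances 0.3 rad per frame and -0.2 rad per bin,
# wrapped into (-pi, pi] as an STFT phase would be.
t = np.arange(50)[:, None]
k = np.arange(40)[None, :]
phase = np.angle(np.exp(1j * (0.3 * t - 0.2 * k)))
inst_freq = instantaneous_frequency(phase)
rebuilt = phase_from_if(inst_freq, phase[0])
```

Integrating IF recovers the phase only up to wrapping, so comparisons should be made on `exp(1j * phase)` rather than on the raw angles.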

Transducer

No. Title Thinking
1. Sequence Transduction with Recurrent Neural Networks, FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization rnnt
2. Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments transducer tts

Speech Feature

No. Title Thinking
1. Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

Peking Opera

No. Title Thinking
1. Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network

Code

HiFiSinger

DDPM

Blog

Diffusion Models and Deep Learning