Jiawch / paper-reading


SPEECH

TEXT TO SPEECH

No.	Title	Thinking
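1.	s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis	Related to long-form synthesis.
2.	TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis	I don't really agree with this one; it is unclear which component actually works. 1) The baseline setting is unfair. 2) For the vibration issue the authors use a frequency-domain discriminator (freq-D), but apart from a slight MOS improvement nothing shows that freq-D is what fixes the vibrations. 3) From past experience, applying a discriminator directly to the vocoder's spectrum does not give striking results; it is not a direct approach.
3.	Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators	An improvement to the PWG discriminator. 1) A large receptive field covers long-term variations of the harmonic component and penalizes any unwanted aperiodic noise components. 2) Small convolution kernels focus on the detailed high-frequency content, because the characteristics of the noise component vary rapidly. 3) In my view, what really works is the conditioning and the large convolution kernel.
4.	END-TO-END ADVERSARIAL TEXT-TO-SPEECH	1) DTW, 2) GAN-TTS, 3) projection-embedding discriminator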
5.	High Fidelity Speech Synthesis with Adversarial Networks	GAN-TTS
6.	A Spectral Energy Distance for Parallel Speech Synthesis	They add different noise; could we add different phase instead?
7.	A Spectral Energy Distance for Parallel Speech Synthesis	Adds an extra repulsive term to the STFT loss to fix electronic-sounding artifacts.
8.	DiffWave: A Versatile Diffusion Model for Audio Synthesis	A new kind of generative model.
9.	EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture	This paper is also interesting for how it handles alignment; it might enrich the DTW approach from EATS noted earlier.
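
The repulsive term noted for the Spectral Energy Distance paper (rows 6 and 7) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `l1_distance` is a toy stand-in for a spectrogram-based distance, and all function names are hypothetical. Two outputs `y1` and `y2`, generated for the same input with different noise, are pulled toward the target `x` and pushed apart from each other.

```python
def l1_distance(a, b):
    """Toy distance between two equal-length feature vectors (assumption:
    a real system would compare spectrogram features instead)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def energy_distance_loss(x, y1, y2, distance=l1_distance):
    """Attractive terms minus a repulsive term.

    x      : target features
    y1, y2 : two model outputs for the same input, drawn with different noise
    """
    attract = distance(x, y1) + distance(x, y2)
    repulse = distance(y1, y2)
    return attract - repulse


target = [1.0, 2.0, 3.0]
# Both samples equal the target: nothing to attract, nothing to repel.
loss_perfect = energy_distance_loss(target, target, target)
# Collapsed samples (identical but wrong) get no credit from the repulsive term.
collapsed = [0.0, 2.0, 3.0]
loss_collapsed = energy_distance_loss(target, collapsed, collapsed)
# Samples spread around the target are penalized less than collapsed ones.
loss_spread = energy_distance_loss(target, [0.0, 2.0, 3.0], [2.0, 2.0, 3.0])
```

Without the repulsive term, the collapsed and spread cases above would receive the same loss.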

SINGING VOICE SYNTHESIS SYSTEM

No.	Title	Thinking
1.	Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher	TTS -> SVS. 1) Models F0 and duration. 2) Is modeling both with a GMM better than MSE? Similar to how WaveNet can fit the waveform with a GMM. 3) Domain adaptation helps with long, sustained notes.
2.	Speech-to-Singing Conversion in an Encoder-Decoder Framework	code. Preserves some characteristics of the input (e.g., speaker identity, linguistic content) while modifying others (melody, phoneme durations). 1) Which vocoder is used for multi-speaker? Griffin-Lim. 2) How are speech and singing aligned? By direct time-stretching, regardless of whether they actually line up. 3) Silent frame removal module.
3.	Unsupervised Singing Voice Conversion
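4.	WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN	code. 1) The input and output have the same length, so how is the input expanded to that length in the first place? It uses frame-wise phoneme annotations, as in NPSS. So WGANSing treats duration and F0 as known conditions and expands the input first.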
5.	A Combination of Model-based and Feature-based Strategy for Speech-to-Singing Alignment	alignment approaches
6.	A Dual Alignment Scheme for Improved Speech-to-Singing Voice Conversion	alignment approaches
7.	A Universal Music Translation Network	code. 1) The pitch of the input audio clip was changed locally.
8.	Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices	Model-based STS, non-nn model
9.	Learning Singing From Speech	Male/female conversion; average F0.
10.	I2R Speech2Singing Perfects Everyone’s Singing	Rhythm correction by DTW
11.	DATA EFFICIENT VOICE CLONING FOR NEURAL SINGING SYNTHESIS	Learned speaker embedding; strategy for new speakers; fine-tuning strategy.
12.	HMM-based singing voice synthesis system using pitch-shifted pseudo training data	singing alignment, Pitch-shifted Pseudo Training
13.	A Strategy for Improved Phone-Level Lyrics-to-Audio Alignment for Speech-to-Singing Synthesis, Applying Spectral Normalisation and Efficient Envelope Estimation and Statistical Transformation for the Voice Conversion Challenge 2016	Post-processing maps the pitch scale from singing to speech; could we apply the inverse transform? Could the phase correction model help the vocoder?
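
Several entries above concern aligning speech with singing (rows 5 and 6 on alignment approaches, row 10 on rhythm correction by DTW). As a reference point, here is a minimal sketch of classic dynamic time warping over two 1-D feature sequences. It is the textbook algorithm, not the method of any specific paper listed; the function name and the absolute-difference frame cost are assumptions.

```python
def dtw(a, b):
    """Return (total_cost, path) aligning sequence a to sequence b.

    path is a list of (i, j) index pairs, monotonic in both sequences.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


# A "spoken" contour and a "sung" contour that holds the middle value longer.
total, path = dtw([1, 2, 3], [1, 2, 2, 3])
```

By contrast, the plain time-stretching noted in row 2 maps frames linearly and ignores where the content actually matches.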

SEPARATION

No.	Title	Thinking
1.	Probabilistic Permutation Invariant Training for Speech Separation	An improvement on PIT; worth looking at the PIT code.
2.	Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation, Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks	PIT, utterance-level PIT
3.	IMPLICIT FILTER-AND-SUM NETWORK FOR MULTI-CHANNEL SPEECH SEPARATION	multi-channel
4.	RETHINKING THE SEPARATION LAYERS IN SPEECH SEPARATION NETWORKS	Cascade; no teacher forcing; mentioned but not tested: the speaker number can be
5.	Listening and Grouping: An Online Autoregressive Approach for Monaural Speech Separation	Possibly related to cascade.
6.	Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation	Possibly related to cascade.
7.	A comprehensive study of speech separation: spectrogram vs waveform separation	BSS survey.
8.	Research Advances and Perspectives on the Cocktail Party Problem and Related Auditory Models	Cocktail party problem survey.
9.	Recursive speech separation for unknown number of speakers	unknown speaker number
10.	ONE SHOT LEARNING FOR SPEECH SEPARATION	Does this suggest that meta-learning works better than style transfer when the source domain and target domain are close?
11.	Serialized Output Training for End-to-End Overlapped Speech Recognition, STREAMING MULTI-SPEAKER ASR WITH RNN-T	Interesting: in ASR, SOT (first-in, first-out) works better than PIT. A 2020 method that nobody has tried yet.
12.	Applying Spectral Normalisation and Efficient Envelope Estimation and Statistical Transformation for the Voice Conversion Challenge 2016	An envelope model better than linear regression?
13.	Mixup-breakdown: a consistency training method for improving generalization of speech separation models	What happens when mixup is applied to separation?
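
The PIT entries above (rows 1 and 2) rest on one idea: the loss is evaluated under every assignment of model outputs to reference speakers, and the smallest value is used. A minimal sketch of the utterance-level variant, where a single permutation is chosen for the whole utterance, might look like this. The mean-squared-error criterion and all names are assumptions for illustration.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def upit_loss(estimates, references):
    """Return (loss, best_perm); best_perm[i] is the index of the reference
    assigned to estimates[i]."""
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-speaker error under this output-to-speaker assignment.
        loss = sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


references = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
# The model emitted the two speakers in swapped order.
estimates = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
loss, perm = upit_loss(estimates, references)
```

The exhaustive search costs n! evaluations, which is fine for two or three speakers but motivates alternatives such as the recursive and SOT-style approaches listed above.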

leverages additional speaker information to help separation

information from target speaker

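Wavesplit: not solely optimized for identification but also for the reconstruction of separated speech.

Wavesplit: the loss is related to a repulsive loss. Simplifying k-means so that it can also be trained end-to-end is clever. It looks a lot like VQ-VAE; what is the connection?

LPC (linear predictive coding): related to text.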

No.	Title	Thinking

TRANSFORM

No.	Title	Thinking
1.	Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural Networks	F0 modeling; CQT transform.
2.	UPSAMPLING ARTIFACTS IN NEURAL AUDIO SYNTHESIS	This paper specifically analyzes upsampling problems.

LONG-FORM

No.	Title	Thinking

Others related to speech

No.	Title	Thinking
1.	VOCAL MELODY EXTRACTION USING PATCH-BASED CNN	melody/F0 extractor
2.	FastPitch: Parallel Text-to-speech with Pitch Prediction	Pitch modeling. Unlike FastSpeech 2, it predicts the average pitch of every character before expansion. I am curious how the target for this average pitch is obtained.
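
On the question in row 2 about where the per-character average pitch target comes from: as far as I recall, the FastPitch paper averages the frame-level F0 over each input symbol using the extracted durations. A minimal sketch of that averaging follows. Treating F0 = 0 as unvoiced and skipping those frames is an assumption here, as are the function and variable names.

```python
def average_pitch_per_symbol(f0, durations):
    """Collapse frame-level F0 into one target value per input symbol.

    f0        : frame-level F0 values (0.0 marks an unvoiced frame, by assumption)
    durations : number of frames for each input symbol; must sum to len(f0)
    """
    if sum(durations) != len(f0):
        raise ValueError("durations must cover every frame exactly once")
    targets, start = [], 0
    for d in durations:
        # Average only the voiced frames that belong to this symbol.
        voiced = [v for v in f0[start:start + d] if v > 0.0]
        targets.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += d
    return targets


f0 = [100.0, 110.0, 0.0, 0.0, 200.0, 220.0, 240.0]
durations = [2, 2, 3]  # three symbols covering seven frames
targets = average_pitch_per_symbol(f0, durations)
```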

Computer Vision

No.	Title	Thinking
1.	Wavelet Integrated CNNs for Noise-Robust Image Classification

Flow

No.	Title	Thinking
1.

Phase

No.	Title	Thinking
1.	GANSYNTH: ADVERSARIAL NEURAL AUDIO SYNTHESIS, LEARNING AUDIO REPRESENTATIONS VIA PHASE PREDICTION	Demo. These two papers predict phase directly.
2.	Phase reconstruction based on recurrent phase unwrapping with deep neural networks	Estimates phase from instantaneous frequency (IF) and group delay (GD).
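
The two quantities named in row 2 can be computed from an STFT phase matrix by finite differences: instantaneous frequency is the phase difference along time, and group delay is the negative phase difference along frequency, each wrapped to its principal value. This is a generic sketch of those definitions, not the paper's network; all names are hypothetical.

```python
import math


def wrap(x):
    """Wrap an angle to the interval [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def instantaneous_frequency(phase):
    """phase[t][k] is the phase at frame t, bin k.
    Returns the wrapped time difference for frames 1..T-1 (radians per frame)."""
    return [
        [wrap(cur - prev) for prev, cur in zip(phase[t - 1], phase[t])]
        for t in range(1, len(phase))
    ]


def group_delay(phase):
    """Returns the negated wrapped frequency difference for bins 0..K-2
    (radians per bin)."""
    return [
        [-wrap(row[k + 1] - row[k]) for k in range(len(row) - 1)]
        for row in phase
    ]


# Synthetic wrapped phase that advances 0.5 rad per frame and -0.3 rad per bin.
phase = [[wrap(0.5 * t - 0.3 * k) for k in range(4)] for t in range(4)]
if_values = instantaneous_frequency(phase)
gd_values = group_delay(phase)
```

Because both features are differences, recovering the phase itself requires integrating (unwrapping) them, which is the step the paper's title refers to.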

Transducer

No.	Title	Thinking
1.	Sequence Transduction with Recurrent Neural Networks, FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization	RNN-T
2.	Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments	transducer tts

Speech Feature

No.	Title	Thinking
1.	Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

Peking opera

No.	Title	Thinking
1.	Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network

Code

DDPM

Diffusion Models and Deep Learning