Is there a way to compute phonetic posteriorgrams (PPGs) from pretrained ASR models? It would be very helpful for tasks like voice conversion.
We don't provide such functions, but if you use ASR with the joint CTC/attention model, you can extract token posteriorgrams from the CTC output. Note that, currently, the TIMIT recipe is the only one that deals with phonemes.
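For illustration, here is a minimal sketch of extracting frame-level token posteriors from the CTC branch of an ESPnet2 joint CTC/attention model. The model tag is a placeholder, and attribute/method names (`asr_model`, `encode`, `ctc.log_softmax`) may differ across ESPnet versions:

```python
import torch
from espnet2.bin.asr_inference import Speech2Text

# "espnet/model-tag-here" is a placeholder; substitute a real model tag.
speech2text = Speech2Text.from_pretrained("espnet/model-tag-here")
model = speech2text.asr_model

speech = torch.randn(1, 16000)                  # dummy 1 s waveform at 16 kHz
lengths = torch.tensor([speech.size(1)])

with torch.no_grad():
    enc_out, _ = model.encode(speech, lengths)  # (1, T', D), subsampled in time
    # (1, T', vocab) log-posteriors over the token set, incl. the CTC blank.
    posteriors = model.ctc.log_softmax(enc_out)
```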
@zhouyong64 If you were to train a joint phoneme-grapheme model, it is possible to get phonetic posteriors. I have a fork of ESPnet (an older version) with some scripts to do this, using a shared encoder and two decoders, one for phonemes and one for graphemes.
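A toy sketch of the shared-encoder idea (not the fork's actual code, which used two attention decoders; simple linear heads are used here just to illustrate where the phonetic posteriors come from):

```python
import torch
import torch.nn as nn

class DualHeadASR(nn.Module):
    """One shared encoder, two output heads: phonemes and graphemes."""

    def __init__(self, idim: int, hdim: int, n_phonemes: int, n_graphemes: int):
        super().__init__()
        self.encoder = nn.LSTM(idim, hdim, num_layers=3, batch_first=True)
        self.phoneme_head = nn.Linear(hdim, n_phonemes)    # -> phonetic posteriors
        self.grapheme_head = nn.Linear(hdim, n_graphemes)  # -> grapheme posteriors

    def forward(self, x: torch.Tensor):
        h, _ = self.encoder(x)  # shared representation, (B, T, hdim)
        return (self.phoneme_head(h).log_softmax(-1),
                self.grapheme_head(h).log_softmax(-1))

# ppg, _ = DualHeadASR(80, 256, 40, 30)(torch.randn(1, 100, 80))
```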
Interesting. Thanks, @sknadig, for letting us know!
Hi @zhouyong64, since ESPnet mainly focuses on E2E ASR, it may not be so straightforward to extract the "PPG" referred to in VC research. Most PPGs are extracted from the AM of a DNN/HMM ASR system; these really are posteriorgrams, because we take the layer before the senone classification layer. Some recent works instead extract the so-called "bottleneck feature" (BNF), which is also a phonetically rich feature from the DNN/HMM, but without a direct physical interpretation like the PPG.

On the other hand, it would be interesting to use, for example, the encoder hidden states as a PPG-like input feature for VC. I have investigated the use of the ASR encoder in VC in a recent preprint of mine, though not in a PPG-VC way. It would be cool if you tried it and let us know what you find! :)
@sw005320 It seems to me that token posteriorgrams lack a "temporal dimension". What if we used the output attention to pinpoint the location of each token, and then "broadcast" (as in NumPy's broadcasting) the token posteriorgrams to match the temporal dimension of a normal PPG? The result would seem close to a normal PPG. What do you think?
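To make the "broadcast" idea concrete, here is a small NumPy sketch. The shapes and names (`att`, `tok_post`) are purely illustrative; in practice the attention matrix would come from the decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
# att: decoder attention weights, (n_tokens, n_frames)
# tok_post: per-token posteriors over the vocabulary, (n_tokens, vocab)
att = rng.random((5, 100))
att /= att.sum(axis=0, keepdims=True)           # normalize over tokens per frame
tok_post = rng.random((5, 30))
tok_post /= tok_post.sum(axis=1, keepdims=True)

# Hard "broadcast": each frame copies the posterior of its most-attended token.
frame_ppg_hard = tok_post[att.argmax(axis=0)]   # (n_frames, vocab)

# Soft variant: attention-weighted mixture of token posteriors per frame.
frame_ppg_soft = att.T @ tok_post               # (n_frames, vocab)
```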
@unilight You've already done VC with an ASR encoder, as in your paper. Are you suggesting that I follow the method in your paper? Just to make sure I understand what you mean by "try it".
> @sw005320 It seems to me that token posteriorgrams lack a "temporal dimension". What if we used the output attention to pinpoint the location of each token, and then "broadcast" (as in NumPy's broadcasting) the token posteriorgrams to match the temporal dimension of a normal PPG? The result would seem close to a normal PPG. What do you think?
Our encoder-decoder usually has a CTC branch, and the CTC output actually keeps the (subsampled) original temporal resolution, so you can extract frame-level posteriors from there.
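For VC purposes, the subsampled posteriors may need to be brought back to the original frame rate. A minimal sketch, assuming a typical subsampling factor of 4 (check the frontend of your specific model):

```python
import numpy as np

ctc_post = np.random.rand(200, 52)           # (T', vocab) from the CTC branch
# Repeat each subsampled frame 4 times to roughly match 10 ms frame features.
frame_ppg = np.repeat(ctc_post, 4, axis=0)   # (4 * T', vocab)
```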
> @unilight You've already done VC with an ASR encoder, as in your paper. Are you suggesting that I follow the method in your paper? Just to make sure I understand what you mean by "try it".
@zhouyong64 No. What I did was train a parallel VC model pretrained with ASR. The framework is still parallel (one-to-one). The point is that, in my paper, ASR pretraining was not robust to reductions in training data, so I wonder whether this means the output of the ASR encoder is not good enough, e.g., it contains too much speaker information or is not discriminative enough. (Though, in the paper, I did not rigorously analyze this.) PPG-VC, on the other hand, can achieve nonparallel or even any-to-one VC; its framework is totally different from parallel VC. What I encourage you to try is to use the ASR encoder output as a kind of PPG-like feature.
These differences may seem subtle, but they are important for distinguishing between and understanding VC methods.
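A minimal sketch of the suggested direction, reusing the ESPnet2 model loaded earlier; `vc_model` is purely hypothetical and stands in for whatever VC network would consume the features:

```python
import torch

def extract_ppg_like(model, speech: torch.Tensor) -> torch.Tensor:
    """Return encoder hidden states of shape (T', D) for one utterance."""
    lengths = torch.tensor([speech.size(0)])
    with torch.no_grad():
        enc_out, _ = model.encode(speech.unsqueeze(0), lengths)
    return enc_out.squeeze(0)

# feats = extract_ppg_like(speech2text.asr_model, waveform)
# mel_out = vc_model(feats)  # hypothetical any-to-one VC model
```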