Hi, thanks for the hint - I actually skimmed through the paper after I finished the implementation - there is a lot of overlap with ForwardTacotron and it could definitely be worth a try. As for the autoregressive ForwardTacotron, it worked, but I found that it produced lower mel quality (I didn't do an exhaustive test though) - the main problem was (probably) that I trained with teacher forcing and thus got a very low loss very quickly. With additional pre-nets and dropout as in Tacotron the quality improved slightly, but was still lower than the non-autoregressive model.
Thanks, having read this paper recently https://arxiv.org/abs/1909.01145 I've come to think that autoregression hurts more than it helps ;). Furthermore, considering that in the Tencent paper above they found that the power of Taco does not seem to come from the attention but from the pre/postnets, it's not surprising you ended up with this model.
I think I'll try the length regulator from your repo in a Taco2 setting (as I see it, you use the Taco1 CBHG etc. layers) and see how it goes. It also might make sense to use a classical forced aligner instead of training a vanilla Taco first just for the alignments. I'll keep you posted when I find something interesting ;)
Hey, that paper is actually something I want to try soon for ForwardTacotron as well - although I am not sure if it would be beneficial for a non-autoregressive model. Trying Taco2 definitely makes sense; I also had some success with conv-only models similar to what they use in MelGAN, so there is probably room for improvement! I also thought about using an STT model for extracting the durations. Keep us posted if you find anything interesting!
Oh right, I actually found your repo there https://github.com/NVIDIA/tacotron2/issues/280
I'm currently just using DFR 0.2 without MMI because of some reports there, and I would also first have to adapt the code to the phone set instead of characters. But this should be obsolete with an explicit duration model.
It's interesting that this duration model is trained together with the rest instead of separately.
I'm quite eager to get rid of attention as it's really the #1 source of issues I encounter.
Yes I found that the model without attention is really robust. It seems to be the general trend to get rid of it. Also worth a look: https://arxiv.org/abs/2006.04558
Oh thanks, didn't see that one yet.
I got the impression that the two lines of research are either to use explicit durations (the IBM model, the new Facebook model, Tencent, etc.) or to try to improve on the attention mechanism, like monotonic attention or https://google.github.io/tacotron/publications/location_relative_attention/index.html
But you really wonder how much of an attention model this actually is if you just use it to attend to a single input at a time.
Yeah, that's true. From my experience the duration models perform well enough and they are much faster. The next thing I will try is a different approach for extracting durations from the data, probably with a simple STT model with CTC loss.
Just integrated your model into my training framework and preprocessing, together with MelGAN, and it already works quite well after just 20k steps. Audible but noisy, very smooth spectra. Let's see how it evolves.
I've also prepared the integration of Taco2 layers but want a baseline first.
I wonder if training the duration model separately would be beneficial but I guess it won't make a big difference.
Any reason why you picked L1 loss?
I mostly plan to try forced alignment next, perhaps subphone units like in HMM systems, and a couple of smaller things similar to DurIAN, like the skip encoder and the positional index.
Cool, keep me updated - the spectra get much better up to 200k steps in my experience. No special reason for L1 over L2, I don't think it makes a big difference. I am now trying to extract the durations with a simple conv-LSTM STT model. I use the output log-probabilities to align the mels to the input text with a graph search algorithm. It works pretty well, but so far I don't see it performing better than the alignments extracted from the taco.
It started to converge a bit at around 100k steps (batch size 32). Stopped for now and am trying the suggestions from alexdemartos (prenet output into the duration model, duration model from FastSpeech).
Implemented the positional index here: https://github.com/vocalid/tacotron2/blob/b958c7d889b7b6161f56f36b2d525650ff55df3c/model.py#L41 But I still have to see how to improve that. The torch.gather solution trains at 0.4 s/iteration while this one takes about 4 s (actually 40 s if you don't move expanded to the GPU first, as in the link).
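For reference, a vectorized expansion with a per-phone positional index could look roughly like this - just a sketch of mine (repeat_interleave instead of gather, made-up names), not the code from either repo:

```python
import torch

def expand_with_position(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Length-regulate a single utterance and append a positional index channel.

    x: (T_in, C) encoder outputs, durations: (T_in,) integer frames per symbol.
    Returns (T_out, C + 1); the extra channel counts 0..d_i-1 within each symbol.
    """
    expanded = torch.repeat_interleave(x, durations, dim=0)        # (T_out, C)
    ends = torch.cumsum(durations, dim=0)
    starts = ends - durations                                      # frame index where each symbol begins
    starts_per_frame = torch.repeat_interleave(starts, durations)  # (T_out,)
    pos = torch.arange(int(durations.sum()), device=x.device) - starts_per_frame
    return torch.cat([expanded, pos.unsqueeze(-1).float()], dim=-1)
```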
But that's next. Do you plan to release the alignment model?
I planned to use something like https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/alignment/state_align/forced_alignment.py which worked quite well on small datasets in my experience.
Yeah, if the model works well I'll definitely open source it. If you're interested, check out the (researchy) branch 'aligner' and run train_aligner.py and then force_alignment.py. The HMM models seem to be standard for extracting alignments, though I want something independent of third parties. I'd be really interested in how the HMM approach works though!
Hi, just to share with you - I did a couple of tests extracting the durations with an STT model (a simple conv-LSTM such as a standard OCR model). I overfitted the STT model on the train set, extracted the prediction probs and used a graph-search method to align phonemes and mel steps (based on maximising the prediction prob for the current phoneme at each mel step). It works pretty well and the results are intelligible, but the prosody is slightly worse and more robotic than with the taco-extracted durations. Any luck yet for you with the forced alignment?
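The graph search is basically a monotonic dynamic program over the STT output probabilities; a simplified sketch of the idea (not my actual code, names are made up):

```python
import numpy as np

def align_durations(log_probs: np.ndarray, phone_ids: np.ndarray) -> np.ndarray:
    """Monotonic alignment of mel frames to a phoneme sequence.

    log_probs: (T_frames, n_symbols) per-frame log-probabilities from the STT model.
    phone_ids: (N_phones,) target phoneme ids for the utterance.
    Returns (N_phones,) integer durations summing to T_frames, chosen to maximise
    the summed log-probability of the current phoneme at each frame.
    """
    T, N = log_probs.shape[0], len(phone_ids)
    scores = log_probs[:, phone_ids]               # (T, N) score of phone i at frame t
    dp = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=np.int8)         # 0 = stay on phone, 1 = advance to next phone

    dp[0, 0] = scores[0, 0]
    for t in range(1, T):
        for i in range(min(t + 1, N)):
            stay = dp[t - 1, i]
            advance = dp[t - 1, i - 1] if i > 0 else -np.inf
            if advance > stay:
                dp[t, i], back[t, i] = advance + scores[t, i], 1
            else:
                dp[t, i], back[t, i] = stay + scores[t, i], 0

    # Backtrack from the last frame/phone and count frames per phone.
    durations = np.zeros(N, dtype=np.int64)
    i = N - 1
    for t in range(T - 1, -1, -1):
        durations[i] += 1
        if back[t, i] == 1:
            i -= 1
    return durations
```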
Hey. Haven't tried it yet. I ran lots of variations: Taco1 vs Taco2 postnet, prenet vs no prenet. I found the prenet before the CBHG didn't really make a difference, and neither did the postnet choice.
Generally I often see differences in the training loss but none in validation loss.
The biggest difference was exchanging the duration predictor for the FastSpeech-style one and putting it after the CBHG instead of before. Unfortunately I haven't tested yet which of the two modifications matters more.
Generally I see a lot more artefacts than with our Taco2 model. I train MelGAN on mel spectra generated with ground-truth durations and it reconstructs them very well. Once I feed it mel spectra generated with predicted durations, things get ugly.
Also training with positional indices at the moment but no significant difference either.
I'm also interested in trying a generative model for durations as in https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/
Another interesting aspect: I see training loss decreasing after 500k+ steps but validation loss is pretty much stable after about 50k or so. Seems a bit early for overfitting to me.
Thanks for the update. As for the duration predictor - I've had the problem of overfitting when I put it after the prenet, plus the mels looked worse. As for the increasing validation loss - I think this is kind of normal, as the model is not teacher forced and the predicted patterns differ from the ground truth; the audio quality still improves up until 200k steps or so in my experience. I normally don't even look at the validation loss, to be honest, and judge more by the audio. Also, I have seen quite some artifacts with a pretrained LJSpeech MelGAN + ForwardTacotron, but fewer with WaveRNN. On our male custom dataset there are quite few artifacts with MelGAN - I increased MelGAN's receptive field, maybe that helps...
One more thing, when I train the MelGAN, I usually mix ground truth mels with predicted ones as I think it makes training more stable - this could also be worth a try. If you only train on predicted spectra the problem could be that they differ too much from the GT as they are not really teacher forced (e.g. the pitch could be different etc.).
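In practice this can be as simple as a vocoder dataset that picks GT or GTA mels per item - a rough sketch, with a made-up file layout:

```python
import random
from pathlib import Path
import numpy as np
from torch.utils.data import Dataset

class MixedMelDataset(Dataset):
    """Serves (mel, wav) pairs for vocoder training, mixing ground-truth mels
    with GTA mels (generated with forced durations) at a fixed ratio."""

    def __init__(self, gt_mel_dir, gta_mel_dir, wav_dir, ids, gta_prob=0.5):
        self.gt_mel_dir = Path(gt_mel_dir)
        self.gta_mel_dir = Path(gta_mel_dir)
        self.wav_dir = Path(wav_dir)
        self.ids = list(ids)
        self.gta_prob = gta_prob

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        utt_id = self.ids[idx]
        mel_dir = self.gta_mel_dir if random.random() < self.gta_prob else self.gt_mel_dir
        mel = np.load(mel_dir / f'{utt_id}.npy')       # (n_mels, T)
        wav = np.load(self.wav_dir / f'{utt_id}.npy')  # the target audio is always the real recording
        return mel, wav
```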
Yeah, I've also increased the receptive field as proposed in a paper I forgot. I didn't see the huge improvement they reported, but well... Regarding mixing GT with GTA - I could have sworn I did that, but strangely I only find it in my WaveRNN codebase. And yes, it seemed to make it more robust.
I'll also try multispeaker. With vanilla Taco I could never get it to learn attention well for all speakers, but the spectra were pretty good, so I guess it should work well with this model. Especially considering that the models I trained using Merlin (mostly just 3 LSTMs on HTS labels) were very well able to produce and mix more than 1000 voices in combination with WORLD.
I would also assume that the model goes well with multispeaker; that's quite some work though. For that it probably makes sense to first find a quicker way of extracting durations. I am running another LJSpeech training now with MelGAN. I see improvements in audio quality up to 400k steps on the forward model if I test it with the standard pretrained MelGAN (fewer squeezy artefacts).
Seems it was an issue of patience again. The MelGAN loss is hard to interpret and just letting it run often helps. So I let it run over the weekend - acoustic model to 500k steps, MelGAN just another day or so - and it's definitely better now. https://drive.google.com/file/d/1YBsS7sxus_tw9PQdr0HVtScGo8Ccuolw/view?usp=sharing
Prosody not yet at the level of the Taco2 model I trained but we're getting closer.
And yes, definitely have to work on the aligner first before tackling multispeaker.
That's not too bad for MelGAN and the inference speed you get with both models. I would assume that the durations could be overfitted (did you check the duration validation loss?). I am also testing some model variations and I found that it helps to concat the LSTM output with the prenet output:
x = self.prenet(x)                      # prenet on the encoder outputs
x = self.lr(x, dur)                     # length regulator: expand to frame level
x_p, _ = self.lstm(x)                   # frame-level LSTM
x_p = F.dropout(x_p,
                p=self.dropout,
                training=self.training)
x_p = torch.cat([x, x_p], dim=-1)       # concat LSTM input (prenet/LR output) with LSTM output
x_p = self.lin(x_p)                     # project to mel dimension
This is closer to the Taco architecture, where the attention is also concatenated with the lstm output.
Well, validation loss is strange for me ;)
EDIT: zooming in doesn't really help
Seems like instant overfitting...
https://drive.google.com/file/d/1S__-0_3N2swYCsWu4TciZ6O9owkxX7eK/view?usp=sharing https://drive.google.com/file/d/1fIU8SfijwsUg_vEOSykgs7hXb3lO1Fh0/view?usp=sharing
Here are some results with the updated model trained for 320k steps together with the pretrained MelGAN from the repo.
Currently investigating this "overfitting" issue. I've been plotting the pre- and post-net mel validation error and it is at its lowest point at around 10k steps, then gradually increases.
Looking at mel spectra from the validation set, this is after 12k steps
After 81k steps
definitely more detail.
Then I've also plotted the error here: 12k steps
81k steps
Seems the error in the formant structure is really higher. I would assume that there might just be some ... shift in the frequency axis that messes up the loss but obviously still sounds fine.
Ground truth
Very cool. That's what I expected too, the structure gets more pronounced but may vary from the ground truth (e.g. different pitch or voice going a bit up instead of down) - as the model is not teacher forced.
BTW I found that with the melgan preprocessing it is necessary to do some cherry picking with the tts model, but training to 400K steps definitely is worth it.
Oh, I got the alignment with HTK to work and while it generally works fine, I'm currently getting more "raspy" voices and I'm not completely sure if it's because of the alignment. My main issue is that I'm not sure how to best handle the word boundaries. Tacotron usually works fine with spaces as word boundary symbols, but they mess up the aligner in most cases, except when there's really a pause between words.
I think the best solution might be to not have them in the alignment and then use a skip encoder like in DurIAN, where they keep the word boundaries as separate symbols until the state expansion. If I just drop them completely, it strings the words together without any pause and sounds awful.
Well, having diacritics as separate symbols is not really helpful either...
That's also my experience with durations from an STT model. I tried (1) generating phoneme boundaries (and word boundaries) from the output probabilities and (2) extracting the exact phoneme positions in time and splitting right between them. Both resulted in lower mel quality.
Should change the name of this issue ;)
Still seeing generalization issues. I implemented multispeaker, injecting speaker codes after the CBHG, but generally it always defaults to one voice (or, I'm not fully sure, maybe an average voice), except if I pick a sentence from the training set with the respective speaker code. Strange, as the code is fed directly into the LSTM. Pretty strange considering I previously used similar 3-layer LSTM networks where it worked without issues. Currently adding residual connections, like the concat you suggested above around the LSTM and also additive after the postnet like in Taco2, but it still seems to do the same thing. Even more interesting - if I pick a sentence from the training set with a specific speaker and just change a word, it sort of interpolates the whole sentence.
Hmm, I'll try synthesizing from the pre-postnet mel spectra to see if it makes a difference. - Update: nope, sounds a bit different but the speaker information is already lost.
Renamed - good research. Did you use durations extracted by Tacotrons trained separately on each dataset? As for the overfitting - I see the same issues. You mentioned that having a pre-net with heavy dropout did not help, did it? I conducted a lot of unsuccessful experiments, mainly trying various forms of duration prediction, e.g. a separate autoregressive duration predictor (heavily overfitted). Currently I am experimenting with GANs again and got them to work quite well, although the voice quality is not yet better than with the standard L1 loss.
Didn't see much difference with the prenet, more dropout etc. I aligned them all using HTK. I tried one-hot encoding as well as a separate speaker encoder; both resulted in the same behaviour.
I'm considering the differences to the Merlin models I trained previously (https://github.com/CSTR-Edinburgh/merlin/blob/33fa6e65ddb903ed5633ccb66c74d3e7c128667f/src/keras_lib/model.py#L132), where it worked without any issues to just concat a one-hot vector to the input sequence. Those also have separate duration models, but: no convolutions etc., just 3 LSTMs - perhaps the simpler model is the reason. Instead, more linguistic and contextual features like position in word, sentence etc. (more or less the HTS label format plus an additional position index). Also a 5-state subphone model and a 5 ms hop size.
I've considered 3 subphone units but my durations are sometimes just 2-4 frames with the Taco style hop size. Doesn't make much sense to split those. Actually I'm building a model with a smaller hop size atm as well but still in progress.
Just to verify that overfitting issue, it might be interesting to throw out the CBHG and see how it behaves.
Hi,
meanwhile I tried injecting speaker codes at nearly all possible points - initializing the LSTMs, projecting to a similar dimension and then adding, concatenating at all time steps, etc. But the model still seems to ignore them and rather memorizes the speaker identity as a function of the phonetic input. When generating GTA features it therefore gets them all right, but in synthesis it seems to fully ignore the code and picks a voice at random (or, most likely, the voice with the most similar input context?).
Since it's overfitting so much that it memorizes this, I tried radically reducing the network - similar to https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/ where they mention layer sizes of 64 and similar. Both training loss and validation loss were worse, but it still ignored the speaker embeddings.
https://www.semion.io/doc/can-speaker-augmentation-improve-multi-speaker-end-to-end-tts investigated the effect of injecting at different points, usually using a projection to 64 dimensions and then concatenating. No luck with that either.
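The projection-and-concat variant I tried looks essentially like this (a sketch with made-up names, roughly following the 64-dim projection from that paper):

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Project a speaker embedding to a small dimension and concatenate it to
    every encoder time step."""

    def __init__(self, spk_dim: int = 256, proj_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(spk_dim, proj_dim)

    def forward(self, enc_out: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # enc_out: (B, T, C), spk_emb: (B, spk_dim)
        s = self.proj(spk_emb)                              # (B, proj_dim)
        s = s.unsqueeze(1).expand(-1, enc_out.size(1), -1)  # repeat over time: (B, T, proj_dim)
        return torch.cat([enc_out, s], dim=-1)              # (B, T, C + proj_dim)
```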
Cost me some hair the last two weeks ;)
Thanks for sharing! I am idle at the moment (parental leave for 4 weeks), but I am definitely going back to investigating multispeaker after that. Is the data roughly equally distributed? I could imagine that the model tends to turn to the voice with the largest dataset. Plus, it could make sense to look into the contribution of the duration predictor, i.e. try to use separate predictors for separate datasets (or first try to feed the target durations of the respective voices) - the model probably fits the duration distributions as well.
Congratulations, I'm also trying to work between a 4 year old and an almost 1 year old. Challenging ;). Yeah I tried reducing LJ to the 2-3k sentences most other speakers have and also mixed in VCTK to get lots of speakers. Tried one-hot encoding as well as a speaker embedding from an encoder model.
With lots of LJ data it definitely produced lots of LJ. With a more balanced set it seems pretty much every sentence uses a different voice. But always the same voice no matter which input embedding I use. So I guess it learns which text context is spoken by which voice.
Challenging indeed, but congrats! Seems like you tried the same approaches I would go for. To verify, one could look at the activations from the speaker embedding input (or correlations with the output voice identity, which probably takes some time to implement). As for a solution - did you try large input dropout again? Also, it could make sense to regularize using variable input lengths (e.g. just feed random parts of the input sentences). Probably the lowest-hanging fruit would be to increase the dataset with many different speakers to reduce overfitting. I am just starting to investigate multispeaker and will keep you posted!
Hi, so I am finally back and did some multispeaker research. The starting point was this repo: https://github.com/CorentinJ/Real-Time-Voice-Cloning and the corresponding paper https://arxiv.org/pdf/1806.04558.pdf. I implemented a multispeaker Tacotron with the speaker embedding from https://github.com/resemble-ai/Resemblyzer. I started out using the LibriTTS corpus but found it pretty noisy and had problems getting stable attention, then moved to the VCTK corpus. The Tacotron trained ok and also reproduced the voices in inference (with r=3) given a specific speaker embedding (extracted from a sample wav). Attention was a bit shaky but good enough to extract durations for a first ForwardTacotron training. A first implementation is here: https://github.com/as-ideas/ForwardTacotron/blob/multispeaker/models/forward_tacotron.py. It seems to work pretty well already, see the samples: https://drive.google.com/drive/folders/1qr2nFbmdjJX-ZI8LU7CpghxjSXoJUGiS?usp=sharing
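For reference, the speaker embeddings can be extracted with Resemblyzer roughly like this (a minimal sketch, the wrapper function is mine):

```python
from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_embedding(wav_path: str) -> np.ndarray:
    wav = preprocess_wav(Path(wav_path))  # resamples to 16 kHz, normalizes volume, trims long silences
    return encoder.embed_utterance(wav)   # (256,) L2-normalized speaker embedding
```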
Hi, I was originally working with another Taco implementation and also tried CorentinJ's encoder. It also worked quite well regarding speaker identity; the main issue was getting good alignment for all speakers.
But with the forward taco I get the described issues. At first glance my implementation seems quite similar to yours (no wonder, it's just repeat and concat ;)). I'll dig deeper to see if I find any differences. Did you train with a larger batch size or so?
The main differences are that I put the duration model after the CBHG and concat the embeddings before the LR (but they should just be repeated with the rest). EDIT: I wonder if putting the duration model after the CBHG might really affect this, as it's much simpler and more likely to overfit. Although I multiplied the duration loss by 0.1 so it has a weaker effect (I later saw that they did the same in FastPitch, also with 0.1 ;)). Generally I wonder if having it as a separately trained model (e.g. like the prosody model in https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/ ) might make sense, or at least freezing it after some time.
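To be concrete, the 0.1 is just a scalar weight on the duration term of the combined loss - something like this (argument names are hypothetical):

```python
import torch.nn.functional as F

def forward_taco_loss(mel_out, mel_post, mel_target, dur_pred, dur_target, dur_weight=0.1):
    """Combined loss with the duration term down-weighted so it doesn't dominate
    the shared encoder."""
    mel_loss = F.l1_loss(mel_out, mel_target) + F.l1_loss(mel_post, mel_target)
    dur_loss = F.l1_loss(dur_pred, dur_target.float())
    return mel_loss + dur_weight * dur_loss
```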
Yeah, I found that the duration model is hugely overfitting when applied after the CBHG, that could be a problem. You could try to just replicate my implementation and go from there. The stratification by speaker really seems to work well for me. It could also be the dataset; VCTK has about 100 speakers with a roughly equal number of files. I also thought about an independent prosody model, I'll definitely look into it.
Well, while my model is pretty similar, the whole repo is based on the NVIDIA Taco2 repo and is overall vastly different. I've also added the positional index and other stuff that might cause differences, but intuitively I would suspect that the duration model after the CBHG might still have a strong effect on the CBHG...
I'll definitely experiment with different architectures and give you an update!
Thanks. I am now running with the duration model before the CBHG (I am using the FastSpeech one, so this is also a bit different). I've also got a pitch model in there, so that might interfere as well, but I think I saw this issue before I had it. To make sure, I now also put that one before the CBHG. Before that it was quite similar to https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/FastPitch/img/fastpitch_model.png with CBHG instead of the transformer blocks. Subjectively I don't hear a significant difference to that much larger model. In the IBM paper above they used even smaller layers and it sounds pretty good (although just 16 kHz).
Ah cool. I'm also planning on messing with a pitch predictor, the fastpitch samples seem quite convincing. Let me know how it goes!
Oh, it worked quite well without any real issues - which made me wonder all the more why this model ignores my speaker IDs while those work nicely. I'm just at 20k steps with the modifications above, but it also seems to output some random voice; checking the embeddings again, weird.
Update 2: OMG, I think I got it. Such a stupid bug. I checked the embeddings at inference to see if they match the speaker, I checked them in the data loader to see if they match the filename, etc. I checked in forward whether they vary within each batch and whether the repeat does its work correctly. But I did NOT check the collate function :scream:
Such a simple stupid bug and took me longer than that X11 forwarding syscall issue here https://github.com/mozilla/TTS/issues/417#issuecomment-650186085 ;)
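For anyone running into the same thing: the fix is just to take the speaker embeddings from the same sorted batch as everything else in the collate function - schematically (field names are made up):

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

def collate_tts(batch):
    """Pad a batch and keep the speaker embeddings aligned with their utterances.
    The bug above: the rest of the batch was sorted/padded, but the speaker
    embeddings were gathered in the original order, so they no longer matched."""
    # Sort by input length (longest first), as usual for packed RNNs.
    batch = sorted(batch, key=lambda item: len(item['phonemes']), reverse=True)

    phonemes = [torch.as_tensor(item['phonemes']) for item in batch]
    mels = [torch.as_tensor(item['mel']) for item in batch]            # each (n_mels, T)
    spk_embs = torch.stack([torch.as_tensor(item['spk_emb']) for item in batch])

    phonemes = pad_sequence(phonemes, batch_first=True)
    mel_lens = torch.tensor([m.size(-1) for m in mels])
    max_len = int(mel_lens.max())
    mels = torch.stack([F.pad(m, (0, max_len - m.size(-1))) for m in mels])

    return phonemes, mels, mel_lens, spk_embs
```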
:angry:
Retraining now....
Update 3: Working, 3 samples after just 2k steps (like 10 minutes of training) ms.zip
Hi there @cschaefer26, nice project! Regarding https://github.com/as-ideas/ForwardTacotron/issues/7#issuecomment-688695349, while integrating fatchord's tacotron model with https://github.com/CorentinJ/Real-Time-Voice-Cloning (my work is in #472), I've also encountered the same problems you had with LibriTTS, which are mainly caused by the highly inconsistent prosody between speakers. You can get much better results by preprocessing or curating the dataset (either trimming mid-sentence pauses or discarding utterances when that occurs). VCTK works a lot better if you trim the silence at the beginning and end of each file. I can go into more detail if it is helpful.
The baseline tacotron requires very clean data for multispeaker, and even then I'm having trouble producing a decent model. Which is what leads me to your repo. :) I will be trying it out. Keep up the great work!
Hi @m-toman I totally missed your update. Sounds really good, I assume it's MelGAN? I got some OK results using VCTK; as @blue-fish states, the datasets require some good trimming etc. I found this to be really helpful: https://github.com/resemble-ai/Resemblyzer/blob/cf57923d50c9faa7b5f7ea1740f288aa279edbd6/resemblyzer/audio.py#L57
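The linked routine removes long silences with a VAD at 16 kHz; even a simple energy-based start/end trim already helps a lot with VCTK - e.g. something like this (a librosa-based sketch of mine, not from the repo):

```python
from pathlib import Path
import librosa
import soundfile as sf

def trim_dataset(in_dir: str, out_dir: str, top_db: int = 30):
    """Trim leading/trailing silence from every wav; the linked Resemblyzer
    routine additionally removes long mid-utterance silences with a VAD."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in Path(in_dir).glob('**/*.wav'):
        wav, sr = librosa.load(wav_path, sr=None)              # keep the original sample rate
        trimmed, _ = librosa.effects.trim(wav, top_db=top_db)  # energy-based trim
        sf.write(out / wav_path.name, trimmed, sr)
```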
Any updates? We are also looking into adding GST. The main problem I have right now is that I would need really clean German datasets to benefit from transfer learning for our use case. I also looked into other open source repos and tested Taco2 etc. but found that they don't really perform much better.
Currently I am looking into some different preprocessing options, e.g. mean-variance scaling, to improve the voice quality.
Hi. I have mixed in VCTK (also with this trimming ;)) but I felt that the larger speakers I added lost a bit of prosody / felt flatter than when trained individually. I wanted to investigate further but haven't gotten to it yet. Yeah, it's MelGAN.
Yeah, it's interesting. Since I can't really believe it, I regularly compare against Taco2 and other more complex methods out there (https://github.com/tugstugi/dl-colab-notebooks), but neither attention nor autoregression really seems to make a significant difference.
Regarding styles, I would have considered something like the simple method presented in DurIAN, which is mostly just a style embedding. I also read the Flowtron paper again and thought about wrapping the whole model in such a flow formulation, but after listening to the samples again I felt it's probably not worth it versus just playing with the pitch predictor I've got (it might also be possible to predict and sample F0 from a Gaussian, where you could then play with the variance).
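What I mean by sampling F0 from a Gaussian is roughly this kind of predictor - a sketch, not my actual code; it would be trained with a Gaussian NLL against extracted F0, and the temperature controls how much prosodic variation you sample at inference:

```python
import torch
import torch.nn as nn

class GaussianPitchPredictor(nn.Module):
    """Predicts a per-step F0 mean and log-variance; at inference you can sample
    with a temperature to trade off liveliness vs. stability."""

    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=3, padding=1),  # channels: [mu, log_var]
        )

    def forward(self, x: torch.Tensor, temperature: float = 0.0) -> torch.Tensor:
        # x: (B, T, C) encoder outputs -> (B, T) predicted (or sampled) F0
        mu, log_var = self.net(x.transpose(1, 2)).chunk(2, dim=1)
        if temperature > 0.0:
            std = torch.exp(0.5 * log_var)
            return (mu + temperature * std * torch.randn_like(std)).squeeze(1)
        return mu.squeeze(1)
```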
I would have to read the GST paper again, but I felt the control is a bit too random, if I remember correctly? In the sense that the tokens are hard to interpret and probably different with each run?
Yeah exactly, although they show some impressive results absorbing background noise into the tokens. I would probably think that pitch prediction is the lowest hanging fruit of them all...
I feel we're getting to a similar state of saturation as we had before deep learning entered the speech synthesis field. The HMM-based methods became so loaded with more and more tricks and features that the complexity was insane. The training script I used during my PhD consisted of, I think, 120 separate steps in the end, each calling HTS tools with dozens of command line parameters and additional script files ;). Recently there have been so many approaches to make attention work better for this use case, like the monotonic methods that force it to either take a step or stay in the current state and only attend to a single input, with lots of weird tricks to make it differentiable etc. At that point it's so far from the origins that it seems awkward to even use attention. The seq2seq AR approach also means dependence on the stop token prediction (whoever did not end up with 30 seconds of garbled speech, please raise their hand ;)). https://arxiv.org/abs/1909.01145 was an interesting paper, but it's yet again another rather complicated workaround for the issues introduced by AR, besides scheduled sampling/curriculum learning (which introduces new robustness issues) and gradually decreasing r and stopping at r=2 (although that works quite well) to keep it from predicting from the previous samples while ignoring the conditioning information.
I admit I wasn't brave enough to just try what you did and throw all that stuff out. The thousand people at Google would certainly have done that, right? :) Enough ranting - curious what you will achieve with GST, I'll play with multispeaker further soon.
Good rant though! The more I test the autoregressive stuff, the less impressed I am. It's basically not usable for us in production (we are trying to synth long German politics articles). The forward models are pretty robust though. I wish I could get rid of the AR model for extracting durations; we experimented with the aligner module from Google's EATS, didn't work. Extracting with an STT model worked but the quality was worse. Today I spent the whole day debugging why my forward model all of a sudden sucked badly and found that the tacotron alignments were shifted - somehow I got unlucky, increasing the batch size solved this. When I started with TTS I wondered why people got so interested in these attention plots, now I know - watching a tacotron giving birth to attention is one of my good moments :D .. Honestly, the forward models seem to be SOTA now and are probably used in production by Microsoft, AWS, Google...
My samples above used alignment via HTS and I didn't notice a difference to the Taco attention ones, using these scripts: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/scripts/alignment/state_align. Just a bit annoying to set up.
Think it's mostly Google that still clings to it. https://arxiv.org/abs/1910.10288
Yeah was astonished to see that Springer does TTS ;).
My ex-colleagues recorded this corpus: https://speech.kfs.oeaw.ac.at/mmascs/ Unfortunately it's too small for deep learning stuff (it was fine for HMM-based synthesis), but it's good quality and, with its different speaking rates, might be useful for the duration model. We did lots of mocap recordings back then, it was fun ;)
Edit: and obviously it's Austrian German (Vorarlberg in this case ;)). Here are some fun dialect interpolation samples: http://mtoman.neuratec.com/thesis/interpolation/
Cool, I actually just found a glitch in the duration extraction of my STT models and it seems fine now. I'll probably release that, as it's cumbersome to train a tacotron just for duration extraction. Good stuff! I'll keep you updated on how it goes with pitch prediction, multispeaker etc.
Hi,
great work! I saw your autoregression branch and wanted to ask if it worked out? I've always wondered how big the effect of the autoregression really is (apart from the formal aspect that the model then is an autoregressive, generative model P(x_i | x_<i)), considering there are RNNs in the network anyway.
Also, wanted to point you to this paper in case you don't know it yet: https://tencent-ailab.github.io/durian/
They use, similarly to older models like those in https://github.com/CSTR-Edinburgh/merlin, an additional value in the expanded vectors to indicate the position within the current input symbol. I wonder if that would help a bit with prosody.