cassiotbatista opened 1 year ago
Both Coqui and ESPnet have been a pain so far, the former more than the latter.
Coqui can generate alignments externally with a Tacotron model, as in FastSpeech v1, but the default behaviour is to train an alignment head end to end (ref?). Besides, its character utils seem to have moved on while the script that computes attention masks has not, and I think the latter is importing outdated modules. The plan was to take a look at what kind of alignments Coqui produces with Tacotron, so I could later reproduce them in the same format with MFA, but right now I can't get any of it to work.
ESPnet supports MFA, but I'm having trouble with the MFA server's Postgres connection (???). Right now it seems like my best option, because the problem appears to be on MFA's side rather than ESPnet's, which should (hopefully) be easier to solve.
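For the record, when the managed Postgres instance gets into a bad state, resetting MFA's server is the first thing I'd try. A hedged CLI fragment, assuming MFA 2.x with its bundled PostgreSQL server; subcommand names and flags may differ across MFA versions, and the paths in the last line are placeholders:

```shell
# Hedged sketch (MFA 2.x assumed): tear down and re-create the
# PostgreSQL cluster that MFA manages internally.
mfa server stop || true   # stop any half-initialized server
mfa server delete         # wipe MFA's managed Postgres cluster
mfa server init           # re-initialize it from scratch
mfa server start

# Then retry alignment (placeholder arguments, not from the thread):
# mfa align CORPUS_DIR DICTIONARY ACOUSTIC_MODEL OUTPUT_DIR
```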
There were complaints that the m/f dataset was too small to draw conclusions from. The idea now is to train a phoneme-based TTS such as FastSpeech 2 using two different forced aligners, then compare the synthetic and original voices with some similarity metric (e.g., PESQ). Training and test data would come from FalaBrasil's Constituição dataset.
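The evaluation loop itself is simple: average a per-utterance similarity score over (original, synthetic) pairs. A minimal stdlib-only sketch of that loop follows; the SNR stand-in metric, the function names, and the pair structure are my assumptions, not anything from this thread. For actual PESQ scores you would swap in `pesq.pesq()` from the `pesq` PyPI package instead of the stand-in.

```python
import math

def snr_db(ref, deg):
    """SNR in dB between equal-length reference and degraded signals.
    Stand-in for PESQ in this sketch; a real run would call
    pesq.pesq() from the 'pesq' package here instead."""
    assert len(ref) == len(deg), "signals must be time-aligned"
    sig = sum(x * x for x in ref)
    noise = sum((x - y) ** 2 for x, y in zip(ref, deg))
    if noise == 0.0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(sig / noise)

def average_similarity(pairs, metric=snr_db):
    """Average the metric over (original, synthetic) waveform pairs,
    one pair per test utterance."""
    scores = [metric(ref, deg) for ref, deg in pairs]
    return sum(scores) / len(scores)
```

Running this once per aligner (MFA vs. Coqui/Tacotron alignments) on the Constituição test set would give one score per aligner to compare.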