Kyubyong / dc_tts

A TensorFlow Implementation of DC-TTS: yet another text-to-speech model
Apache License 2.0

Horizontal Attention plot at synthesis #27

Open noetits opened 6 years ago

noetits commented 6 years ago

If you try, during synthesis, to save and plot the attention computed with the model pretrained on LJ Speech, for example, it looks like this: (attached attention plot, alignment_3)

Why is it horizontal, and not diagonal as during training? The synthesis works just fine, though...
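For reference, this is roughly how I dump the alignment at synthesis time. It is only a sketch: it assumes you also fetch the attention tensor (shape (B, T/r, N) in this repo) in the sess.run call, and the variable names in the usage comment are illustrative, not the repo's exact ones:

    import matplotlib
    matplotlib.use("Agg")  # write image files only; no display needed
    import matplotlib.pyplot as plt

    def plot_alignment(alignment, path):
        """Save an attention matrix (rows = encoder steps N, cols = decoder steps T/r) as an image."""
        fig, ax = plt.subplots()
        im = ax.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
        fig.colorbar(im, ax=ax)
        ax.set_xlabel("decoder timestep (T/r)")
        ax.set_ylabel("encoder timestep (N)")
        fig.savefig(path, dpi=150)
        plt.close(fig)

    # e.g. after fetching the alignment for the first batch item, shape (T/r, N):
    # plot_alignment(al[0].T, "alignment_%d.png" % i)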

If I comment, in "networks.py", in the function "Attention" the part corresponding to "monotonic attention" like this:

    A = tf.matmul(Q, K, transpose_b=True) * tf.rsqrt(tf.to_float(hp.d))
    # if mononotic_attention:  # for inference
    #     key_masks = tf.sequence_mask(prev_max_attentions, hp.max_N)
    #     reverse_masks = tf.sequence_mask(hp.max_N - hp.attention_win_size - prev_max_attentions, hp.max_N)[:, ::-1]
    #     masks = tf.logical_or(key_masks, reverse_masks)
    #     masks = tf.tile(tf.expand_dims(masks, 1), [1, hp.max_T, 1])
    #     paddings = tf.ones_like(A) * (-2 ** 32 + 1)  # (B, T/r, N)
    #     A = tf.where(tf.equal(masks, False), A, paddings)
    A = tf.nn.softmax(A) # (B, T/r, N)
    max_attentions = tf.argmax(A, -1)  # (B, T/r)
    R = tf.matmul(A, V)
    R = tf.concat((R, Q), -1)

The attention plot is then diagonal and the synthesis is not too bad, but it exhibits the problem mentioned in the paper: it may skip letters or repeat parts of words.
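For anyone wondering what the commented-out block does: it implements the paper's forced incremental attention, restricting each decoder step's attention to a small window starting at the previously attended character (hp.attention_win_size positions wide). Below is a rough NumPy sketch of that masking idea for a single decoder step; it is purely illustrative, not the repo's exact code, and the window size of 3 is an assumption:

    import numpy as np

    def windowed_softmax(scores, prev_max, win=3):
        """scores: (N,) attention logits for one decoder step.
        prev_max: index of the character attended at the previous step.
        Only positions in [prev_max, prev_max + win) are allowed."""
        masked = np.full_like(scores, -2.0 ** 32 + 1)  # same padding value as the repo
        lo, hi = prev_max, min(prev_max + win, len(scores))
        masked[lo:hi] = scores[lo:hi]
        e = np.exp(masked - masked.max())
        return e / e.sum()

    scores = np.random.randn(30).astype(np.float32)
    a = windowed_softmax(scores, prev_max=5)
    print(a.argmax())  # always falls inside the allowed window [5, 8)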

inventor617 commented 1 year ago

@noetits Hi, how did you solve the problem? Which Python version and GPU did you use? I have the same problem.