The decoder shouldn't be able to see future tokens in the sequence.
The current implementation stores a `words` variable in the `MultiHeadAttention` class that must be mutated after each token prediction. There should be a cleaner way to do this; perhaps the variable should live in the `Transformer` class and be passed to `MultiHeadAttention` as a function parameter instead of being held as mutable state.
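One stateless alternative is to derive a causal mask from the sequence length inside each forward call, so no per-step variable needs updating at all. The sketch below is a minimal illustration using numpy, not the actual implementation; the function and variable names (`causal_mask`, `masked_attention`) are hypothetical:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # True above the diagonal marks future positions to hide.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_attention(q, k, v, mask):
    # Scaled dot-product attention; masked scores are pushed to -inf
    # before softmax so future tokens receive zero weight.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
out, w = masked_attention(x, x, x, causal_mask(5))
# Each position attends only to itself and earlier positions.
assert np.allclose(np.triu(w, k=1), 0.0)
```

Because the mask is recomputed from the input shape, `MultiHeadAttention` would need no stored `words` variable; the caller (e.g. the `Transformer` class) simply passes the mask in with each call.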