The BERT encoder tries to minimize the negative log-likelihood between y and ŷ. In this case, ŷ is the ground-truth response for each input x, and y is the response predicted by the BERT encoder model. Is that right?
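For reference, here is a minimal sketch of what such a negative log-likelihood objective could look like with a BERT encoder. This is only my assumption of the setup (a bi-encoder scoring candidate responses with in-batch negatives, mean pooling, and a `bert-base-uncased` checkpoint); the names and batching are illustrative, not the authors' exact method.

```python
# Sketch only: one possible NLL / cross-entropy objective for a
# retrieval-style (encoder-only) phase. Model choice, pooling, and
# in-batch negatives are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    # Mean-pool BERT's last hidden states into one vector per text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, H)

contexts = ["I failed my exam today.", "I just got a new puppy!"]
responses = ["That sounds really tough, I'm sorry.", "That's wonderful news!"]

ctx_vec = encode(contexts)     # x  -> encoder
resp_vec = encode(responses)   # y_hat: ground-truth responses

# Score every context against every candidate response; the correct
# candidate for context i is response i (in-batch negatives).
scores = ctx_vec @ resp_vec.T                 # (B, B)
labels = torch.arange(len(contexts))
loss = F.cross_entropy(scores, labels)        # mean negative log-likelihood
print(loss.item())
```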
And the other phase is generation-based. I marked it as a BERT decoder, because BERT doesn't have a decoder of its own, so do we train a Transformer as a decoder to produce a sentence from the BERT encoder's output?
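As a hedged illustration of that second phase, the sketch below pairs a BERT encoder with a Transformer decoder via Hugging Face's `EncoderDecoderModel`, which warm-starts a BERT-shaped decoder with cross-attention over the encoder outputs. This is just one way to realize "BERT encoder + trained decoder"; I am not claiming it is the exact architecture used here.

```python
# Sketch only: one way to put a Transformer decoder on top of a BERT
# encoder for generation, using Hugging Face's EncoderDecoderModel.
# The checkpoint and training details are assumptions for illustration.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Encoder is plain BERT; the decoder is a BERT-shaped stack adapted into
# a causal decoder with cross-attention over the encoder outputs.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Training step: the decoder is trained with cross-entropy (NLL) on the
# ground-truth response tokens while attending to the encoder's output.
inputs = tokenizer("I failed my exam today.", return_tensors="pt")
labels = tokenizer("That sounds really tough.", return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss

# Inference: generate a sentence from the BERT encoder's output.
generated = model.generate(inputs.input_ids, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```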
I also mentioned before that the Transformer has many architectures now (e.g. on Hugging Face), so it is confusing for anyone who comes to this method.
I hope you can answer these questions.