ZhangShiyue / QGforQA


A quick question on the use of BERT Embeddings #11

Open ByrdOfAFeather opened 4 years ago

ByrdOfAFeather commented 4 years ago

I have a quick question about how BERT embeddings are used in this model. Based on this line, it seems that questions are only generated if there is a ground-truth question to get the [CLS] token from BERT. Is this what is going on behind the scenes? I'm currently trying to implement a similar model in PyTorch, and I'm running into an issue about what to use as the first input to the decoder; it centers on the highly contextualized nature of BERT.

Thanks, Matthew.

ZhangShiyue commented 3 years ago

Hi, I think your understanding is correct. I used the [CLS] token from the ground-truth question as the start input to the decoder to generate a new question.

ByrdOfAFeather commented 3 years ago

Thanks for the response - wouldn't this mean the model has a very difficult time generalizing? What do you do in the case where the model doesn't have a ground-truth question to get the [CLS] token from?

ZhangShiyue commented 3 years ago

No. You can replace the ground-truth question with any sentence that starts with [CLS] (for example, [CLS] [PAD] [PAD] ...). It doesn't matter: [CLS] is just the start token used to initialize generation, and this setting is commonly used in many generation models.
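
To make this concrete, here is a minimal PyTorch sketch of that idea, using the HuggingFace transformers API rather than this repo's TensorFlow code; the model name, `max_q_len`, and the commented-out decoder call are placeholders, not part of the original implementation:

```python
# Sketch (assumptions noted above): feed a "[CLS] [PAD] [PAD] ..." dummy
# question through BERT and take its [CLS] output as the decoder's start
# input, so no ground-truth question is needed at inference time.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

max_q_len = 32  # assumed maximum question length
input_ids = torch.full((1, max_q_len), tokenizer.pad_token_id, dtype=torch.long)
input_ids[0, 0] = tokenizer.cls_token_id
attention_mask = (input_ids != tokenizer.pad_token_id).long()

with torch.no_grad():
    outputs = bert(input_ids=input_ids, attention_mask=attention_mask)

# The hidden state at position 0 is the [CLS] vector used to start generation.
cls_start = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)
# decoder.generate(start_embedding=cls_start, ...)  # placeholder decoder call
```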

ByrdOfAFeather commented 3 years ago

I see. One concern, though: when trying to implement this myself, I noticed that performance varied significantly between train and test time. Typically, I would just use the [CLS] token from the input (index 101, which marks the start of a sentence for BERT, at least in the PyTorch implementation). You can see this here; note that this code is from a previous commit, and I later changed the model to use GloVe embeddings (there are some other minor errors in the code at that point as well).
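
For reference, a quick check (assuming the HuggingFace bert-base-uncased vocabulary) that index 101 really is the [CLS] token mentioned above:

```python
# Confirm that id 101 maps to [CLS] in bert-base-uncased.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.cls_token_id)                # 101
print(tokenizer.convert_ids_to_tokens(101))  # [CLS]
```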

I noticed that the model did fine during training but could not generalize at inference time. I believe this came down to the fact that the [CLS] token is a sentence representation that BERT outputs, not just a start token. The model seemed able to map [CLS] vectors coming from real sentences back to the full sentences they came from, but it did not seem able to generate questions based on the context.

I take it that this was not seen during testing. That's interesting. I don't see why a sentence embedding computed over [PAD] tokens would be all that different from an embedding of just [CLS], and thus why it would be similar to what the model sees during training time.
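
One way to probe this (an illustrative sketch assuming bert-base-uncased and an arbitrary example question; not code from either implementation) is to compare the contextual [CLS] vector of a real question against the [CLS] vector of a "[CLS] [PAD] [PAD] ..." dummy of the same length:

```python
# Compare BERT's contextual [CLS] vector for a real question with the
# [CLS] vector of a padded dummy; a low similarity would suggest a
# train/test mismatch when only ground-truth [CLS] inputs are seen in training.
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

real = tokenizer("what is the capital of france ?", return_tensors="pt")
seq_len = real["input_ids"].shape[1]

dummy_ids = torch.full((1, seq_len), tokenizer.pad_token_id, dtype=torch.long)
dummy_ids[0, 0] = tokenizer.cls_token_id
dummy_mask = (dummy_ids != tokenizer.pad_token_id).long()

with torch.no_grad():
    cls_real = bert(**real).last_hidden_state[:, 0, :]
    cls_dummy = bert(input_ids=dummy_ids,
                     attention_mask=dummy_mask).last_hidden_state[:, 0, :]

print(F.cosine_similarity(cls_real, cls_dummy).item())
```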

Thanks for the answer! I'll leave it up to you to close this, or to share any further insights.