THUDM / OAG-BERT

A heterogeneous entity-augmented academic language model based on Open Academic Graph (OAG)
MIT License

Some questions about the pre-training masking strategy #7

Open biandh opened 1 year ago

biandh commented 1 year ago

In the "Span-aware entity masking" section of the paper, it is mentioned that "If the sampled length is less than the entity length, we will only mask out the entity. For text contents and entity contents, we mask 15% of the tokens for each respectively."

I have two points of confusion here.

First: "If the sampled length is less than the entity length, we will only mask out the entity." I can't understand the meaning of this sentence. Assuming Geo(p) == 6 and entity_len == 7, does this mean mask_len == 7? And when Geo(p) == 6 and entity_len == 5, what happens? Could you help with an example?

Second: "we mask 15% of the tokens for each respectively." For entities, I am confused: does this mean choosing 15% of the tokens within each entity, or choosing 15% of the mask across all entities? Combined with the first question, how is the 15% ratio guaranteed?

Looking forward to your reply.

biandh commented 1 year ago

In the "Span-aware entity masking" section: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?

Looking forward to your reply.

Somefive commented 1 year ago

Second: "we mask 15% of the tokens for each respectively", for entity, I am very confused, this is to choose 15% of the tokens for each entity OR choose 15% of the mask for all entities? Combined with the first question, here is how to guarantee a 15% probability?

"choose 15% of the mask for all entities". The 15% probability is not strictly enforced. We select one entity randomly, mask it, check if the total number of masked tokens reaches 15% of all tokens. If not reached, repeat this process.

Somefive commented 1 year ago

First: "If the sampled length is less than the entity length, we will only mask out the entity." I can't understand the meaning of this sentence, assuming Geo(p) == 6 and entity_len == 7, here it means mask_len == 7 ? but when Geo(p) == 6 and entity_len == 5, what to do? Can you help with an example?

For example, if the sentence length is 100, we want to mask 15 tokens. If we randomly picked one entity that has 17 tokens, we will mask all 17 tokens even if it is longer than 15.
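To tie the two answers above together, here is a minimal sketch of such a budget-based entity masking loop. The function and variable names are my own assumptions for illustration; this is not the actual OAG-BERT implementation.

```python
import random

MASK = "[MASK]"

def span_aware_entity_mask(tokens, entity_spans, mask_ratio=0.15):
    """Illustrative sketch: mask whole entities until ~15% of tokens are masked.

    tokens:       list of token strings
    entity_spans: list of (start, end) index pairs, one per entity
    """
    budget = int(len(tokens) * mask_ratio)  # e.g. 100 tokens -> budget of 15
    masked = 0
    candidates = entity_spans[:]
    random.shuffle(candidates)
    out = tokens[:]

    for start, end in candidates:
        if masked >= budget:
            break
        # Mask the whole entity, even if it overshoots the budget
        # (e.g. a 17-token entity is fully masked although the budget is 15).
        for i in range(start, end):
            out[i] = MASK
        masked += end - start
    return out
```

With 100 tokens and a 17-token entity picked first, all 17 tokens get masked and the loop stops, matching the example above.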

Somefive commented 1 year ago

> In the "Span-aware entity masking" section: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?
>
> Looking forward to your reply.

The loss computation just follows other masked language models. Only the masking strategy is customized.
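For context, "follows other masked language models" presumably means the usual cross-entropy computed only over the masked positions. A minimal PyTorch-style sketch of that loss (illustrative, not the OAG-BERT code):

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Standard masked-LM loss: cross-entropy on masked positions only.

    logits: (batch, seq_len, vocab_size) model outputs
    labels: (batch, seq_len) original token ids at masked positions,
            and -100 everywhere else (the conventional ignore index).
    """
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # non-masked positions contribute no loss
    )
```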

biandh commented 1 year ago

> In the "Span-aware entity masking" section: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed? Looking forward to your reply.

> The loss computation just follows other masked language models. Only the masking strategy is customized.

Thank you very much for the help. What I really wanted to ask is: do you only use MLM in the loss calculation, or do you also use SpanBERT's Span Boundary Objective (SBO) loss?

Looking forward to your reply.

biandh commented 1 year ago

When reading the code implementation of title generation, I found that the decoding/generation strategy is different from the one used for FOS. It looks more like a Prefix LM approach. I don't quite understand why the same generation strategy as for FOS is not used. Can you share the reason here? Thanks.

Somefive commented 1 year ago

> Thank you very much for the help. What I really wanted to ask is: do you only use MLM in the loss calculation, or do you also use SpanBERT's Span Boundary Objective (SBO) loss?
>
> Looking forward to your reply.

No, we only use MLM. It would be possible to use SpanBERT's SBO loss; it might help the model learn span information more efficiently.
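For readers unfamiliar with SBO: it predicts each masked token from the hidden states of the two span-boundary tokens plus a relative position embedding (Joshi et al., 2020). A rough PyTorch sketch of such a head, which OAG-BERT does not use per the answer above, might look like this (names and layer sizes are my own assumptions):

```python
import torch
import torch.nn as nn

class SpanBoundaryObjective(nn.Module):
    """Sketch of a SpanBERT-style SBO head (not part of OAG-BERT).

    Each masked token inside a span is predicted from the hidden states of
    the two boundary tokens (just outside the span) plus a relative
    position embedding; the logits feed the usual cross-entropy loss.
    """
    def __init__(self, hidden, vocab_size, max_span_len=16):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, left, right, rel_pos):
        # left, right: (num_masked, hidden) boundary hidden states
        # rel_pos:     (num_masked,) position of each token inside its span
        h = torch.cat([left, right, self.pos_emb(rel_pos)], dim=-1)
        return self.mlp(h)
```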

Somefive commented 1 year ago

> When reading the code implementation of title generation, I found that the decoding/generation strategy is different from the one used for FOS. It looks more like a Prefix LM approach. I don't quite understand why the same generation strategy as for FOS is not used. Can you share the reason here? Thanks.

We have tried various ways to train and run inference. I suppose you read the code in cogdl? The code there is not totally equivalent to the strategy in the paper, since several updates were made afterwards, but the general ideas are the same. The code in cogdl is mostly for inference rather than training (if I remember correctly and no further updates were made). The MLM loss was our first attempt at learning entity information, targeting comprehension. However, for language generation tasks, the so-called "Prefix LM" approach is more helpful for generating sequences, in terms of both efficiency and quality.

We actually tried GLM and other advanced masking strategies to train the model and obtain parameters that are better suited to sequence generation tasks.

As far as I know, if your downstream tasks are mainly comprehension tasks, such as cloze tasks, training with MLM should work well. But I recall there is research indicating that GPT-style training can also achieve good results. For sequence generation, pure MLM training is somewhat harder, I think.
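To illustrate what "Prefix LM"-style decoding with a masked LM can look like in practice, here is a rough sketch that repeatedly appends a [MASK] token after the prompt and fills it in greedily. It uses a generic HuggingFace BERT checkpoint as a stand-in and is not equivalent to the cogdl implementation; it only shows the decoding mechanism.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Generic stand-in checkpoint for illustration; the real OAG-BERT weights
# are loaded through cogdl, not through this HuggingFace name.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def generate_greedy(prefix, max_new_tokens=16):
    """Prefix-LM-style decoding with a masked LM: append one [MASK] token
    after the current sequence, predict it, and repeat."""
    ids = tokenizer.encode(prefix, add_special_tokens=False, return_tensors="pt")
    for _ in range(max_new_tokens):
        # Append one [MASK] position to be predicted next.
        with_mask = torch.cat(
            [ids, torch.tensor([[tokenizer.mask_token_id]])], dim=1
        )
        with torch.no_grad():
            logits = model(input_ids=with_mask).logits
        next_id = int(logits[0, -1].argmax())
        if next_id == tokenizer.sep_token_id:
            break
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return tokenizer.decode(ids[0])

print(generate_greedy("oag-bert is a heterogeneous entity-augmented"))
```

A plain BERT checkpoint is not trained for this left-to-right usage, so the output quality will be poor; the point is only to show how a masked LM can be driven as a prefix-conditioned generator.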