allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

sentence and paragraph prediction for hotpotqa #142

Open Fan-Luo opened 3 years ago

Fan-Luo commented 3 years ago

Hi

On page 14 of the paper,

For evidence extraction we apply 2 layer feedforward networks on top of the representations corresponding to sentence and paragraph tokens to get the corresponding evidence prediction scores and use binary cross entropy loss to train the model.

Is the 2-layer feedforward network something like:

torch.nn.Sequential(
    torch.nn.Linear(self.model.config.hidden_size, self.model.config.hidden_size),
    torch.nn.ReLU(),
    torch.nn.Linear(self.model.config.hidden_size, 1),   # one evidence logit per sentence/paragraph token
)
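
(For concreteness, here is a minimal runnable sketch of training such a head with binary cross entropy, as the paper describes; the hidden size, marker positions, and labels below are made up for illustration and are not from the repo:)

import torch

hidden_size = 768
head = torch.nn.Sequential(
    torch.nn.Linear(hidden_size, hidden_size),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_size, 1),
)

token_reps = torch.randn(4096, hidden_size)    # encoder output for one example
marker_positions = torch.tensor([5, 40, 97])   # hypothetical sentence-token indices
gold = torch.tensor([1.0, 0.0, 1.0])           # 1 = supporting sentence

logits = head(token_reps[marker_positions]).squeeze(-1)   # one logit per sentence
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, gold)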

Besides, do you predict sentences and paragraphs separately, or in a nested way? The reason I ask is that gold sentences are always inside gold paragraphs. When predicting and computing the support F1, does a sentence have to fall inside a positively predicted paragraph to count as predicted positive? And do you use a hard threshold for both sentence and paragraph prediction?

Thank you

armancohan commented 3 years ago

Yes, the two-layer FF looks exactly like what you have. (Though we found GELU to work better than ReLU.)

At training time we don't specify any threshold for the number of paragraphs or sentences. At inference time, we have a constrained decoding method that ensures the sentences come from exactly two paragraphs, a property of this dataset. This is similar to https://arxiv.org/pdf/2004.06753.pdf (see section 3.3).
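
(A rough sketch of what such constrained decoding could look like, assuming one sigmoid logit per sentence and a paragraph id for each sentence; this is an illustration, not the repo's actual implementation:)

from itertools import combinations

def decode_support(sentence_scores, sentence_paragraphs, threshold=0.0):
    # Enumerate all paragraph pairs and, within each pair, keep sentences
    # whose logit exceeds the threshold; return the highest-scoring set.
    # A fuller version would also force at least one sentence from each
    # of the two chosen paragraphs.
    best_score, best_set = float("-inf"), []
    for p1, p2 in combinations(sorted(set(sentence_paragraphs)), 2):
        chosen = [i for i, (s, p) in enumerate(zip(sentence_scores, sentence_paragraphs))
                  if p in (p1, p2) and s > threshold]
        total = sum(sentence_scores[i] for i in chosen)
        if chosen and total > best_score:
            best_score, best_set = total, chosen
    return best_set

# Sentences 0-1 in paragraph 0, sentence 2 in paragraph 1, sentence 3 in paragraph 2.
print(decode_support([2.3, -0.7, 1.1, 0.4], [0, 0, 1, 2]))   # -> [0, 2]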

Fan-Luo commented 3 years ago

Thank you for your reply.

At training time, you compute the sentence and paragraph losses independently, right? At inference time, according to the paper "A Simple Yet Strong Pipeline for HotpotQA":

We define the score n(S) of a set of sentences S ⊂ D to be the sum of the individual sentence scores. In HotpotQA, supporting sentences always come from exactly two paragraphs. We compute this score for all possible S satisfying this constraint and take the highest scoring set of sentences as our support.

My understanding is that this paper uses the sum of the sentence logit scores as the paragraph score. I wonder: do you use the logit score from the <t> token, instead of the sum, as the paragraph score, and then choose the top 2 as the predicted paragraphs?
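
(To make the two options concrete, a small sketch contrasting them; all numbers are made up:)

import torch

sentence_logits = torch.tensor([2.3, -0.7, 1.1, 0.4, -1.2])
sentence_to_paragraph = torch.tensor([0, 0, 1, 1, 2])
t_token_logits = torch.tensor([1.9, 0.8, -0.5])   # one score per paragraph's <t> token

# Option 1 (as in "A Simple Yet Strong Pipeline for HotpotQA"):
# paragraph score = sum of its sentence logits.
summed = torch.zeros(3).index_add_(0, sentence_to_paragraph, sentence_logits)

# Option 2 (what this question asks about): score each paragraph by its <t> logit.
print(torch.topk(summed, k=2).indices.tolist())          # top-2 by summed scores
print(torch.topk(t_token_logits, k=2).indices.tolist())  # top-2 by <t> scores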

For decoding the sentence prediction, do you take the highest-scoring set of sentences from the top-2 predicted paragraphs? How do you define the highest-scoring set: by a threshold on the number of sentences, or by a score threshold? Would you mind sharing the threshold you chose?

For decoding the answer prediction: you mentioned that yes/no/null is appended to the end of the input sequence, which converts a yes/no answer into a span in the context. But what if the predicted answer is an actual span in the context while the predicted question type is one of yes/no/null? Or what if the predicted answer is yes/no/null (or even "yes no" or "no null") while the predicted question type is span?
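
(To illustrate the mechanism being asked about, a sketch of span decoding when yes/no/null tokens are appended to the input; one reading of this setup is that there is no separate question-type head at all, since the predicted span itself encodes the type. The positions and names below are made up:)

import torch

seq_len = 4096
special = {seq_len - 3: "yes", seq_len - 2: "no", seq_len - 1: "null"}  # appended tokens

start_logits = torch.randn(seq_len)
end_logits = torch.randn(seq_len)
start = int(start_logits.argmax())
end = start + int(end_logits[start:].argmax())   # end must not precede start

if start in special:
    answer = special[start]   # yes / no / unanswerable, decoded as a one-token "span"
else:
    answer = f"context span [{start}, {end}]"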

Happy Thanksgiving!