Open BillXuce opened 3 years ago
Unfortunately there are a few places in the repository where the code conflicts with the paper. I'm assuming that when the authors say they "dropped" the query portion, they mean that the start/end label masks are applied when the loss is calculated, so the query positions never contribute to it. I don't know for sure; it's just my assumption.
According to Section 3.3.1 of the paper, the input to BERT consists of the query and the context, whose combined length should be
seq_len = n + m + 2
and the output drops the representations of the query and the special tokens. However, in `bert_query_ner.py` (lines 44 and 45), the shape of `sequence_heatmap` coming from the BERT output is `[batch_size, seq_len, hidden_size]`, which conflicts with the paper. So which method should be applied, and how much difference is there between the two in performance?
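If the masking interpretation above is right, the two approaches can be equivalent: keeping the full `[batch_size, seq_len, hidden_size]` tensor but zeroing out the query/special-token positions in the loss gives the same gradient signal as slicing them off. Below is a minimal sketch of that idea; the variable names (`start_label_mask`, the assumed query length of 4 positions) are illustrative, not taken from the repo.

```python
import torch
import torch.nn as nn

# Sketch, not the repo's actual code: sequence_heatmap from BERT keeps
# the full length seq_len = n + m + 2 ([CLS] + query + [SEP] + context),
# and the query portion is "dropped" by masking the per-token loss
# rather than by slicing the tensor.

batch_size, seq_len, hidden_size = 2, 10, 8
sequence_heatmap = torch.randn(batch_size, seq_len, hidden_size)

# per-token start logits, shape [batch_size, seq_len]
start_head = nn.Linear(hidden_size, 1)
start_logits = start_head(sequence_heatmap).squeeze(-1)

# dummy binary start labels, shape [batch_size, seq_len]
start_labels = torch.randint(0, 2, (batch_size, seq_len)).float()

# label mask: 1 for context tokens, 0 for query and special tokens
# (here the first 4 positions stand in for [CLS] + query + [SEP])
start_label_mask = torch.ones(batch_size, seq_len)
start_label_mask[:, :4] = 0.0

loss_fn = nn.BCEWithLogitsLoss(reduction="none")
per_token_loss = loss_fn(start_logits, start_labels)  # [batch_size, seq_len]

# average only over unmasked (context) positions
masked_loss = (per_token_loss * start_label_mask).sum() / start_label_mask.sum()
```

With this setup the query representations still flow through BERT (which the paper's Section 3.3.1 wording allows, since the query conditions the context encoding via self-attention), but they get no supervision, so performance should match the "drop the query output" reading up to the extra compute in the output heads.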