Closed salman1993 closed 4 years ago
@salman1993 Nice fix. So, all models trained using sent_rep_tokens as the pooling mode should still be correct. Those trained using mean_tokens or max_tokens have incorrect results because of this bug: the last sentence was essentially treated as padding. Furthermore, I have to regenerate all the data for all extractive datasets in order to add the last sentence length and fix the token_type_ids. I have an implementation of #27 that is nearly done, so when I regenerate the data I'll use the new dataset format.
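To make the "last sentence treated as padding" point concrete, here is a minimal sketch of how a sentence-length list can come up one entry short. This is not the repository's actual code: both helper names and the idea of deriving lengths from sentence start positions are assumptions made for illustration only.

```python
def sentence_lengths(sent_start_positions, num_tokens):
    """Hypothetical helper: length of every sentence, including the last one."""
    # Appending the sequence end gives the final sentence a closing boundary.
    boundaries = list(sent_start_positions) + [num_tokens]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def sentence_lengths_missing_last(sent_start_positions):
    """Hypothetical reconstruction of the bug: only gaps between starts are used."""
    boundaries = list(sent_start_positions)
    # Without an end boundary, the last sentence never receives a length.
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# Two sentences: [CLS] a [SEP] [CLS] b c [SEP] -> sentences start at 0 and 3.
starts, num_tokens = [0, 3], 7
print(sentence_lengths(starts, num_tokens))   # [3, 4]
print(sentence_lengths_missing_last(starts))  # [3]
```

With only one length recorded, a pooling step that averages or maxes over tokens per sentence has nothing to tell it that the second sentence exists, which is consistent with mean_tokens and max_tokens being affected while sent_rep_tokens is not.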
segment_token_id to be SEP token, we need to toggle the current_segment_flag after appending it to the segment_ids list
The example below shows 2 sentences in src. However, token_type_ids/segment_ids indicate 3 sentences because of the toggle issue, and sent_lengths indicates 1 sentence because it misses the last one.
Example output before PR:
The changes in this PR would match the segment_id alignment from the BertSum paper:
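As a rough sketch of that alignment (not the PR's actual diff), the two functions below differ only in when the flag is toggled relative to the append. The function names, the integer flag, and the token ids 101/102 standing in for CLS/SEP are assumptions for illustration; the "toggle before" variant is my guess at the old ordering based on the description above.

```python
CLS, SEP = 101, 102  # illustrative ids only, assumed for this sketch


def segment_ids_toggle_after(input_ids, segment_token_id=SEP):
    """Append the current segment id first, then toggle once SEP has been seen."""
    segment_ids = []
    current_segment_flag = 0
    for token in input_ids:
        segment_ids.append(current_segment_flag)
        if token == segment_token_id:
            # SEP keeps the id of the sentence it closes; the next token flips.
            current_segment_flag = 1 - current_segment_flag
    return segment_ids


def segment_ids_toggle_before(input_ids, segment_token_id=SEP):
    """Assumed old ordering: toggling before the append shifts SEP to the next id."""
    segment_ids = []
    current_segment_flag = 0
    for token in input_ids:
        if token == segment_token_id:
            current_segment_flag = 1 - current_segment_flag
        segment_ids.append(current_segment_flag)
    return segment_ids


# Two sentences: [CLS] a [SEP] [CLS] b c [SEP]
input_ids = [CLS, 7, SEP, CLS, 8, 9, SEP]
print(segment_ids_toggle_after(input_ids))   # [0, 0, 0, 1, 1, 1, 1]
print(segment_ids_toggle_before(input_ids))  # [0, 0, 1, 1, 1, 1, 0]
```

In the "toggle before" output the ids change twice, so two sentences look like three segments. In the "toggle after" output each sentence, from its CLS through its SEP, shares a single id and the ids alternate between sentences.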