HHousen / TransformerSum

Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
https://transformersum.rtfd.io
GNU General Public License v3.0
429 stars · 58 forks

Fix segment ids and sentence lengths #35

Closed. salman1993 closed this pull request 4 years ago.

salman1993 commented 4 years ago

The example below shows 2 sentences in src. However, token_type_ids / segment_ids imply 3 sentences because of the toggle issue, and sent_lengths implies only 1 sentence because it misses the last one (see the quick check after the example output).

Example output before PR:

In [7]: j[0]
Out[7]:
{'src': [['how', 'do', 'i', 'do', 'my', 'chores', 'my', 'parents', 'wants', 'me', 'to', 'do', '.'],
  ['thank', 'you', 'so', 'much', '!']],
 'labels': [1, 0]}

In [8]: t[0]
Out[8]:
{'input_ids': [0, 9178, 109, 939, 109, 127, 21350, 127, 3254, 1072, 162, 7, 109, 479, 1437, 2, 1437, 0, 3392, 47, 98, 203, 27785, 2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
 'labels': [1, 0],
 'sent_rep_token_ids': [0, 17],
 'sent_lengths': [17]}
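To make the mismatch concrete, here is a quick check (a new illustrative snippet, not code from the PR) that counts what each field implies, using the values from the pre-PR output above:

```python
# Values copied from the pre-PR output above (hypothetical check, not repo code).
token_type_ids = [0] * 15 + [1] * 8 + [0]
sent_lengths = [17]
num_src_sentences = 2  # len(j[0]['src'])

# Under the toggle scheme, each change in segment id marks a new sentence.
implied_by_segments = 1 + sum(a != b for a, b in zip(token_type_ids, token_type_ids[1:]))
print(implied_by_segments)   # 3 sentences implied by token_type_ids
print(len(sent_lengths))     # 1 sentence implied by sent_lengths
print(num_src_sentences)     # 2 sentences actually in src
```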

The changes in this PR would match the segment_id alignment from the BertSum paper: [screenshot from the BertSum paper omitted]
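For reference, here is a minimal sketch of that interval alignment (the helper below is illustrative, not the repository's actual function): every token in sentence i gets segment id i % 2, and a length is recorded for every sentence, including the last one.

```python
def interval_segment_ids_and_lengths(input_ids, sent_rep_token_ids):
    """Illustrative helper: sentence i spans from its sent_rep token up to the
    next sent_rep token (or to the end of input_ids for the final sentence)."""
    boundaries = list(sent_rep_token_ids) + [len(input_ids)]
    token_type_ids, sent_lengths = [], []
    for i in range(len(sent_rep_token_ids)):
        length = boundaries[i + 1] - boundaries[i]
        sent_lengths.append(length)
        # Alternate 0/1 per sentence, as in BertSum's interval segment embeddings.
        token_type_ids.extend([i % 2] * length)
    return token_type_ids, sent_lengths
```

On the example above (24 input ids, sent_rep_token_ids == [0, 17]), this yields token_type_ids of 17 zeros followed by 7 ones and sent_lengths == [17, 7], so the last sentence is no longer dropped.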

HHousen commented 4 years ago

@salman1993 Nice fix. So, all models trained using sent_rep_tokens as the pooling mode should still be correct. Those trained using mean_tokens or max_tokens have incorrect results because of this bug: The last sentence was treated as padding essentially. Furthermore, I have to regenerate all the data for all extractive datasets in order to add the last sentence length and fix the token_type_ids. I have an implementation of #27 that is nearly done. So, when I regenerate the data I'll use the new dataset format.