HHousen / TransformerSum

Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
https://transformersum.rtfd.io
GNU General Public License v3.0
429 stars · 58 forks

Fix segment ids and sentence lengths #35

Closed. salman1993 closed this pull request 4 years ago.

salman1993 commented 4 years ago

The example below shows 2 sentences in src. However, token_type_ids / segment_ids imply 3 sentences because of the toggle issue, and sent_lengths implies only 1 sentence because it misses the last one (see the quick check after the example output).

Example output before PR:

In [7]: j[0]
Out[7]:
{'src': [['how', 'do', 'i', 'do', 'my', 'chores', 'my', 'parents', 'wants', 'me', 'to', 'do', '.'],
  ['thank', 'you', 'so', 'much', '!']],
 'labels': [1, 0]}

In [8]: t[0]
Out[8]:
{'input_ids': [0, 9178, 109, 939, 109, 127, 21350, 127, 3254, 1072, 162, 7, 109, 479, 1437, 2, 1437, 0, 3392, 47, 98, 203, 27785, 2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
 'labels': [1, 0],
 'sent_rep_token_ids': [0, 17],
 'sent_lengths': [17]}
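To make the mismatch concrete, here is a quick check (a new illustrative snippet, not code from the PR) that counts what each field implies, using the values from the pre-PR output above:

```python
# Values copied from the pre-PR output above (hypothetical check, not repo code).
token_type_ids = [0] * 15 + [1] * 8 + [0]
sent_lengths = [17]
num_src_sentences = 2  # len(j[0]['src'])

# Under the toggle scheme, each change in segment id marks a new sentence.
implied_by_segments = 1 + sum(a != b for a, b in zip(token_type_ids, token_type_ids[1:]))
print(implied_by_segments)   # 3 sentences implied by token_type_ids
print(len(sent_lengths))     # 1 sentence implied by sent_lengths
print(num_src_sentences)     # 2 sentences actually in src
```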

The changes in this PR would match the segment_id alignment from the BertSum paper: [screenshot from the BertSum paper omitted]
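For reference, here is a minimal sketch of that interval alignment (the helper below is illustrative, not the repository's actual function): every token in sentence i gets segment id i % 2, and a length is recorded for every sentence, including the last one.

```python
def interval_segment_ids_and_lengths(input_ids, sent_rep_token_ids):
    """Illustrative helper: sentence i spans from its sent_rep token up to the
    next sent_rep token (or to the end of input_ids for the final sentence)."""
    boundaries = list(sent_rep_token_ids) + [len(input_ids)]
    token_type_ids, sent_lengths = [], []
    for i in range(len(sent_rep_token_ids)):
        length = boundaries[i + 1] - boundaries[i]
        sent_lengths.append(length)
        # Alternate 0/1 per sentence, as in BertSum's interval segment embeddings.
        token_type_ids.extend([i % 2] * length)
    return token_type_ids, sent_lengths
```

On the example above (24 input ids, sent_rep_token_ids == [0, 17]), this yields token_type_ids of 17 zeros followed by 7 ones and sent_lengths == [17, 7], so the last sentence is no longer dropped.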

HHousen commented 4 years ago

@salman1993 Nice fix. So, all models trained using sent_rep_tokens as the pooling mode should still be correct. Those trained using mean_tokens or max_tokens have incorrect results because of this bug: The last sentence was treated as padding essentially. Furthermore, I have to regenerate all the data for all extractive datasets in order to add the last sentence length and fix the token_type_ids. I have an implementation of #27 that is nearly done. So, when I regenerate the data I'll use the new dataset format.