coastalcph / trldc

Transformer-based Long Document Classification

seg_mask is not computed correctly #1

Closed lorsanta closed 2 years ago

lorsanta commented 2 years ago

The computation of seg_mask relies on the assumption that a segment composed entirely of pad token ids sums to zero. https://github.com/coastalcph/trldc/blob/b843576875654bc887e904777ecc4a0dc3091ba5/dainlp/models/cls/hierarchical.py#L59

But with RoBERTa this does not hold, since its default pad_token_id is 1. (source)

It would be better to compute seg_mask using attention_mask instead of input_ids.
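To illustrate the failure mode and the proposed fix, here is a plain-Python sketch (the repo uses PyTorch tensors; the toy shapes, token ids, and variable names below are assumptions chosen for illustration, not the repo's actual code). With RoBERTa's pad_token_id = 1, an all-padding segment sums to 4 rather than 0, so the sum-over-input_ids check marks it as a real segment; summing attention_mask instead gives the correct mask.

```python
pad_token_id = 1  # RoBERTa default; BERT's pad id is 0, which is why the original check worked there

# Toy batch: 2 documents x 3 segments x 4 tokens.
# In doc 0 the last segment is all padding; in doc 1 the last two are.
input_ids = [
    [[0, 5, 6, 2], [0, 7, 2, 1], [1, 1, 1, 1]],
    [[0, 8, 2, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
]

# attention_mask as a tokenizer would produce it: 1 for real tokens, 0 for padding.
attention_mask = [[[int(tok != pad_token_id) for tok in seg] for seg in doc]
                  for doc in input_ids]

# Buggy check from the issue: segment is "real" iff its token-id sum is nonzero.
# With pad_token_id = 1 every all-pad segment sums to 4, so nothing is masked out.
seg_mask_buggy = [[int(sum(seg) != 0) for seg in doc] for doc in input_ids]

# Suggested fix: segment is real iff it attends to at least one token.
seg_mask_fixed = [[int(sum(seg) > 0) for seg in doc] for doc in attention_mask]

print(seg_mask_buggy)  # [[1, 1, 1], [1, 1, 1]]  <- wrong: padding segments kept
print(seg_mask_fixed)  # [[1, 1, 0], [1, 0, 0]]  <- correct
```

The attention_mask version also stays correct for tokenizers whose pad id coincides with a real token id, since the mask is produced by the tokenizer rather than inferred from token values.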

dainlp commented 2 years ago

thank you for spotting this, i have fixed it