Closed lorsanta closed 2 years ago
The computation of `seg_mask` relies on the fact that the sum of a segment composed of only pad token ids is equal to zero. https://github.com/coastalcph/trldc/blob/b843576875654bc887e904777ecc4a0dc3091ba5/dainlp/models/cls/hierarchical.py#L59
But with RoBERTa this is not true, since by default `pad_token_id = 1`. (source)
It would be better to compute `seg_mask` using `attention_mask` instead of `input_ids`.
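A minimal sketch of the issue, assuming the hierarchical model derives `seg_mask` by summing `input_ids` over each segment (the tensor shapes and variable names here are illustrative, not the repo's actual code):

```python
import torch

# Shape (batch, num_segments, seg_len): one real segment, one all-padding segment.
pad_token_id = 1  # RoBERTa's default; BERT uses 0
input_ids = torch.tensor([[[5, 6, 7], [pad_token_id] * 3]])
attention_mask = (input_ids != pad_token_id).long()

# Buggy check: assumes a fully padded segment sums to 0, which only
# holds when pad_token_id == 0 (e.g. BERT). With RoBERTa the padded
# segment sums to seg_len and is wrongly treated as real.
seg_mask_buggy = (input_ids.sum(dim=-1) != 0).long()

# Robust check: attention_mask is 0 on padding for every tokenizer,
# so a fully padded segment always sums to 0.
seg_mask_fixed = (attention_mask.sum(dim=-1) != 0).long()

print(seg_mask_buggy)  # tensor([[1, 1]]) — padded segment kept by mistake
print(seg_mask_fixed)  # tensor([[1, 0]])
```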
thank you for spotting this, i have fixed it