cisnlp / simalign

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
MIT License

Alignments for BPE token #34

Closed: moore3930 closed this issue 1 year ago

moore3930 commented 1 year ago

Hi, I was wondering whether simalign supports extracting alignments at the BPE level?

masoudjs commented 1 year ago

Hi, I think the easiest way would be to give BPE-segmented text as input (instead of word-segmented text). The model then treats the BPE pieces as words.
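A minimal sketch of that first suggestion: segment the input into BPE pieces yourself and hand those pieces to simalign as if they were words. The greedy WordPiece-style splitter and the tiny vocabulary below are hypothetical illustrations, not part of simalign:

```python
# Hypothetical helper: greedy longest-match split into WordPiece-style
# pieces, where '##' marks a continuation piece (mBERT's convention).
def wordpiece_split(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no matching piece: give up on this word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, for illustration only
vocab = {"un", "##believ", "##able", "story"}
src_words = ["unbelievable", "story"]

# BPE-segmented "sentence" to pass to simalign in place of words
src_pieces = [p for w in src_words for p in wordpiece_split(w, vocab)]
print(src_pieces)  # ['un', '##believ', '##able', 'story']
```

Passing `src_pieces` (and a similarly segmented target sentence) to `get_word_aligns` should then yield alignment indexes that refer to BPE positions rather than word positions.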

The other way is to edit the code: https://github.com/cisnlp/simalign/blob/05332bf2f6ccde075c3aba94248d6105d9f95a00/simalign/simalign.py#L232 In this line, we convert the aligned BPE indexes (i, j) to word indexes for source and target tokens. You can just keep (i, j) and not convert them. You can find the mappings in l1_b2w_map and l2_b2w_map.
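To illustrate the second suggestion, here is a sketch of that conversion step with made-up data (the alignment pairs and segmentations below are invented for the example; only the role of `l1_b2w_map` / `l2_b2w_map` follows the source):

```python
# l1_b2w_map / l2_b2w_map map each BPE position to the index of the
# word it belongs to.
l1_b2w_map = [0, 0, 0, 1]   # un, ##believ, ##able, story
l2_b2w_map = [0, 0, 1]      # unglaub, ##liche, Geschichte

# Hypothetical BPE-level alignment pairs (i, j) from the matching step
bpe_aligns = [(0, 0), (1, 0), (2, 1), (3, 2)]

# What the linked line does: convert to word-level pairs (deduplicated)
word_aligns = sorted({(l1_b2w_map[i], l2_b2w_map[j]) for i, j in bpe_aligns})
print(word_aligns)  # [(0, 0), (1, 1)]

# To stay at the BPE level, skip the lookup and keep bpe_aligns as-is.
```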

moore3930 commented 1 year ago

Many thanks for your quick reply. One concern: this requires using the same tokenizer as the underlying pretrained model (e.g., mBERT) for my own task, right?

Anyway, I will try it later following your suggestion and let you know whether it works.