cisnlp / simalign

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
MIT License

How to get matchings from alignment #29

VanderpoelLiam closed this issue 2 years ago

VanderpoelLiam commented 2 years ago

I have the following example:

Sentence A: a # 9.8 m deficit recorded for 2014/15 at an essex hospital is to be investigated by a health service watchdog.

Sentence B: A £9.8m deficit recorded for 2014/15 at an Essex hospital is to be investigated by a health service watchdog.

When I run the following:

    myaligner = simalign.SentenceAligner(token_type="word")
    aligns = myaligner.get_word_aligns(sentence_A, sentence_B)['itermax']

This produces an aligns list of the form:

[(0, 0), (2, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]

I cannot figure out how to get from this to a matching of the form:

[(0, 0), (1, 1), (2, 1), (3, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]
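To make the gap between the two lists concrete, here is a small sketch in plain Python that compares them. The helper `unaligned_indices` is a hypothetical name and not part of simalign; the token counts (21 for sentence A, 19 for sentence B) are an assumption based on splitting both sentences on whitespace.

```python
# Alignment returned with token_type="word" (same pairs as in the post above).
word_level = [(0, 0), (2, 1)] + [(i, i - 2) for i in range(4, 21)]

# Matching to be reproduced (same pairs as in the post above).
expected = [(0, 0), (1, 1), (2, 1), (3, 1)] + [(i, i - 2) for i in range(4, 21)]


def unaligned_indices(pairs, length, side):
    """Hypothetical helper: indices in range(length) absent from one side of the pairs.

    side=0 checks source indices, side=1 checks target indices.
    """
    seen = {pair[side] for pair in pairs}
    return [i for i in range(length) if i not in seen]


# Pairs present in the expected matching but absent from the word-level output.
missing_pairs = sorted(set(expected) - set(word_level))

# Assumed token counts: 21 whitespace tokens in sentence A, 19 in sentence B.
unaligned_source = unaligned_indices(word_level, 21, side=0)
unaligned_target = unaligned_indices(word_level, 19, side=1)

print(missing_pairs)     # [(1, 1), (3, 1)]
print(unaligned_source)  # [1, 3]
print(unaligned_target)  # []
```

In other words, source tokens 1 and 3 ("#" and "m") are the only ones left without a partner in the word-level output, while every target token is covered.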

The interactive website does this in order to produce its graphs, but I cannot find the corresponding step in the code provided.

Thanks in advance!

pdufter commented 2 years ago

@VanderpoelLiam The alignments on the website are computed at the subword level. Do you still get different results when you set token_type="bpe"?

VanderpoelLiam commented 2 years ago

Thanks! Adding token_type="bpe" fixes my issue.
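For anyone landing here later, a minimal sketch of the change, reusing only the calls already shown in this thread (`SentenceAligner` and `get_word_aligns`). The wrapper name `align_words` and the constants are hypothetical and only there for illustration. The sentences are split on whitespace up front, on the assumption that the pair indices refer to positions in those token lists.

```python
# Sentences from the original post, tokenized on whitespace.
SENTENCE_A = ("a # 9.8 m deficit recorded for 2014/15 at an essex hospital "
              "is to be investigated by a health service watchdog.")
SENTENCE_B = ("A £9.8m deficit recorded for 2014/15 at an Essex hospital "
              "is to be investigated by a health service watchdog.")
TOKENS_A = SENTENCE_A.split()
TOKENS_B = SENTENCE_B.split()


def align_words(tokens_a, tokens_b, token_type="bpe", method="itermax"):
    """Hypothetical wrapper: return the alignment pairs for one matching method.

    token_type="bpe" computes alignments at the subword level, as suggested
    above; token_type="word" is the setting from the original post.
    """
    # Imported lazily so this file can be loaded without simalign installed.
    from simalign import SentenceAligner

    aligner = SentenceAligner(token_type=token_type)
    return aligner.get_word_aligns(tokens_a, tokens_b)[method]
```

Calling `align_words(TOKENS_A, TOKENS_B)` runs the subword-level variant, and passing `token_type="word"` reproduces the setup from the first post, which makes it easy to compare the two outputs side by side. The first call downloads the pretrained model, so it needs network access.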