cisnlp / simalign

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
MIT License

Indices missing (just for ArgMax) #33

Closed LeonHammerla closed 1 year ago

LeonHammerla commented 1 year ago

I am facing problems when aligning sentences where one of them contains spelling mistakes. With the ArgMax method, the result is missing indices.

For example, given two sentences:

['Ds', 'ist', 'en', 'Test', '.']
['This', 'is', 'a', 'test', '.']

Method ArgMax --> [(1, 1), (2, 2), (3, 3), (4, 4)] (is missing (0, 0))

Method Match --> [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)] (is correct)

pdufter commented 1 year ago

That’s not an issue with the library but simply due to the fact that the similarities in the underlying model are not as you expect them to be.

In order to fix that you would need to “correct” the vector representations in the underlying model.

LeonHammerla commented 1 year ago

But shouldn't every index in the base sentence match at least one index in the target sentence?

masoudjs commented 1 year ago

The ArgMax method will not necessarily assign an alignment edge to all tokens. It only adds an edge between x and y if y has the highest cosine similarity to x and vice versa.

It's the Match method that assigns a target word to each source word (if the source sentence has fewer tokens).
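To make the difference concrete, here is a minimal sketch in plain Python. It is not simalign's actual implementation: the similarity values in `SIM` are invented for illustration (they are not real mBERT similarities), the function names `argmax_align` and `match_align` are hypothetical, and the brute-force search in `match_align` is only a stand-in for a proper assignment algorithm.

```python
from itertools import permutations

# Hypothetical similarities: rows = source tokens, columns = target tokens.
# Row 0 stands for the misspelled 'Ds'; its best column is 1 ('is'), not 0.
SIM = [
    [0.40, 0.45, 0.30, 0.20, 0.10],
    [0.30, 0.90, 0.20, 0.10, 0.10],
    [0.20, 0.20, 0.80, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.90, 0.10],
    [0.10, 0.10, 0.10, 0.10, 0.95],
]


def argmax_align(sim):
    """Keep (i, j) only if j is the best target for i AND i is the best source for j."""
    n_src, n_trg = len(sim), len(sim[0])
    best_trg = [max(range(n_trg), key=lambda j: sim[i][j]) for i in range(n_src)]
    best_src = [max(range(n_src), key=lambda i: sim[i][j]) for j in range(n_trg)]
    return [(i, j) for i, j in enumerate(best_trg) if best_src[j] == i]


def match_align(sim):
    """Give every source token a distinct target token, maximizing total similarity.

    Brute force over all assignments; only feasible for very short sentences.
    Assumes the source has no more tokens than the target.
    """
    n_src, n_trg = len(sim), len(sim[0])
    assert n_src <= n_trg
    best = max(
        permutations(range(n_trg), n_src),
        key=lambda p: sum(sim[i][p[i]] for i in range(n_src)),
    )
    return list(enumerate(best))


print(argmax_align(SIM))  # [(1, 1), (2, 2), (3, 3), (4, 4)]
print(match_align(SIM))   # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```

With these made-up numbers, source token 0 prefers target 1, but target 1 prefers source 1, so the mutual-argmax rule drops token 0 entirely, while the assignment-based rule still pairs it with target 0.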

LeonHammerla commented 1 year ago

OK, thanks. That's a reasonable answer.