cisnlp / simalign

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
MIT License

similarity alignment of sentences #17

Closed jiangweiatgithub closed 3 years ago

jiangweiatgithub commented 3 years ago

I know this might sound irrelevant, but can the logic of aligning words in two sentences be used to align sentences in two articles?

pdufter commented 3 years ago

Hi @jiangweiatgithub if I understand you correctly this sounds like parallel sentence mining or sentence retrieval. You can use BERT for such a task but I guess there are alternatives that work much better (maybe checkout Sentence-BERT, classical approaches like tf-idf, or methods proposed in the BUCC shared task for parallel sentence mining).
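
To make the tf-idf suggestion concrete, here is a minimal sketch of retrieval-style sentence matching: build tf-idf vectors over both sentence lists and, for each source sentence, pick the most cosine-similar target sentence. This is a monolingual toy (tf-idf only works when the two sides share vocabulary); for cross-lingual mining you would swap in multilingual sentence embeddings such as Sentence-BERT. All function names here are illustrative, not part of any library.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build tf-idf vectors (word -> weight dicts) for whitespace-tokenized sentences."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in d)          # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # smoothed idf
    return [{w: c * idf[w] for w, c in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_pairs(src, trg):
    """For each source sentence, return the index of the most similar target sentence."""
    vecs = tfidf_vectors(src + trg)  # shared idf over both collections
    sv, tv = vecs[:len(src)], vecs[len(src):]
    return [max(range(len(tv)), key=lambda j: cosine(u, tv[j])) for u in sv]
```

Methods from the BUCC shared task add a filtering step on top of such scores to decide which pairs are actually parallel, rather than always taking the best match.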

jiangweiatgithub commented 3 years ago

Thank you for your prompt response, @pdufter! You got me right. I had checked BUCC, but it seems that none of the entries' sentence-similarity calculations really consider the context in which a specific sentence is located. I guess such context info is used by SimAlign, right?

pdufter commented 3 years ago

SimAlign does not consider any cross-sentence context. Is that what you meant?

jiangweiatgithub commented 3 years ago

I mean, when SimAlign is trying to align a word in Sentence A with two or more possible words in Sentence B, it will give more weight to the one located within a context - the word before and/or the word after - that is already aligned for sure. For example, Sentence A in English: "I like buying books, not reading books." Sentence B in Chinese, words segmented by spaces: 我 喜欢 买 书,而非 读 书 。

As you might see or guess, both instances of "books" are translated as "书". Assuming "buying", ",", "reading", and "." have already been aligned, both instances of "books" should be aligned as well.

pdufter commented 3 years ago

Yes, since mBERT computes contextualized embeddings and SimAlign uses them directly, this context is considered. For this kind of alignment, positional embeddings also have a big (not always good) influence.
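
A toy illustration of why contextualization resolves the repeated "books" (hand-made vectors, not real mBERT output; the `@1`/`@2` suffixes are just labels for the two occurrences): because each occurrence picks up its neighbors' context, the two "books" vectors differ slightly, and a simple argmax over cosine similarities pairs each one with the right "书".

```python
import math

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# English tokens with fake "contextual" vectors: a static embedding would give
# both "books" occurrences the same vector; a contextual one shifts each toward
# its neighborhood ("buying ..." vs "reading ...").
src = {
    "buying":  [1.0, 0.1, 0.0],
    "books@1": [0.9, 0.5, 0.1],   # "books" after "buying"
    "reading": [0.0, 0.1, 1.0],
    "books@2": [0.1, 0.5, 0.9],   # "books" after "reading"
}
# Chinese tokens, similarly contextualized.
trg = {
    "买":   [1.0, 0.2, 0.0],
    "书@1": [0.8, 0.6, 0.2],
    "读":   [0.0, 0.2, 1.0],
    "书@2": [0.2, 0.6, 0.8],
}

# Greedy argmax alignment over the similarity matrix.
alignment = {s: max(trg, key=lambda t: cos(v, trg[t])) for s, v in src.items()}
```

With these vectors, each "books" occurrence aligns to the "书" in the matching context; with identical (static) vectors for both occurrences, the argmax could not distinguish them.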

jiangweiatgithub commented 3 years ago

Is this related to the distortion correction parameter?

pdufter commented 3 years ago

The distortion parameter can be used to push alignments toward the diagonal (i.e., similar relative positions in the sentences). But mBERT already has positional embeddings, which yield a similar effect, so the distortion parameter does not have a big impact when using mBERT.
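
A rough sketch of what such a distortion prior does (this is an illustration, not SimAlign's exact formula): rescale each entry of the similarity matrix by how far the pair sits from the diagonal, measured in relative sentence position, so off-diagonal alignments are penalized.

```python
def apply_distortion(sim, distortion=0.5):
    """Rescale an m x n similarity matrix (list of lists) toward the diagonal.

    distortion=0 leaves the matrix unchanged; larger values penalize pairs
    whose relative positions in the two sentences differ more.
    """
    m, n = len(sim), len(sim[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            # Relative-position gap in [0, 1]; 0 exactly on the diagonal.
            gap = abs(i / max(m - 1, 1) - j / max(n - 1, 1))
            row.append(sim[i][j] * (1.0 - distortion * gap))
        out.append(row)
    return out
```

Any alignment extracted from the rescaled matrix (e.g., by argmax) is then biased toward monotone, diagonal-ish alignments.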

jiangweiatgithub commented 2 months ago

Recently I revisited this sentence alignment by passing in two lists of whole sentences, and got some meaningful alignment results. Is that intentional or just accidental?

pdufter commented 2 months ago

Hm, not sure about this. There is generally no intention of supporting sentence alignment in SimAlign. Can you share more details about what exactly you did?

jiangweiatgithub commented 2 months ago

Here you go with the code:

from simalign import SentenceAligner

def read_first_n_lines(file_path, n):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = [file.readline().strip() for _ in range(n)]
    return lines

# making an instance of our model.
# You can specify the embedding model and all alignment settings in the constructor.
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# The source and target sentences should be tokenized to words.
src_sentence = read_first_n_lines(r'X:\repos\similarity_analysis\man_065.txt',65)
trg_sentence = read_first_n_lines(r'X:\repos\similarity_analysis\woman_051.txt',51)
print(len(src_sentence))
print(len(trg_sentence))

# The output is a dictionary with different matching methods.
# Each method has a list of pairs indicating the indexes of aligned words (The alignments are zero-indexed).
alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)

#print(alignments)
for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])