bfsujason / bertalign

Multilingual sentence alignment using sentence embeddings
GNU General Public License v3.0
91 stars 42 forks

More info on configuration options #4

Open RacheleSprugnoli opened 1 year ago

RacheleSprugnoli commented 1 year ago

Hi, thanks for providing this code! Could you please give more information (e.g. a brief explanation) of the following options?

Thank you in advance! Rachele

bfsujason commented 1 year ago

max_align is the maximum alignment type, such as 1:1, 1:2, etc. For example, max_align=5 means the allowed alignments are 1:0, 0:1, 1:1, 1:2, 2:1, 2:2, 2:3, and 3:2. You can set this parameter to a higher value if the corpus to be aligned contains many complex alignments.
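To make the max_align description concrete, here is a minimal sketch that enumerates alignment types. It assumes the cap applies to the total number of source plus target sentences in one alignment, which fits 2:3 and 3:2 being the largest types listed for 5. The function name `alignment_types_up_to` is hypothetical and is not bertalign's own code. Under this assumption alone, pairs such as 1:3 and 1:4 also qualify; this thread does not say whether bertalign admits them.

```python
def alignment_types_up_to(max_align):
    """Enumerate (source_count, target_count) pairs under a size cap.

    Assumption (not confirmed by the thread): the cap is on
    source_count + target_count. Hypothetical helper, not bertalign's API.
    """
    # Deletions and insertions (1:0 and 0:1) are always included here.
    types = [(1, 0), (0, 1)]
    for x in range(1, max_align):
        for y in range(1, max_align):
            if x + y <= max_align:
                types.append((x, y))
    return types


if __name__ == "__main__":
    for x, y in alignment_types_up_to(5):
        print(f"{x}:{y}")
```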

top_k is the number of nearest target neighbors (k) searched for each source sentence in the first-step alignment.
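As an illustration of what a k-nearest-neighbor search over sentence embeddings looks like, here is a small pure-Python sketch. It uses plain cosine similarity and brute-force sorting. The names `cosine` and `top_k_targets` are made up for this example, and bertalign's actual search may be implemented differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_targets(src_vecs, tgt_vecs, k):
    """For each source vector, return the indices of its k most similar targets.

    Hypothetical helper for illustration only; brute force, O(n * m log m).
    """
    neighbors = []
    for s in src_vecs:
        ranked = sorted(
            range(len(tgt_vecs)),
            key=lambda j: cosine(s, tgt_vecs[j]),
            reverse=True,
        )
        neighbors.append(ranked[:k])
    return neighbors


if __name__ == "__main__":
    src = [[1.0, 0.0], [0.0, 1.0]]
    tgt = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(top_k_targets(src, tgt, k=2))
```

A larger k gives the second step more candidate anchors to work with, at the cost of more computation.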

win is the search window of dynamic programming in the second-step alignment.

skip is the predefined similarity score for 1:0 and 0:1 alignments. If your corpus contains many omissions and insertions, you can set this value to a larger one, e.g. skip=0.

margin represents modified cosine similarity as proposed in https://doi.org/10.1093/llc/fqac089.

len_penalty considers the length difference between source and target sentences when calculating similarity between sentence pairs.

is_split=True means the corpus has already been split into sentences. Otherwise, bertalign uses sentence-splitter to split the bitexts into sentences.

jdough1982 commented 3 months ago

Hi. Is there a way to specify max_align with more granularity? For my use case, I would like to limit the allowable alignments to 1:1, 1:2, ..., 1:n, and the inverse thereof (1:1, 2:1, ..., n:1). In other words, I want to exclude 1:0, 0:1, and many-to-many alignments.

EDIT: nvm modifying get_alignment_types or hardcoding second_alignment_types seems to have done the trick.
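For anyone with the same need, the restriction described above (only 1:n and n:1, no 1:0, 0:1, or many-to-many) can be sketched as a filter over a list of alignment types. This is a guess at the idea, not the actual change made to get_alignment_types. The function name `keep_one_to_many` and the plain list-of-tuples representation are assumptions for illustration.

```python
def keep_one_to_many(types):
    """Keep only 1:n and n:1 alignment types (n >= 1).

    `types` is an iterable of (source_count, target_count) pairs.
    Drops 1:0, 0:1, and many-to-many pairs such as 2:2 or 2:3.
    Hypothetical helper, not part of bertalign.
    """
    return [
        (x, y)
        for x, y in types
        if (x == 1 and y >= 1) or (y == 1 and x >= 1)
    ]


if __name__ == "__main__":
    allowed = [(1, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2)]
    print(keep_one_to_many(allowed))
```

If the aligner expects a NumPy array rather than a list of tuples, the result would need converting before use.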