bfsujason / bertalign

Multilingual sentence alignment using sentence embeddings
GNU General Public License v3.0

preprocessing suggestions #3

Open alvinntnu opened 1 year ago

alvinntnu commented 1 year ago

Thank you for creating this wonderful package. I just had a quick question about improving the accuracy of the alignment. Do you have any suggestions about text preprocessing, especially with symbols and punctuation? Would removing specific punctuation marks from the texts have a significant impact on performance? Thanks!

bfsujason commented 1 year ago

I'm not sure whether removing punctuation would improve the accuracy. It's very easy to give it a try, though: just change the code in aligner.py and strip the punctuation from the source and target sentences.
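If anyone wants to experiment, below is a minimal, self-contained sketch of the kind of cleaning step that could be applied to the source and target sentence lists before they are embedded. The helper name and the exact hook point in aligner.py are illustrative, not bertalign's actual API:

```python
import re
import unicodedata

def strip_punctuation(sentences):
    """Remove Unicode punctuation from each sentence.

    unicodedata covers both ASCII marks and full-width CJK
    punctuation (e.g. 。，！？), which matters for
    Chinese-English alignment.
    """
    cleaned = []
    for sent in sentences:
        chars = [c for c in sent
                 if not unicodedata.category(c).startswith('P')]
        # Collapse the whitespace runs left behind by removal
        cleaned.append(re.sub(r'\s+', ' ', ''.join(chars)).strip())
    return cleaned

print(strip_punctuation(["Hello, world!", "你好，世界。"]))
# ['Hello world', '你好世界']
```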

Instead of tweaking the preprocessing, I think using other sentence similarity measures may improve the alignment accuracy. Bertalign currently calculates similarity between sentence pairs based on sentence embeddings. However, recent studies (Zhang et al., 2019; Wang & Yu, 2023) show that token-level similarity performs better on Semantic Textual Similarity tasks.
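For anyone curious what token-level scoring looks like in practice, Zhang et al.'s method is available as the bert-score package on PyPI. The snippet below just scores one illustrative sentence pair; it is not a drop-in replacement for bertalign's similarity function, and wiring it into the aligner would take more work:

```python
# pip install bert-score
from bert_score import score

# Illustrative sentence pair; for non-English text, pass the
# appropriate language code and bert-score will pick a
# multilingual model.
cands = ["The cat sat on the mat."]
refs = ["A cat was sitting on the mat."]

# Returns precision, recall, and F1 as tensors, one value per pair
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.4f}")
```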

References

Wang, H. and Yu, D., 2023, July. Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 563-570).

Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q. and Artzi, Y., 2019. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.