bfsujason / bertalign

Multilingual sentence alignment using sentence embeddings
GNU General Public License v3.0

preprocessing suggestions #3

Open alvinntnu opened 1 year ago

alvinntnu commented 1 year ago

Thank you for creating this wonderful package. I just had a quick question about improving the accuracy of the alignment. Do you have any suggestions about text preprocessing, especially with symbols and punctuation? Would removing specific punctuation marks from the texts have a significant impact on performance? Thanks!

bfsujason commented 1 year ago

I'm not sure whether removing punctuation would improve the accuracy. It's very easy to give it a try, though: just change the code in aligner.py and strip the punctuation from the source and target sentences.
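If anyone wants to experiment, below is a minimal, self-contained sketch of the kind of cleaning step that could be applied to the source and target sentence lists before they are embedded. The helper name and the exact hook point in aligner.py are illustrative, not bertalign's actual API:

```python
import re
import unicodedata

def strip_punctuation(sentences):
    """Remove Unicode punctuation from each sentence.

    unicodedata covers both ASCII marks and full-width CJK
    punctuation (e.g. 。，！？), which matters for
    Chinese-English alignment.
    """
    cleaned = []
    for sent in sentences:
        chars = [c for c in sent
                 if not unicodedata.category(c).startswith('P')]
        # Collapse the whitespace runs left behind by removal
        cleaned.append(re.sub(r'\s+', ' ', ''.join(chars)).strip())
    return cleaned

print(strip_punctuation(["Hello, world!", "你好，世界。"]))
# ['Hello world', '你好世界']
```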

Instead of tweaking the preprocessing, I think using other sentence similarity measures may improve the alignment accuracy. Bertalign currently calculates similarity between sentence pairs based on sentence embeddings. However, recent studies (Zhang et al., 2019; Wang & Yu, 2023) show that token-level similarity performs better on Semantic Textual Similarity tasks.
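For anyone curious what token-level scoring looks like in practice, Zhang et al.'s method is available as the bert-score package on PyPI. The snippet below just scores one illustrative sentence pair; it is not a drop-in replacement for bertalign's similarity function, and wiring it into the aligner would take more work:

```python
# pip install bert-score
from bert_score import score

# Illustrative sentence pair; for non-English text, pass the
# appropriate language code and bert-score will pick a
# multilingual model.
cands = ["The cat sat on the mat."]
refs = ["A cat was sitting on the mat."]

# Returns precision, recall, and F1 as tensors, one value per pair
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.4f}")
```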

References

Wang, H. and Yu, D., 2023, July. Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 563-570).

Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q. and Artzi, Y., 2019. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.