bfsujason / bertalign

Multilingual sentence alignment using sentence embeddings
GNU General Public License v3.0

Change parametrization #5

Open ChristianGeng opened 1 year ago

ChristianGeng commented 1 year ago

Thanks for the nice package! Following up on the issue concerning preprocessing suggestions, I have implemented an alternative parametrization that I would like to discuss. I hope you have time to discuss this.

Parametrize the shape in which text comes differently

The current implementation uses the is_split boolean flag to determine the form in which the input arrives.

As already discussed in the issue concerning preprocessing suggestions, it can sometimes be useful to have other options for how the data are passed to Bertalign.

In my particular case, it turns out to be better to pass src and target as lists, because I need to postprocess the data outside of Bertalign. Passing lists avoids some idempotency issues that I have seen. I am not going into details here, but can of course do so if needed.

So I would prefer to reparametrize is_split into a (ternary) split_type option:

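| split_type | is_split equivalent | description                  | default |
|------------|---------------------|------------------------------|---------|
| raw        | is_split=False      | splitting has to be done     | *       |
| lines      | is_split=True       |                              |         |
| tokenized  | n.a.                | tokenized sentences as lists |         |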
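
To make the three split_type options above concrete, here is a minimal sketch of how such a dispatch could look. It is not the code from this PR: the names to_sentences, split_type_from_is_split and naive_sentence_split are hypothetical, and the regex splitter is only a stand-in for whatever sentence splitter Bertalign actually uses.

```python
import re
from typing import List, Union

SPLIT_TYPES = ("raw", "lines", "tokenized")


def naive_sentence_split(text: str) -> List[str]:
    # Stand-in splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_type_from_is_split(is_split: bool) -> str:
    # Backwards-compatible mapping from the old boolean flag.
    return "lines" if is_split else "raw"


def to_sentences(text: Union[str, List[str]], split_type: str = "raw") -> List[str]:
    """Return a list of sentences for any of the three input shapes."""
    if split_type not in SPLIT_TYPES:
        raise ValueError(f"split_type must be one of {SPLIT_TYPES}, got {split_type!r}")
    if split_type == "tokenized":
        # Already a list of sentences: pass it through without touching it.
        if isinstance(text, str):
            raise TypeError("split_type='tokenized' expects a list of sentences")
        return list(text)
    if not isinstance(text, str):
        raise TypeError(f"split_type={split_type!r} expects a single string")
    if split_type == "lines":
        # One sentence per line; blank lines are dropped.
        return [line.strip() for line in text.splitlines() if line.strip()]
    # split_type == "raw": sentence splitting still has to be done.
    return naive_sentence_split(text)
```

With this shape, a caller that postprocesses sentences outside of Bertalign can hand over its lists with split_type="tokenized" and get them back unchanged, while old callers can translate is_split via split_type_from_is_split.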

Parametrize src and target languages differently

Allow the src and target languages to be passed as parameters. The current implementation relies on Google Translate to detect the language ID, which is an external dependency.

In order not to remove it, I have added some very inelegant code that keeps the existing parametrization intact as much as possible.

As far as I can see, the language ID is only used with split_type=='lines', i.e. is_split=True. So maybe there is a better alternative?
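
One possible shape for this, sketched under assumptions: an explicitly passed language wins, and detection is only attempted as a fallback. The function name resolve_language and the detect callable are hypothetical; detect stands in for the Google Translate based detection and is injected so that the external dependency stays optional.

```python
from typing import Callable, Optional


def resolve_language(
    text: str,
    lang: Optional[str] = None,
    detect: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the language code to use for `text`."""
    if lang is not None:
        # Explicit parameter: no detection call is made at all.
        return lang
    if detect is None:
        raise ValueError("no language given and no detector available")
    # Fallback: delegate to the injected detector (hypothetical hook).
    return detect(text)
```

Calling resolve_language(src_text, lang="de") never touches the detector, so callers who know their languages avoid the network dependency entirely.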

Tests

I have also added basic unit tests to show that the parametrization is OK. These can be run using pytest -sv tests/test_results.py once the test requirements are installed, assuming that the package itself is installed (I did this using pip install -e .)