Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Specify different "unit" types in filters. #38

Closed BrightXiaoHan closed 2 years ago

BrightXiaoHan commented 2 years ago

I want to filter an English-Chinese parallel corpus, but in "LengthFilter" and "LengthRatioFilter" I can only specify one unit type.

Is it possible to configure the filter like this?

min_length: [20, 10]
max_length: [100, 200]
unit: [char, word]

svirpioj commented 2 years ago

It's a good point that some languages in the parallel data might work better with character counts and others with word counts. I had not thought about that before.

I needed to think a bit about how to do this nicely without breaking backwards compatibility (i.e. still allowing non-list input), but it is now in the develop branch (https://github.com/Helsinki-NLP/OpusFilter/pull/40).
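
For readers wondering what "allowing also non-list input" could look like, here is a minimal sketch of the idea in plain Python. It is not the code from the pull request: the helper names `as_list`, `segment_length`, and `length_accept` are hypothetical, and word counting is assumed to be simple whitespace splitting. A scalar parameter is broadcast to every language, while a list gives one value per language.

```python
def as_list(value, n):
    """Return one value per segment (hypothetical helper).

    A scalar is repeated n times; a list or tuple must already have n items.
    """
    if isinstance(value, (list, tuple)):
        if len(value) != n:
            raise ValueError(f"expected {n} values, got {len(value)}")
        return list(value)
    return [value] * n


def segment_length(segment, unit):
    """Length of a segment in whitespace-separated words or in characters."""
    if unit == "word":
        return len(segment.split())
    if unit in ("char", "character"):
        return len(segment)
    raise ValueError(f"unknown unit: {unit!r}")


def length_accept(segments, min_length=1, max_length=100, unit="word"):
    """Accept a segment tuple only if every segment is within its own bounds."""
    n = len(segments)
    mins = as_list(min_length, n)
    maxs = as_list(max_length, n)
    units = as_list(unit, n)
    return all(
        lo <= segment_length(seg, u) <= hi
        for seg, lo, hi, u in zip(segments, mins, maxs, units)
    )
```

With this sketch, `length_accept(pair, min_length=1, max_length=100, unit="word")` keeps the old single-unit behaviour, and `length_accept(pair, min_length=[2, 3], max_length=[5, 10], unit=["word", "char"])` applies word counts to the first language and character counts to the second.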

BrightXiaoHan commented 2 years ago

Thanks