Closed BrightXiaoHan closed 2 years ago
It's a good point that some languages in the parallel data might work better with character counts and others with word counts. I had not thought about that before.
I needed to think a bit about how to do this nicely without breaking backwards compatibility (i.e. still allowing non-list input), but it is now in the develop branch (https://github.com/Helsinki-NLP/OpusFilter/pull/40).
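
To illustrate the idea of accepting either a single unit or one unit per language, here is a minimal Python sketch. It is not OpusFilter's actual code: the function names (`segment_length`, `normalize_units`, `pair_lengths`, `accept`) and the unit names `"word"` and `"char"` are assumptions made for this example only.

```python
def segment_length(segment, unit):
    """Length of one segment in whitespace-separated words or in characters."""
    if unit == "word":
        return len(segment.split())
    if unit == "char":
        return len(segment)
    raise ValueError(f"unknown unit: {unit!r}")


def normalize_units(unit, n):
    """Accept a single unit (old style) or a list with one unit per segment."""
    if isinstance(unit, str):
        # Backwards-compatible path: the same unit applies to every language.
        return [unit] * n
    if len(unit) != n:
        raise ValueError("need exactly one unit per segment")
    return list(unit)


def pair_lengths(segments, unit="word"):
    """Lengths of all segments, each measured in its own unit."""
    units = normalize_units(unit, len(segments))
    return [segment_length(s, u) for s, u in zip(segments, units)]


def accept(segments, unit="word", min_length=1, max_length=100):
    """Keep the pair only if every segment length is within the bounds."""
    lengths = pair_lengths(segments, unit)
    return all(min_length <= n <= max_length for n in lengths)


pair = ("This is a test", "这是一个测试")
print(pair_lengths(pair, "word"))            # single unit, old-style input
print(pair_lengths(pair, ["word", "char"]))  # one unit per language
```

With a single `"word"` unit the unsegmented Chinese side counts as one word, so a length bound tuned for English words would misjudge it. Giving the Chinese side its own `"char"` unit avoids that, while plain string input keeps working as before.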
Thanks
I want to filter a parallel corpus for "English-Chinese", but in "LengthFilter" and "LengthRatioFilter" I can only specify one unit type.
Is it possible to configure them like this?