Tokenization behavior in WordAlignFilter

Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit

MIT License


Tokenization behavior in WordAlignFilter #9

Closed: yvesscherrer closed this issue 3 years ago

yvesscherrer commented 3 years ago

I used WordAlignFilter on an untokenized dataset. I expected that I would have to specify a tokenizer for Eflomal to work correctly, but that the final output would remain untokenized. However, the output came out tokenized. Is there a particular motivation for this behavior? I would prefer to avoid a tokenize-detokenize round trip, as it is always somewhat lossy.

svirpioj commented 3 years ago

This indeed seems to be a design mistake. Tokenized output is undesirable, especially when WordAlignFilter is combined with other filters and is not the last one in the pipeline.

The problem can be circumvented by outputting only the scores and then filtering the data based on them, but that is often inconvenient.
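For illustration, the score-based workaround could look roughly like the sketch below. It does not use the OpusFilter API: `filter_by_scores`, the score layout (one tuple of scores per sentence pair), and the rule that a pair is kept when every score is below the threshold are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Pair = Tuple[str, str]


def filter_by_scores(
    pairs: Iterable[Pair],
    scores: Iterable[Sequence[float]],
    threshold: float,
) -> Iterator[Pair]:
    """Yield the raw (untokenized) pairs whose scores all pass the threshold.

    Hypothetical helper: assumes the scores were produced in the same order
    as the pairs, and that lower scores are better.
    """
    for pair, pair_scores in zip(pairs, scores):
        if all(score < threshold for score in pair_scores):
            yield pair


# Toy data: the raw pairs were never tokenized, only the scores were computed
# on a tokenized copy in a separate step.
raw_pairs = [
    ("Hello, world!", "Hallo, Welt!"),
    ("Good morning.", "xyz qwe rty"),
]
alignment_scores = [(0.4, 0.5), (3.2, 2.9)]

kept = list(filter_by_scores(raw_pairs, alignment_scores, threshold=1.0))
print(kept)
```

The inconvenience is visible here: scoring and filtering become two separate steps, and the scores have to be kept aligned with the raw data between them.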

svirpioj commented 3 years ago

Fixed in develop (https://github.com/Helsinki-NLP/OpusFilter/commit/04a4f85d9e5ce428199519623bc5d2542521edde).

yvesscherrer commented 3 years ago

Just curious about the implementation: wouldn't pairs in score() and filter() already contain the raw sentence pairs?

svirpioj commented 3 years ago

Yes, but they may be (and typically are) iterators/generators rather than containers, so the pairs are no longer accessible after the first pass through the data. It would of course be possible to convert them into lists first, but I try not to assume that all the data fits into memory. Thus I decided to store the contents in files.
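A minimal sketch of that idea, not the code from the linked commit: the raw pairs are written to a temporary file during the single pass in which they are tokenized for scoring, and are read back afterwards. The names `score_keeping_raw`, `toy_tokenize`, and `toy_scorer` are hypothetical, and the scorer is a stand-in for the real word aligner.

```python
import json
import re
import tempfile
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]
TokenizedPair = Tuple[List[str], List[str]]


def score_keeping_raw(
    pairs: Iterable[Pair],
    tokenize: Callable[[str], List[str]],
    score_tokenized: Callable[[Iterable[TokenizedPair]], Iterable[float]],
) -> Iterator[Tuple[Pair, float]]:
    """Yield (raw_pair, score) although `pairs` can be consumed only once."""
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as raw_file:

        def tokenized() -> Iterator[TokenizedPair]:
            for src, tgt in pairs:
                # Spool the raw pair to disk before tokenizing it; JSON keeps
                # one pair per line even if the text contains newlines.
                raw_file.write(json.dumps([src, tgt]) + "\n")
                yield tokenize(src), tokenize(tgt)

        # Assumption: the scores (one float per pair) are small enough to hold
        # in memory; they could be spooled to a second file in the same way.
        scores = list(score_tokenized(tokenized()))

        raw_file.seek(0)
        for line, score in zip(raw_file, scores):
            src, tgt = json.loads(line)
            yield (src, tgt), score


def toy_tokenize(text: str) -> List[str]:
    # Splits words from punctuation; a placeholder for a real tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)


def toy_scorer(tokenized_pairs: Iterable[TokenizedPair]) -> Iterator[int]:
    # Placeholder score: difference in token counts, not an alignment score.
    for src_tokens, tgt_tokens in tokenized_pairs:
        yield abs(len(src_tokens) - len(tgt_tokens))


# A generator, so the data can be traversed only once.
pairs = (p for p in [("Hello, world!", "Hallo, Welt!"), ("Yes.", "Ja, das stimmt.")])
results = list(score_keeping_raw(pairs, toy_tokenize, toy_scorer))
print(results)
print(list(pairs))  # the generator is exhausted after the first pass
```

The raw text comes back from the file exactly as it went in, so no detokenization step is needed, and memory use stays independent of the size of the text data.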