Closed yvesscherrer closed 3 years ago
This indeed seems to be some kind of design mistake. Producing tokenized output is undesirable, especially if WordAlignFilter is used together with other filters and is not the last one in the pipeline.
The problem can be circumvented by outputting only the scores and filtering the data based on them, but that is often not very convenient.
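As a rough illustration of the score-based workaround, here is a minimal sketch that is not tied to the OpusFilter API. The function name `filter_by_scores`, the `keep` predicate, and the sample scores are all hypothetical; it assumes the scores were written in the same order as the raw sentence pairs, and whether lower or higher scores are better is left to the caller.

```python
def filter_by_scores(pairs, scores, keep):
    """Yield the raw (untokenized) pairs whose score satisfies `keep`.

    `pairs` and `scores` are assumed to be aligned, equally long streams;
    both are consumed lazily, so nothing has to fit in memory at once.
    """
    for pair, score in zip(pairs, scores):
        if keep(score):
            yield pair


# Hypothetical data: raw pairs and scores computed in a separate run.
raw = [
    ("Hello, world!", "Hallo, Welt!"),
    ("It's fine.", "Voiture rouge."),
    ("Good morning.", "Guten Morgen."),
]
scores = [0.2, 3.5, 0.4]

# Assumption for this example only: a lower score means a better alignment.
kept = list(filter_by_scores(raw, scores, keep=lambda s: s < 1.0))
```

The raw text passes through untouched, so no tok-detok roundtrip is involved; the cost is the extra scoring run and keeping the two streams in sync.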
Just curious about the implementation: wouldn't pairs in score() and filter() already contain the raw sentence pairs?
Yes, but they may be (and typically are) iterators/generators rather than containers, so the pairs are no longer accessible after the first pass through the data. It would of course be possible to convert them into lists first, but I try not to assume that all the data fits into memory. That is why I decided to store the contents in files.
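To make the single-pass issue concrete, here is a small sketch in plain Python. It is not the actual WordAlignFilter code: the helper names `raw_pairs`, `spool_to_disk`, and `read_spooled` are hypothetical, and the JSON-lines temp file is just one possible way to spool the data to disk.

```python
import json
import os
import tempfile


def raw_pairs():
    # Hypothetical stand-in for a streamed corpus: a generator, readable once.
    yield ("Hello, world!", "Hallo, Welt!")
    yield ("It's fine.", "Es ist gut.")


def spool_to_disk(pairs):
    """Write the pairs to a temporary JSON-lines file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf-8"
    ) as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")
        return f.name


def read_spooled(path):
    """Stream the spooled pairs back one at a time, as tuples."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield tuple(json.loads(line))


# A generator is exhausted after one pass.
pairs = raw_pairs()
first_pass = list(pairs)   # two pairs
second_pass = list(pairs)  # empty: nothing is left to read

# Spooling to a file allows any number of passes without holding
# the whole corpus in memory (lists are used here only for the demo).
path = spool_to_disk(raw_pairs())
replay_a = list(read_spooled(path))
replay_b = list(read_spooled(path))
os.remove(path)
```

With this kind of spooling, the raw pairs could in principle be written alongside the tokenized ones and replayed for the final output, at the cost of extra disk space.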
I used WordAlignFilter on an untokenized dataset. My expectation was that I would have to indicate a tokenizer for Eflomal to work correctly, but that the final result would remain untokenized. However, the output came out tokenized. Is there a particular motivation for this behavior? I would find it preferable to avoid a tok-detok roundtrip as this is always a bit lossy.