Closed: tomsbergmanis closed this issue 2 years ago
I cannot replicate the problem. For example, with the following configuration, the produced align.priors
looks tokenized, to the extent that the Moses tokenizer can tokenize it. In any case, the difference is very clear if you leave the *_tokenizer
options out. Can you test this?
If it works, then I'd blame the tokenizer library and/or something odd in your data. (Your en
side looks like Estonian, but I think the en
settings for the tokenizer should mostly work.) If it does not work, I need details on which software versions you are using.
```yaml
common:
  output_directory: work

steps:
  - type: opus_read
    parameters:
      corpus_name: QED
      source_language: fi
      target_language: en
      release: latest
      preprocessing: raw
      src_output: fi.raw.gz
      tgt_output: en.raw.gz

  - type: filter
    parameters:
      inputs: [fi.raw.gz, en.raw.gz]
      outputs: [fi.train.gz, en.train.gz]
      filters:
        - LengthFilter:
            unit: char
            min_length: 10
            max_length: 500
        - LengthRatioFilter:
            unit: char
            threshold: 3

  - type: train_alignment
    parameters:
      src_data: fi.train.gz
      tgt_data: en.train.gz
      parameters:
        src_tokenizer: [moses, fi]
        tgt_tokenizer: [moses, en]
        model: 3
      output: align.priors
```
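As a quick sanity check on whether the output actually looks tokenized, a rough stdlib-only heuristic can help: in Moses-style tokenized text, punctuation is separated from words by spaces, so punctuation glued directly to a word suggests raw text. This is a hypothetical helper for inspection, not part of OpusFilter:

```python
import re

def looks_tokenized(line: str) -> bool:
    """Rough heuristic: Moses-style tokenization separates punctuation
    such as commas and sentence-final periods from words with spaces.
    Returns False if punctuation is glued directly to a word character."""
    glued = re.findall(r"\w[.,!?;:]", line)
    return len(glued) == 0

print(looks_tokenized("Hello , world !"))  # True  (tokenized-looking)
print(looks_tokenized("Hello, world!"))    # False (raw/untokenized)
```

Running this over a sample of lines from both the raw input and the produced priors should make the difference (or its absence) obvious.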
Thanks for your swift answer! I guess it was my mistake.
I used the configuration below to train word alignments,
but when I check align.priors, I see that the data was not tokenized.
This seems like a bug, right?