Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

tokenizer ignored when creating align.priors #30

Closed · tomsbergmanis closed this 2 years ago

tomsbergmanis commented 2 years ago

I use the configuration below to train word alignments:

- type: train_alignment
  parameters:
    src_data: filtered.fi.gz
    tgt_data: filtered.en.gz
    parameters:
      src_tokenizer: [moses, fi]
      tgt_tokenizer: [moses, en]
      model: 3
    output: align.priors

but when I check align.priors, I see that the data was not tokenized:

LEX     !       "Hm!    1
LEX     !       "Misasja!"      2
LEX     !       "New    1
LEX     !       "Tõepoolest!"   1
LEX     !       "ei!    1
LEX     !       "kuninganna!    1
LEX     !       "tõepoolest!"   1

This seems like a bug, right?
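
A minimal sketch for flagging entries like these, assuming align.priors is tab-separated and that LEX lines have the fields type, source token, target token, count (that layout is my reading of the sample above, not something from the OpusFilter docs):

import re

# Punctuation that a Moses-style tokenizer normally splits into separate
# tokens; a token like '"Hm!' therefore suggests untokenized input.
ATTACHED = re.compile(r'\w["!?.,;:]|["!?.,;:]\w')

with open("align.priors", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: type, source token, target token, count
        if len(fields) == 4 and fields[0] == "LEX":
            _, src, tgt, count = fields
            for token in (src, tgt):
                if ATTACHED.search(token):
                    print(f"suspicious token: {token!r} (count {count})")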

svirpioj commented 2 years ago

I cannot replicate the problem. For example, with the following configuration, the produced align.priors looks tokenized, at least as far as the Moses tokenizer can manage. In any case, the difference is very clear if you leave the *_tokenizer options out. Can you test this? (There is also a quick stand-alone tokenizer check after the configuration below.)

If it works, then I'd blame the tokenizer library and/or something odd in your data. (Your en side looks like Estonian, but I think the en settings for the tokenizer should mostly still work.) If it does not, I need details on which software versions you are using.

common:
  output_directory: work

steps:
- type: opus_read
  parameters:
    corpus_name: QED
    source_language: fi
    target_language: en
    release: latest
    preprocessing: raw
    src_output: fi.raw.gz
    tgt_output: en.raw.gz

- type: filter
  parameters:
    inputs: [fi.raw.gz, en.raw.gz]
    outputs: [fi.train.gz, en.train.gz]
    filters:
      - LengthFilter:
          unit: char
          min_length: 10
          max_length: 500
      - LengthRatioFilter:
          unit: char
          threshold: 3

- type: train_alignment
  parameters:
    src_data: fi.train.gz
    tgt_data: en.train.gz
    parameters:
      src_tokenizer: [moses, fi]
      tgt_tokenizer: [moses, en]
      model: 3
    output: align.priors
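
To test the tokenizer itself outside OpusFilter, here is a sketch using the sacremoses package as a stand-in (OpusFilter may wrap a different Moses implementation internally, so this only approximates what train_alignment does; and if I recall the CLI correctly, the configuration above runs with opusfilter config.yaml):

from sacremoses import MosesTokenizer

# Stand-in check of Moses-style tokenization; escape=False keeps plain
# quotes instead of escaping them to &quot; entities.
mt = MosesTokenizer(lang="fi")
print(mt.tokenize('"Tõepoolest!"', escape=False))
# Quotes and the exclamation mark should be split off, giving roughly
# ['"', 'Tõepoolest', '!', '"']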

tomsbergmanis commented 2 years ago

Thanks for your swift answer! I guess it was my mistake.