Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Possible bug in word_alignment accept function #31

Closed: tomsbergmanis closed this issue 2 years ago

tomsbergmanis commented 2 years ago

Could it be that:

    def accept(self, score):
        return score[0] < self.src_threshold and score[1] < self.tgt_threshold

should be

    def accept(self, score):
        return score[0] > self.src_threshold and score[1] > self.tgt_threshold

instead? The current implementation keeps only the sentence pairs that have both scores lower than the thresholds. https://github.com/Helsinki-NLP/OpusFilter/blob/37469ef18afd2d0eaad0bd311290d82f9995aae8/opusfilter/word_alignment.py#L146

svirpioj commented 2 years ago

Unfortunately eflomal does not document well what the scores actually are, but based on my earlier experiments as well as the unit tests with artificial data, they are some kind of costs (e.g. unnormalized negative log-probs) and thus lower value means more probable alignment. If you write the scores for real parallel data and some arbitrary sentence pairs, you should see this quite clearly. (If you see the opposite, then there must be something seriously wrong!)
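Under the cost interpretation, the original comparison is correct; a minimal stand-in sketch (the threshold values and example scores are taken from this thread, not OpusFilter defaults):

```python
# Minimal stand-in for WordAlignFilter.accept, assuming eflomal scores
# are alignment costs (lower value = more probable alignment).
def accept(score, src_threshold=0.0, tgt_threshold=0.0):
    # Keep a pair only when BOTH costs are below the thresholds.
    return score[0] < src_threshold and score[1] < tgt_threshold

# Valid pair from the examples below: low (negative) costs -> kept.
print(accept((-1.51689, -1.50699)))  # True
# Poorly aligned pair: high costs -> rejected.
print(accept((4.22615, 4.99098)))    # False
```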

tomsbergmanis commented 2 years ago

Here are two examples:

-1.51689    -1.50699    This is the history of the generations of Isaac, Abraham's son. Abraham became the father of Isaac. Ja tämä on kertomus Iisakin, Aabrahamin pojan, suvusta. Aabrahamille syntyi Iisak.
4.22615 4.99098 When her days to be delivered were fulfilled, behold, there were twins in her womb. Kun hänen synnyttämisensä aika oli tullut, katso, hänen kohdussaan oli kaksoiset.

As far as I can tell, the higher scores go to sentence pairs that are better translations of each other. You can find more examples here: en-fi-bibel-eflomal-scores.txt. This is how I got the scores:

common:
  output_directory: work-en-fi
steps:
- type: opus_read
  parameters:
    corpus_name: bible-uedin
    source_language: en
    target_language: fi
    release: latest
    preprocessing: raw
    src_output: en.raw.gz
    tgt_output: fi.raw.gz
- type: filter
  parameters:
    inputs: [en.raw.gz, fi.raw.gz]
    outputs: [en.train.gz, fi.train.gz]
    filters:
      - LongWordFilter: {}
      - LengthFilter:
          name: char
          unit: char
      - LengthFilter:
          name: word
          unit: word
      - LengthRatioFilter:
          name: char
          unit: char
      - LengthRatioFilter:
          name: word
          unit: word
      - CharacterScoreFilter:
          scripts: [Latin, Latin]
      - LanguageIDFilter:
          name: langid
          id_method: langid
          languages: [en, fi]
      - LanguageIDFilter:
          name: cld2
          id_method: cld2
          languages: [en, fi]
      - TerminalPunctuationFilter: {}
      - NonZeroNumeralsFilter: {}

- type: train_alignment
  parameters:
    src_data: en.train.gz
    tgt_data: fi.train.gz
    parameters:
      src_tokenizer: [moses, en]
      tgt_tokenizer: [moses, fi]
      model: 3
    output: align.priors

- type: score
  parameters:
    inputs: [en.train.gz, fi.train.gz]
    #outputs: [en.final.gz, fi.final.gz]
    output: en-fi.scored
    filters:
      - WordAlignFilter:
          src_tokenizer: [moses, en]
          tgt_tokenizer: [moses, fi]
          model: 3
          priors: align.priors
          src_threshold: 0
          tgt_threshold: 0
svirpioj commented 2 years ago

In your two examples, both are valid translations. Picking out a couple of false matches from your file:

3.76433 7.3568  "It happened on the eighth day, that Moses called Aaron and his sons, and the elders of Israel;"        Kaikki esipihan ympärysverhot yltympäri olivat kerratuista valkoisista pellavalangoista,
3.25579 6.72012 But as for you, your dead bodies shall fall in this wilderness. mullikan, oinaan ja vuoden vanhan karitsan polttouhriksi,

Here's a configuration that calculates scores for a sample of real pairs and a sample of shuffled pairs:

common:
  output_directory: work

steps:
- type: opus_read
  parameters:
    corpus_name: QED
    source_language: fi
    target_language: en
    release: latest
    preprocessing: raw
    src_output: fi.raw.gz
    tgt_output: en.raw.gz

- type: filter
  parameters:
    inputs: [fi.raw.gz, en.raw.gz]
    outputs: [fi.train.gz, en.train.gz]
    filters:
      - LengthFilter:
          unit: char
          min_length: 10
          max_length: 500
      - LengthRatioFilter:
          unit: char
          threshold: 3

- type: train_alignment
  parameters:
    src_data: fi.train.gz
    tgt_data: en.train.gz
    parameters:
      src_tokenizer: [moses, fi]
      tgt_tokenizer: [moses, en]
      model: 3
    output: align.priors

- type: filter
  parameters:
    inputs: [fi.raw.gz, en.raw.gz]
    outputs: [fi.all.gz, en.all.gz]
    filters:
      - LengthFilter:
          unit: char
          min_length: 1

- type: subset
  parameters:
    inputs: [fi.all.gz, en.all.gz]
    outputs: [fi.matched.gz, en.matched.gz]
    size: 1000
    seed: 456
    shuffle_subset: false

- type: subset
  parameters:
    inputs: [fi.all.gz, en.all.gz]
    outputs: [fi.shuffled.gz, en.shuffled.gz]
    size: 1000
    seed: 123
    shuffle_subset: true

- type: score
  parameters:
    inputs: [fi.matched.gz, en.matched.gz]
    output: align_score_matched.jsonl
    filters:
      - WordAlignFilter:
          src_tokenizer: [moses, fi]
          tgt_tokenizer: [moses, en]
          src_threshold: 0 
          tgt_threshold: 0
          model: 3
          priors: align.priors

- type: score
  parameters:
    inputs: [fi.shuffled.gz, en.shuffled.gz]
    output: align_score_shuffled.jsonl
    filters:
      - WordAlignFilter:
          src_tokenizer: [moses, fi]
          tgt_tokenizer: [moses, en]
          src_threshold: 0 
          tgt_threshold: 0
          model: 3
          priors: align.priors

Comparing the scores:

$ opusfilter-scores describe work/align_score_matched.jsonl 
1000it [00:00, 234988.18it/s]
# WordAlignFilter.0
count     1000.000000
mean         0.366933
std          4.204393
min        -11.664400
0.01%      -11.377231
0.1%        -8.792715
1%          -7.508254
5%          -6.227247
10%         -5.112068
25%         -2.590127
50%          0.272069
75%          3.001530
90%          5.942348
95%          7.685728
99%         10.797666
99.9%       14.631381
99.99%      15.603398
max         15.711400
Name: WordAlignFilter.0, dtype: float64

# WordAlignFilter.1
count     1000.000000
mean        -1.479778
std          3.016303
min         -9.402040
0.01%       -9.377436
0.1%        -9.155996
1%          -6.339761
5%          -5.234373
10%         -4.551340
25%         -3.519108
50%         -2.046000
75%         -0.091845
90%          2.308686
95%          4.135638
99%          8.436658
99.9%       14.444897
99.99%      16.600310
max         16.839800
Name: WordAlignFilter.1, dtype: float64

$ opusfilter-scores describe work/align_score_shuffled.jsonl 
1000it [00:00, 233289.06it/s]
# WordAlignFilter.0
count     1000.000000
mean         7.850059
std          3.340384
min         -3.070470
0.01%       -2.967239
0.1%        -2.038163
1%           0.870101
5%           3.091992
10%          3.799976
25%          5.462295
50%          7.580890
75%          9.959095
90%         12.322660
95%         13.593860
99%         16.328540
99.9%       18.277605
99.99%      18.821380
max         18.881800
Name: WordAlignFilter.0, dtype: float64

# WordAlignFilter.1
count     1000.000000
mean         7.345913
std          3.699149
min         -2.910850
0.01%       -2.817790
0.1%        -1.980252
1%           1.078015
5%           2.852028
10%          3.621581
25%          4.992775
50%          6.766405
75%          8.896027
90%         11.586730
95%         13.849415
99%         18.407502
99.9%       41.865603
99.99%      42.407400
max         42.467600
Name: WordAlignFilter.1, dtype: float64

The shuffled data clearly has higher scores on average.

tomsbergmanis commented 2 years ago

Ok, this is convincing. The weird results I got were probably due to the small corpus then. Sorry about the bother and thanks for explaining! :)

svirpioj commented 2 years ago

No problem! As you have noticed, the eflomal scores are not very intuitive: there is a lot of variation also for valid translation pairs (as well as for random pairs). I maybe wouldn't use it alone for filtering, but as one classification feature it's useful. If you do, at least do not trust the default threshold (0), but check your data for a reasonable value.
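One way to pick such a data-driven threshold is to take a high percentile of the costs on pairs you trust, so that e.g. 95% of them would still pass. A sketch using a simple nearest-rank percentile (the cut-off value is an arbitrary illustration, not a recommendation):

```python
def percentile_threshold(scores, pct=95):
    """Return the value below which roughly pct percent of scores fall.

    With cost-like scores, thresholding at the 95th percentile of
    trusted pairs keeps about 95% of them.
    """
    ordered = sorted(scores)
    # Nearest-rank percentile: index of the pct-th ranked value.
    idx = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

# Toy costs; on real data, use the scores from a `score` step instead.
costs = [-6.2, -2.6, 0.3, 3.0, 7.7]
print(percentile_threshold(costs, 80))  # 3.0
```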