Unfortunately, eflomal does not document well what the scores actually are, but based on my earlier experiments as well as unit tests with artificial data, they are some kind of costs (e.g. unnormalized negative log-probabilities), so a lower value means a more probable alignment. If you compute the scores for real parallel data and for some arbitrary sentence pairs, you should see this quite clearly. (If you see the opposite, something must be seriously wrong!)
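For intuition (this is my interpretation, not something eflomal documents): if a score s is an unnormalized negative log-probability, the probability of the alignment is proportional to exp(-s), so a pair scored 1.0 has a more probable alignment than a pair scored 4.0.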
Here are two examples:
-1.51689 -1.50699 This is the history of the generations of Isaac, Abraham's son. Abraham became the father of Isaac. Ja tämä on kertomus Iisakin, Aabrahamin pojan, suvusta. Aabrahamille syntyi Iisak.
4.22615 4.99098 When her days to be delivered were fulfilled, behold, there were twins in her womb. Kun hänen synnyttämisensä aika oli tullut, katso, hänen kohdussaan oli kaksoiset.
As far as I can tell, the higher scores go to sentence pairs that are better translations of each other. You can find other examples here: en-fi-bibel-eflomal-scores.txt. This is how I got the scores:
common:
  output_directory: work-en-fi

steps:
  - type: opus_read
    parameters:
      corpus_name: bible-uedin
      source_language: en
      target_language: fi
      release: latest
      preprocessing: raw
      src_output: en.raw.gz
      tgt_output: fi.raw.gz
  - type: filter
    parameters:
      inputs: [en.raw.gz, fi.raw.gz]
      outputs: [en.train.gz, fi.train.gz]
      filters:
        - LongWordFilter: {}
        - LengthFilter:
            name: char
            unit: char
        - LengthFilter:
            name: word
            unit: word
        - LengthRatioFilter:
            name: char
            unit: char
        - LengthRatioFilter:
            name: word
            unit: word
        - CharacterScoreFilter:
            scripts: [Latin, Latin]
        - LanguageIDFilter:
            name: langid
            id_method: langid
            languages: [en, fi]
        - LanguageIDFilter:
            name: cld2
            id_method: cld2
            languages: [en, fi]
        - TerminalPunctuationFilter: {}
        - NonZeroNumeralsFilter: {}
  - type: train_alignment
    parameters:
      src_data: en.train.gz
      tgt_data: fi.train.gz
      parameters:
        src_tokenizer: [moses, en]
        tgt_tokenizer: [moses, fi]
        model: 3
      output: align.priors
  - type: score
    parameters:
      inputs: [en.train.gz, fi.train.gz]
      #outputs: [en.final.gz, fi.final.gz]
      output: en-fi.scored
      filters:
        - WordAlignFilter:
            src_tokenizer: [moses, en]
            tgt_tokenizer: [moses, fi]
            model: 3
            priors: align.priors
            src_threshold: 0
            tgt_threshold: 0
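(A configuration like this is run with the opusfilter command, e.g. opusfilter en-fi.yaml if the file is saved under that name.) To eyeball the score direction on the output, a minimal Python sketch along these lines works; it assumes that en-fi.scored has one JSON object per line with a WordAlignFilter key holding the two per-pair scores (the WordAlignFilter.0 and WordAlignFilter.1 columns that opusfilter-scores reports below), and that its line order matches the input files:

import gzip
import json

# Read the tokenized input sentences that were scored.
with gzip.open('work-en-fi/en.train.gz', 'rt', encoding='utf-8') as fobj:
    src = [line.rstrip('\n') for line in fobj]
with gzip.open('work-en-fi/fi.train.gz', 'rt', encoding='utf-8') as fobj:
    tgt = [line.rstrip('\n') for line in fobj]

# Read the two score columns written by the score step (assumed format).
scores = []
with open('work-en-fi/en-fi.scored', encoding='utf-8') as fobj:
    for line in fobj:
        scores.append(json.loads(line)['WordAlignFilter'])

# Rank the pairs by the sum of the two alignment costs.
ranked = sorted(zip(scores, src, tgt), key=lambda item: sum(item[0]))

print('Lowest costs (should be the most probable alignments):')
for (s1, s2), en, fi in ranked[:3]:
    print(f'{s1:.3f}\t{s2:.3f}\t{en}\t{fi}')

print('Highest costs (should be the least probable alignments):')
for (s1, s2), en, fi in ranked[-3:]:
    print(f'{s1:.3f}\t{s2:.3f}\t{en}\t{fi}')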
In your two examples, both are valid translations. Picking out a couple of false matches from your file:
3.76433 7.3568 "It happened on the eighth day, that Moses called Aaron and his sons, and the elders of Israel;" Kaikki esipihan ympärysverhot yltympäri olivat kerratuista valkoisista pellavalangoista,
3.25579 6.72012 But as for you, your dead bodies shall fall in this wilderness. mullikan, oinaan ja vuoden vanhan karitsan polttouhriksi,
Here's a configuration that calculates scores for a sample of real pairs and a sample of shuffled pairs:
common:
  output_directory: work

steps:
  - type: opus_read
    parameters:
      corpus_name: QED
      source_language: fi
      target_language: en
      release: latest
      preprocessing: raw
      src_output: fi.raw.gz
      tgt_output: en.raw.gz
  - type: filter
    parameters:
      inputs: [fi.raw.gz, en.raw.gz]
      outputs: [fi.train.gz, en.train.gz]
      filters:
        - LengthFilter:
            unit: char
            min_length: 10
            max_length: 500
        - LengthRatioFilter:
            unit: char
            threshold: 3
  - type: train_alignment
    parameters:
      src_data: fi.train.gz
      tgt_data: en.train.gz
      parameters:
        src_tokenizer: [moses, fi]
        tgt_tokenizer: [moses, en]
        model: 3
      output: align.priors
  - type: filter
    parameters:
      inputs: [fi.raw.gz, en.raw.gz]
      outputs: [fi.all.gz, en.all.gz]
      filters:
        - LengthFilter:
            unit: char
            min_length: 1
  - type: subset
    parameters:
      inputs: [fi.all.gz, en.all.gz]
      outputs: [fi.matched.gz, en.matched.gz]
      size: 1000
      seed: 456
      shuffle_subset: false
  - type: subset
    parameters:
      inputs: [fi.all.gz, en.all.gz]
      outputs: [fi.shuffled.gz, en.shuffled.gz]
      size: 1000
      seed: 123
      shuffle_subset: true
  - type: score
    parameters:
      inputs: [fi.matched.gz, en.matched.gz]
      output: align_score_matched.jsonl
      filters:
        - WordAlignFilter:
            src_tokenizer: [moses, fi]
            tgt_tokenizer: [moses, en]
            src_threshold: 0
            tgt_threshold: 0
            model: 3
            priors: align.priors
  - type: score
    parameters:
      inputs: [fi.shuffled.gz, en.shuffled.gz]
      output: align_score_shuffled.jsonl
      filters:
        - WordAlignFilter:
            src_tokenizer: [moses, fi]
            tgt_tokenizer: [moses, en]
            src_threshold: 0
            tgt_threshold: 0
            model: 3
            priors: align.priors
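Assuming the configuration above is saved as qed-en-fi.yaml (the name is arbitrary), running

$ opusfilter qed-en-fi.yaml

writes align_score_matched.jsonl and align_score_shuffled.jsonl into the work directory.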
Comparing the scores:
$ opusfilter-scores describe work/align_score_matched.jsonl
1000it [00:00, 234988.18it/s]
# WordAlignFilter.0
count 1000.000000
mean 0.366933
std 4.204393
min -11.664400
0.01% -11.377231
0.1% -8.792715
1% -7.508254
5% -6.227247
10% -5.112068
25% -2.590127
50% 0.272069
75% 3.001530
90% 5.942348
95% 7.685728
99% 10.797666
99.9% 14.631381
99.99% 15.603398
max 15.711400
Name: WordAlignFilter.0, dtype: float64
# WordAlignFilter.1
count 1000.000000
mean -1.479778
std 3.016303
min -9.402040
0.01% -9.377436
0.1% -9.155996
1% -6.339761
5% -5.234373
10% -4.551340
25% -3.519108
50% -2.046000
75% -0.091845
90% 2.308686
95% 4.135638
99% 8.436658
99.9% 14.444897
99.99% 16.600310
max 16.839800
Name: WordAlignFilter.1, dtype: float64
$ opusfilter-scores describe work/align_score_shuffled.jsonl
1000it [00:00, 233289.06it/s]
# WordAlignFilter.0
count 1000.000000
mean 7.850059
std 3.340384
min -3.070470
0.01% -2.967239
0.1% -2.038163
1% 0.870101
5% 3.091992
10% 3.799976
25% 5.462295
50% 7.580890
75% 9.959095
90% 12.322660
95% 13.593860
99% 16.328540
99.9% 18.277605
99.99% 18.821380
max 18.881800
Name: WordAlignFilter.0, dtype: float64
# WordAlignFilter.1
count 1000.000000
mean 7.345913
std 3.699149
min -2.910850
0.01% -2.817790
0.1% -1.980252
1% 1.078015
5% 2.852028
10% 3.621581
25% 4.992775
50% 6.766405
75% 8.896027
90% 11.586730
95% 13.849415
99% 18.407502
99.9% 41.865603
99.99% 42.407400
max 42.467600
Name: WordAlignFilter.1, dtype: float64
The shuffled data clearly has higher scores on average.
Ok. This is convincing. The weird results I got were probably due to the small corpus, then. Sorry about the bother and thanks for explaining! :)
No problem! As you have noticed, the eflomal scores are not very intuitive, and there's a lot of variation also for valid translation pairs (as well as for the random pairs). I maybe wouldn't use it alone for filtering, but it's useful as one classification feature. If you do use it for filtering, at least do not trust the default threshold (0), but check the data for a reasonable value.
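As a concrete starting point, one can derive the threshold from the score distribution of known-good pairs. A minimal sketch, under the same assumptions about the jsonl format as above and with an arbitrary 90% quantile choice:

import json

def load_scores(path):
    # Assumes one JSON object per line with a 'WordAlignFilter' key
    # holding the two per-pair scores.
    with open(path, encoding='utf-8') as fobj:
        return [json.loads(line)['WordAlignFilter'] for line in fobj]

matched = load_scores('work/align_score_matched.jsonl')
shuffled = load_scores('work/align_score_shuffled.jsonl')

# Candidate threshold: the 90% quantile of the real pairs' first score
# column, so that roughly 90% of the real pairs would be kept. The
# quantile is an arbitrary choice and should be tuned on real data.
col0 = sorted(scores[0] for scores in matched)
threshold = col0[int(0.9 * len(col0))]

def kept(pairs, threshold):
    # Fraction of pairs accepted. A real filter would check both score
    # columns; this sketch uses only the first for brevity.
    return sum(scores[0] < threshold for scores in pairs) / len(pairs)

print(f'threshold: {threshold:.2f}')
print(f'matched pairs kept:  {kept(matched, threshold):.1%}')
print(f'shuffled pairs kept: {kept(shuffled, threshold):.1%}')

A value found this way would then go into the src_threshold and tgt_threshold parameters of WordAlignFilter in an actual filter step.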
Could it be that:
should be
instead? The current implementation filters out all sentence pairs that have both scores lower than the threshold. https://github.com/Helsinki-NLP/OpusFilter/blob/37469ef18afd2d0eaad0bd311290d82f9995aae8/opusfilter/word_alignment.py#L146