browsermt / bergamot-translator

Cross-platform C++ library focused on optimized machine translation on consumer-grade devices.
http://browser.mt
Mozilla Public License 2.0

Different LogProb for the same BPE token w.r.t. Bergamot vs. Marian-Scorer #244

Closed abarbosa94 closed 2 years ago

abarbosa94 commented 2 years ago

We did some analysis comparing the score outputs of bergamot against the ones returned by marian-scorer, and we noticed that the differences are non-negligible when we increase the batch size. The workflow was the following:

Single-line inference:

[image]

[image]

Batch inference

Configuration files

Bergamot

bergamot-mode: native
ssplit-mode: sentence 
relative-paths: true
models: 
  - model.intgemm.alphas.bin
vocabs:
  - vocab.eten.spm
  - vocab.eten.spm
beam-size: 1
mini-batch-words: 128
skip-cost: false

Marian scorer

relative-paths: true
model:  model.intgemm.alphas.bin
vocabs:
  - vocab.eten.spm
  - vocab.eten.spm
word-scores: true
mini-batch-words: 128

Sample sentences

The sentences below are the ones where we noticed the largest differences:

Campbell 's Stores comprise eleven gable fronted , three storey high rectangular plan bays .
While imitating a Southern drawl , Biden remarked " I was in a caucus with James O. Eastland .
The North American river otters favor bog lakes with banked shores containing semiaquatic mammal burrows and lakes with beaver lodges .
The [ [ Bandy Federation of India ] ] governs bandy in [ [ India ] ] .
XapaJIaMnu commented 2 years ago

First, are you using a shortlist? We recently discovered that the shortlist exacerbates this problem, and we're working on ameliorating it. Second, this is a known issue: with batching, sentences are reordered, and that sometimes changes the results of the GEMMs, which means that some subtle differences can cascade.
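
As a rough illustration of the cascading point (a generic sketch, not code from either project): floating-point addition is not associative, so the same dot product accumulated in a different order can land on slightly different values, and batching/reordering changes how the GEMMs tile and accumulate.

import numpy as np

# Assumed example, not bergamot/marian code: summing the same products in a
# different order changes the low-order bits in float32; such tiny differences
# can then cascade through the rest of the network.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
w = rng.standard_normal(4096).astype(np.float32)

forward = np.float32(0.0)
for a, b in zip(x, w):
    forward += a * b          # accumulate left to right

reverse = np.float32(0.0)
for a, b in zip(x[::-1], w[::-1]):
    reverse += a * b          # same terms, opposite order

print(forward, reverse, float(forward) - float(reverse))  # usually differs slightly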

Finally, could you output the sources and the two different targets for the most different sentences (out of curiosity)?

jerinphilip commented 2 years ago

@abarbosa94 Thanks for the detailed information. Could you provide the commit-hash/versions of both marian-dev (assuming this is browsermt/marian-dev) and bergamot-translator that you worked with to generate these?

Second, this is a known issue: with batching, sentences are reordered, and that sometimes changes the results of the GEMMs, which means that some subtle differences can cascade.

I'm not very familiar with marian-scorer, but if I remember correctly you mentioned you're using a comparable setting, with shortlists off in both cases, which seems to be available? Could you confirm this is the case?

I remember the primary query was a disparity in batched versus single-sample scores. However, I do not find a plot comparing the two for bergamot-translator and marian-scorer independently, i.e., batched-vs-single-sample scores for (1) bergamot-translator, (2) marian-dev. Is this not a concern at this point?

bergamot-translator and marian-scorer do not batch identically. This leads to floating-point approximation differences cascading, as Nick has mentioned, generating different outputs and probability values in each. I.e., I would not expect marian-scorer and bergamot-translator to provide the same output or logprobs for a given input sequence.

Finally, could you output the sources and the two different targets for the most different sentences (out of curiosity)?

+1 on the above request, if possible.

If there is a strong requirement for matching these closely (instead of using some robustness strategies against variations), you might have to discard marian-scorer and use bergamot-translator to generate the tokens and logprobs you train on. I think this will end up being incompatible with your human-annotated training data (cc @mfomicheva). I will also suggest making the QE system robust by applying some adversarial distortions and that sort of thing (thinking of bergamot-translator as an adversary who might slightly corrupt inputs to your system).
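
A hedged sketch of the distortion idea, assuming the QE system consumes per-token log-probs as features (the function name, noise scale, and example values are illustrative, not from the thread):

import numpy as np

def distort_logprobs(logprobs, scale=0.1, rng=None):
    # Jitter per-token log-probs during QE training so the model does not
    # over-fit to the exact values one particular decoder/batching produces.
    # The Gaussian noise scale is an assumption; tune it to the observed gaps.
    rng = rng or np.random.default_rng()
    noisy = np.asarray(logprobs, dtype=np.float32) + rng.normal(0.0, scale, size=len(logprobs))
    return np.minimum(noisy, 0.0)  # keep log-probs <= 0

example = [-0.002, -0.861, -0.649, -1.143]   # illustrative feature vector
print(distort_logprobs(example, rng=np.random.default_rng(1)))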

Copying @graemenail who's involved with a few shortlist manipulations to help QE.

abarbosa94 commented 2 years ago

I'm not very familiar with marian-scorer, but if I remember correctly you mentioned you're using a comparable setting, with shortlists off in both cases, which seems to be available? Could you confirm this is the case?

Yes, we disabled the shortlist in both cases.

Could you provide the commit-hash/versions of both marian-dev (assuming this is browsermt/marian-dev) and bergamot-translator that you worked with to generate these?

Bergamot : https://github.com/browsermt/bergamot-translator/tree/1231057c36afe3e297a55bfdc58dc5bb559591d9 Marian-dev : https://github.com/browsermt/marian-dev/tree/62bac858bfd37060beb707d12eb9711649ea4cf6

Finally, could you output the sources and the two different targets for the most different sentences (out of curiosity)?

The top 4 sentences with the highest differences, shown as the original, the single-line inference output, and the batch inference output:

[images]

jerinphilip commented 2 years ago

@abarbosa94 There seem to be differences in tokens between single-line and batch inference. The differences are expected, and there is no way we can get rid of them.

[image]

I'm surprised as to how, in such cases, you are matching scores to different tokens as provided in the first figure? (Edit: never mind, it's between marian-scorer and bergamot.) The score differences can have a cascading effect on the tokens to the right of where the difference starts.

In any case, there is no solution to the batch-size disparity. Note that while mini-batch-words in marian-scorer and bergamot-translator are conceptually similar, they are totally different implementations, and the wrapping and sentence splitting in bergamot almost entirely implies that you will never be able to match a batch in bergamot-translator to a batch in marian-scorer through the limited CLI configurability you have.
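
To make the "conceptually similar but different implementations" point concrete, here is a toy sketch (assumptions only, not either tool's actual batching code) of how two greedy policies that both respect a mini-batch-words budget can still form different batches from the same sentences, e.g. if one of them sorts by length first:

def batch_by_words(sentences, budget=128, sort_by_length=False):
    # Greedily pack sentences until the word budget would be exceeded.
    if sort_by_length:
        sentences = sorted(sentences, key=len)
    batches, current, words = [], [], 0
    for sent in sentences:
        if current and words + len(sent) > budget:
            batches.append(current)
            current, words = [], 0
        current.append(sent)
        words += len(sent)
    if current:
        batches.append(current)
    return batches

corpus = [["tok"] * n for n in (90, 20, 70, 40, 60)]                  # toy sentence lengths
print([len(b) for b in batch_by_words(corpus)])                        # [2, 2, 1]
print([len(b) for b in batch_by_words(corpus, sort_by_length=True)])   # [3, 1, 1]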

This is a won't-fix (it's too much effort to try to match marian-scorer and bergamot-translator for this purpose, and not worth the benefits), and I strongly recommend you look towards the following:

I will also suggest making the QE system robust by doing some adversarial distortions and that sort (thinking bergamot-translator as an adversary who might slightly corrupt inputs to your system).

kpu commented 2 years ago

We should be turning up the lexical shortlist size. It's too small for a single sentence. Won't make it constant though.

mfomicheva commented 2 years ago

Just to make sure we are on the same page. @jerinphilip, the problem is not that we get different translations from bergamot-translator vs. from marian-dev, or that we get different translations depending on the batch size. That's totally fine. The issue that worries me is that we get very different log-probs for the exact same translations. @abarbosa94 will provide a concrete example to illustrate this.

jerinphilip commented 2 years ago

The issue that worries me is that we get very different log-probs for the exact same translations.

Let me ask this: if you get variations with batching, why is the possibility that the above is the case surprising?

Think of a two-class classification case. If matrix multiplications can create variations like 0.5001 vs. 0.4999 with a decision-rule threshold of 0.5 (thus creating a difference in tokens), is it not possible to have a difference of 0.5001 vs. 0.5011 (i.e., same token, different probabilities)?
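
Spelled out with the numbers quoted above (nothing beyond them is implied):

import math

p_single, p_batch = 0.5001, 0.5011            # same winning class in both runs
print(p_single > 0.5, p_batch > 0.5)          # True True -> identical output token
print(math.log(p_single), math.log(p_batch))  # but slightly different log-probs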

Note that the errors also have a compounding effect across time-steps (so they can become large enough), but they can still potentially lead to the same tokens.

y'_{t} = argmax_{y_{t}} p(y_{t} | encoder-representations, y_{<t}; \theta)

(@graemenail: can you please help me with the expected error here as decoding moves from t = 1 to t = N? I guess if what @abarbosa94 shows as a concrete example is within the range of error we may expect, then the underlying system is okay. The error stems from the floating-point representation bit bottleneck, I think, which is analogous to the precision of measurements; I'm unsure how to model it.)
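
One rough way to get a feel for the per-step error is a Monte-Carlo sketch under assumed scales; the vocabulary size, logit scale, and perturbation size below are guesses, not measurements of bergamot-translator or marian-scorer, and the feedback through y_{<t} that makes errors compound is not modelled here.

import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

rng = np.random.default_rng(0)
vocab, steps, eps = 1000, 40, 0.1               # assumed sizes and logit noise
max_diff, flips = 0.0, 0
for t in range(steps):
    logits = rng.standard_normal(vocab) * 5.0       # stand-in decoder logits
    noisy = logits + rng.normal(0.0, eps, vocab)    # e.g. a different GEMM path
    tok = int(logits.argmax())
    flips += int(noisy.argmax() != tok)
    max_diff = max(max_diff, abs(log_softmax(logits)[tok] - log_softmax(noisy)[tok]))

print(f"argmax flips: {flips}/{steps}, largest same-token log-prob diff: {max_diff:.4f}")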

kpu commented 2 years ago

Let's turn up the shortlist size to at least 200. That will decrease the noise in the scores and probably fix some of the quality issues we see when decoding only one sentence.

kpu commented 2 years ago

Jerin has pointed out my reading scores are poor and it's said above "Yes, we disabled the shortlist in both cases." Can you confirm the exact configuration you ran with, especially the int8 settings?

abarbosa94 commented 2 years ago

I guess if what @abarbosa94 shows as a concrete example is within the range of error we may expect, then the underlying system is okay.

I have chosen four sentences:

We then generated Estonian translations through bergamot. Let's pick the translated sentence for "The North American river otters favor bog lakes with banked shores containing semiaquatic mammal burrows and lakes with beaver lodges", which resulted in "Põhja-Ameerika jõe otterid eelistavad kaldaga järvi, kus on poolveelised imetajate urtsad ja järved koos peekoniga".

If we pick the bpe tokens with their respective log probabilities, we have the following:

[bpe]:(Põhja)(-)(Ameerika)( jõe)( )(ott)(er)(id)( eelistavad)( ka)(lda)(ga)( )(jä)(rvi)(,)( kus)( on)( pool)(ve)(e)(lised)( ime)(tajate)( )(ur)(tsa)(d)( ja)( järve)(d)( koos)( pe)(e)(kon)(iga)(.)()
[logProbs]:-0.00221 -0.00040 -0.02741 -0.02908 -1.11370 -0.00602 -0.86121 -0.00420 -0.64900 -0.73534 -1.14354 -1.14991 -0.66166 -0.17526 -0.04825 -0.07876 -0.66660 -0.25786 -0.13927 -0.11325 -0.77636 -1.88491 -0.02620 -0.87077 -0.58120 -0.56390 -0.34163 -0.00735 -0.01723 -0.13466 -0.00309 -0.76170 -0.68970 -1.21513 -0.57574 -0.44229 -0.03489 -0.00002

If we take the same bpe tokens and use them as input for marian-scorer, we have the following:

[bpe]:(▁Põhja)(-)(Ameerika)(▁jõe)(▁)(ott)(er)(id)(▁eelistavad)(▁ka)(lda)(ga)(▁)(jä)(rvi)(,)(▁kus)(▁on)(▁pool)(ve)(e)(lised)(▁ime)(tajate)(▁)(ur)(tsa)(d)(▁ja)(▁järve)(d)(▁koos)(▁pe)(e)(kon)(iga)(.)(</s>)
[logProbs]:-0.00470 -0.00012 -0.04183 -0.02197 -1.56387 -0.02575 -1.56785 -0.01642 -1.53254 -0.67023 -1.32163 -0.85075 -0.60641 -0.08832 -0.00423 -0.09008 -0.68191 -0.16601 -0.15247 -0.06087 -0.38883 -2.52734 -0.03363 -1.05277 -0.47952 -0.50283 -0.20823 -0.00762 -0.03586 -0.23022 -0.00315 -1.07891 -0.56034 -1.18140 -0.42146 -0.56498 -0.06203 -0.00001 

As you can see, the same bpe tokens are used. However, some tokens have a considerable difference in values. For instance, the bergamot log-probability for the token er (position 6) was -0.8612, whereas in marian the same token had a value of -1.5678. We want to know whether this 0.7 difference is expected or if we can do something about it.
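
For reference, a small sketch of how the gap can be quantified from the two score lines above; only the first 8 values of each list are repeated here for brevity, and the rest can be pasted in the same way:

# Per-token comparison of the two log-prob lists quoted above (first 8 values).
bergamot = [-0.00221, -0.00040, -0.02741, -0.02908, -1.11370, -0.00602, -0.86121, -0.00420]
marian   = [-0.00470, -0.00012, -0.04183, -0.02197, -1.56387, -0.02575, -1.56785, -0.01642]

diffs = [abs(a - b) for a, b in zip(bergamot, marian)]
worst = max(range(len(diffs)), key=diffs.__getitem__)
print(f"max per-token diff {diffs[worst]:.4f} at token index {worst}")     # ~0.71 at the 'er' token
print(f"sentence-level: bergamot {sum(bergamot):.4f} vs marian {sum(marian):.4f}")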

graemenail commented 2 years ago

@abarbosa94 are the model and vocab publically available somewhere?

abarbosa94 commented 2 years ago

@graemenail we used the models and vocabs available here: http://data.statmt.org/bergamot/models/eten/

More specifically, we have used the english-estonian one: http://data.statmt.org/bergamot/models/eten/enet.student.tiny11.tar.gz

jerinphilip commented 2 years ago

My understanding is that this is sorted sometime alongside #251. Closing. Please re-open if there is something unresolved.