Closed: abarbosa94 closed this issue 2 years ago.
First, are you using a shortlist? We recently discovered that the shortlist exacerbates this problem and we're working on ameliorating it. Second, this is a known issue, as with batching, sentences are reordered and that sometimes changes the results of the GEMMs and would mean that some subtle differences could cascade.
Finally, could you output the sources and the two different targets for the most different sentences (out of curiosity)?
@abarbosa94 Thanks for the detailed information. Could you provide the commit-hash/versions of both marian-dev (assuming this is browsermt/marian-dev) and bergamot-translator that you worked with to generate these?
> Second, this is a known issue, as with batching, sentences are reordered and that sometimes changes the results of the GEMMs and would mean that some subtle differences could cascade.
I'm not very familiar with marian-scorer, but if I remember correctly you mentioned you're using a comparable setting with shortlists off in both cases, which seems to be available? Could you confirm this is the case?
I remember the primary query was a disparity in batched versus single-sample. However, I do not find a plot comparing the two for bergamot-translator and marian-scorer independently, i.e., batched-vs-single-sample scores for (1) bergamot-translator, (2) marian-dev. Is this not a concern at this point?
bergamot-translator and marian-scorer do not batch identically. This leads to floating-point approximation differences that cascade, as Nick mentioned, generating different outputs and probability values in each. That is, I would not expect marian-scorer and bergamot-translator to provide the same output or logprobs for a given input sequence.
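The batching effect described here ultimately comes down to floating-point addition not being associative. A minimal, self-contained sketch (plain Python, nothing bergamot-specific) of how summing the same values in a different order, which is effectively what reordering sentences inside a batched GEMM does, can change the result:

```python
# Floating-point addition is not associative: reordering the same addends
# can change the rounded result. Batching reorders the work inside GEMMs,
# so small differences like this can appear and then cascade.
vals = [1e16, 1.0, -1e16, 1.0] * 1000

left_to_right = 0.0
for v in vals:
    left_to_right += v  # the 1.0 terms are absorbed next to the huge 1e16 terms

reordered = 0.0
for v in sorted(vals):  # same values, ascending order
    reordered += v

print(left_to_right, reordered)  # the two sums disagree
```

The same mechanism applies inside int8/float32 matrix multiplications: same inputs, different accumulation order, slightly different outputs.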
> Finally, could you output the sources and the two different targets for the most different sentences (out of curiosity)?
+1 on the above request, if possible.
If there is a strong requirement for matching this closely (instead of some robustness strategy towards variations), you might have to discard marian-scorer and instead use bergamot-translator to generate the tokens and logprobs you train on. I think this will end up incompatible with your human-annotated training data (cc @mfomicheva). I would also suggest making the QE system robust by applying some adversarial distortions and the like (thinking of bergamot-translator as an adversary who might slightly corrupt inputs to your system).
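As a sketch of the robustness suggestion above (illustrative only; `distort_logprobs` is a hypothetical helper, not part of any bergamot or marian API), one could jitter the token log-probs fed to the QE model during training so it does not overfit to exact score values:

```python
import random

def distort_logprobs(logprobs, scale=0.05, seed=0):
    """Hypothetical augmentation: add a small uniform perturbation to each
    token log-prob, mimicking the kind of noise that batching / int8 GEMM
    differences introduce between scoring runs."""
    rng = random.Random(seed)
    return [lp + rng.uniform(-scale, scale) for lp in logprobs]

scores = [-0.002, -0.861, -0.649, -1.144]
print(distort_logprobs(scores))
```

A QE model trained on several such jittered copies of each example should be less sensitive to which of the two scorers produced the features.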
Copying @graemenail who's involved with a few shortlist manipulations to help QE.
> I'm not very familiar with marian-scorer, but if I remember correctly you mentioned you're using a comparable setting with shortlists off in both cases, which seems to be available? Could you confirm this is the case?
Yes, we disabled the shortlist in both cases.
> Could you provide the commit-hash/versions of both marian-dev (assuming this is browsermt/marian-dev) and bergamot-translator that you worked with to generate these?
Bergamot: https://github.com/browsermt/bergamot-translator/tree/1231057c36afe3e297a55bfdc58dc5bb559591d9
Marian-dev: https://github.com/browsermt/marian-dev/tree/62bac858bfd37060beb707d12eb9711649ea4cf6
> Finally, could you output the sources and the two different targets for the most different sentences (out of curiosity)?
The top 4 sentences with the highest differences:
@abarbosa94 There seem to be differences in tokens between single-line and batch inference. The differences are expected, and there is no way we can get rid of them.
I'm surprised as to how, in such cases, you are matching scores to different tokens as in the first figure. (Edit: never mind, it's between marian-scorer and bergamot.) The score differences can have a cascading effect on the tokens to the right of where the difference starts.
In any case, there is no solution to the batch-size disparity. Note that while mini-batch-words at marian-scorer and bergamot-translator are conceptually similar, they are totally different implementations, and the wrapping and sentence splitting in bergamot mean you will almost certainly never be able to match a batch in bergamot-translator to a batch in marian-scorer through the limited CLI configurability you have.
This is a won't-fix (it's too much effort to try to match marian-scorer and bergamot-translator for this purpose, and not worth the benefits), and I strongly recommend you look towards the following:
> I would also suggest making the QE system robust by applying some adversarial distortions and the like (thinking of bergamot-translator as an adversary who might slightly corrupt inputs to your system).
We should be turning up the lexical shortlist size. It's too small for a single sentence. Won't make it constant though.
Just to make sure we are on the same page, @jerinphilip: the problem is not that we get different translations from bergamot-translator vs. from marian-dev, or that we get different translations depending on the batch size. That's totally fine. The issue that worries me is that we get very different log-probs for the exact same translations. @abarbosa94 will provide a concrete example to illustrate that.
> The issue that worries me is that we get very different log-probs for the exact same translations.
Let me ask this, if you get variations in batching, why is the possibility that the above is the case surprising?
Think of a two-class classification case. If matrix multiplications can create variations like 0.5001 vs 0.4999 with a decision-rule threshold of 0.5 (thus creating a difference in tokens), is it not possible to have a difference of 0.5001 vs 0.5011 (i.e. same token, different probabilities)?
Note that the errors also have a compounding effect across time-steps (so they can grow large enough to matter), but can still potentially lead to the same tokens.
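A tiny softmax sketch of both outcomes in the two-class argument above (the numbers are hypothetical, chosen to mirror the 0.5001/0.4999 example): a perturbation on the order of float error can flip the decision near the threshold, or keep the argmax while shifting the probability.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Near the 0.5 threshold, a tiny logit perturbation flips the decision:
p1 = softmax([0.0004, 0.0])   # ~[0.5001, 0.4999] -> class 0 wins
p2 = softmax([-0.0004, 0.0])  # ~[0.4999, 0.5001] -> class 1 wins

# Away from the threshold, the same perturbation leaves the argmax
# unchanged but still shifts the probability (same token, different score):
q1 = softmax([2.0, 0.0])
q2 = softmax([2.004, 0.0])
print(p1, p2, q1, q2)
```

This is exactly the "same token, different log-prob" situation observed between the two scorers.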
y'_{t} = argmax p(y_{t} | encoder-representations, y_{<t}; \theta)
(@graemenail: Can you please help me with the expected error here as decoding moves from t = 1 to t = N; I guess if what @abarbosa94 shows as a concrete example is within ranges of the error we may expect that the underlying system is okay. The error stems from floating-point representation bit bottleneck I think, which is analogous to the precision of measurements, unsure how to model it).
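One way to build intuition for the compounding across time-steps (a toy model only, not marian's actual numerics): accumulate the same constant in emulated float32 and in float64, and watch the gap between them build up as the number of steps grows.

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE-754 double) to the nearest float32."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Toy model of error compounding over decoding steps t = 1..N: each
# float32 step incurs a tiny representation/rounding error, and the
# accumulated error typically grows with the number of steps.
b = 0.1234567          # arbitrary constant, not exactly representable in f32
x64 = 1.0
x32 = to_f32(1.0)
errors = []
for t in range(50):
    x64 = x64 + b
    x32 = to_f32(x32 + to_f32(b))
    errors.append(abs(x64 - x32))

print(errors[0], errors[-1])  # per-step error vs. accumulated error
```

The absolute error stays tiny, but whether it is "tiny enough" depends on how close the downstream decision (or QE feature) is to a boundary.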
Let's turn up the shortlist size to at least 200. That will decrease the noise in the scores and probably fix some of the quality issues we see when decoding only one sentence.
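One reason shortlist size moves the scores: with a shortlist, the softmax normalizes over a subset of the vocabulary, so every surviving token's probability is inflated relative to the full-vocab softmax, and a larger shortlist brings the normalizer closer to the full one. A toy sketch (random fake output-layer logits, not model output; 32000 is an assumed vocab size):

```python
import math
import random

def softmax_prob(scores, idx):
    """Probability of scores[idx] under a softmax over `scores` only."""
    m = max(scores)
    z = sum(math.exp(s - m) for s in scores)
    return math.exp(scores[idx] - m) / z

random.seed(0)
vocab_scores = [random.gauss(0, 2) for _ in range(32000)]  # fake logits
ranked = sorted(range(len(vocab_scores)),
                key=vocab_scores.__getitem__, reverse=True)

full = softmax_prob(vocab_scores, ranked[0])  # full-vocab probability
probs = {}
for k in (50, 200, 1000):
    shortlist = [vocab_scores[i] for i in ranked[:k]]
    probs[k] = softmax_prob(shortlist, 0)     # best token is shortlist[0]
    print(k, probs[k], full)
```

The shortlisted probability shrinks monotonically towards the full-vocab value as k grows, which is consistent with a larger shortlist reducing score noise.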
Jerin has pointed out my reading skills are poor, and it's said above: "Yes, we disabled the shortlist in both cases." Can you confirm the exact configuration you ran with, especially the int8 settings?
> I guess if what @abarbosa94 shows as a concrete example is within ranges of the error we may expect that the underlying system is okay.
I have chosen four sentences:
We then generated Estonian translations through bergamot. Let's pick the translated sentence for "The North American river otters favor bog lakes with banked shores containing semiaquatic mammal burrows and lakes with beaver lodges", which resulted in:

Põhja-Ameerika jõe otterid eelistavad kaldaga järvi, kus on poolveelised imetajate urtsad ja järved koos peekoniga
If we pick the bpe tokens with the respective log probabilities, we have the following:
[bpe]:(Põhja)(-)(Ameerika)( jõe)( )(ott)(er)(id)( eelistavad)( ka)(lda)(ga)( )(jä)(rvi)(,)( kus)( on)( pool)(ve)(e)(lised)( ime)(tajate)( )(ur)(tsa)(d)( ja)( järve)(d)( koos)( pe)(e)(kon)(iga)(.)()
[logProbs]:-0.00221 -0.00040 -0.02741 -0.02908 -1.11370 -0.00602 -0.86121 -0.00420 -0.64900 -0.73534 -1.14354 -1.14991 -0.66166 -0.17526 -0.04825 -0.07876 -0.66660 -0.25786 -0.13927 -0.11325 -0.77636 -1.88491 -0.02620 -0.87077 -0.58120 -0.56390 -0.34163 -0.00735 -0.01723 -0.13466 -0.00309 -0.76170 -0.68970 -1.21513 -0.57574 -0.44229 -0.03489 -0.00002
If we take the same bpe tokens and use them as input for marian-scorer, we have the following:
[bpe]:(▁Põhja)(-)(Ameerika)(▁jõe)(▁)(ott)(er)(id)(▁eelistavad)(▁ka)(lda)(ga)(▁)(jä)(rvi)(,)(▁kus)(▁on)(▁pool)(ve)(e)(lised)(▁ime)(tajate)(▁)(ur)(tsa)(d)(▁ja)(▁järve)(d)(▁koos)(▁pe)(e)(kon)(iga)(.)(</s>)
[logProbs]:-0.00470 -0.00012 -0.04183 -0.02197 -1.56387 -0.02575 -1.56785 -0.01642 -1.53254 -0.67023 -1.32163 -0.85075 -0.60641 -0.08832 -0.00423 -0.09008 -0.68191 -0.16601 -0.15247 -0.06087 -0.38883 -2.52734 -0.03363 -1.05277 -0.47952 -0.50283 -0.20823 -0.00762 -0.03586 -0.23022 -0.00315 -1.07891 -0.56034 -1.18140 -0.42146 -0.56498 -0.06203 -0.00001
As you can see, the same bpe tokens are used. However, some tokens have a considerable difference in values. For instance, in bergamot the log-probability for token er (position 6) was -0.86121, whereas in marian the same token scored -1.56785. We want to know if this ~0.7 difference is expected or if we can do something about it.
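For what it's worth, the per-token disagreement can be computed directly from the two score lists quoted above (plain Python; the numbers are copied verbatim from the [logProbs] lines):

```python
# Per-token log-prob disagreement between bergamot and marian-scorer,
# using the two score lists quoted above.
bergamot = [-0.00221, -0.00040, -0.02741, -0.02908, -1.11370, -0.00602,
            -0.86121, -0.00420, -0.64900, -0.73534, -1.14354, -1.14991,
            -0.66166, -0.17526, -0.04825, -0.07876, -0.66660, -0.25786,
            -0.13927, -0.11325, -0.77636, -1.88491, -0.02620, -0.87077,
            -0.58120, -0.56390, -0.34163, -0.00735, -0.01723, -0.13466,
            -0.00309, -0.76170, -0.68970, -1.21513, -0.57574, -0.44229,
            -0.03489, -0.00002]
marian = [-0.00470, -0.00012, -0.04183, -0.02197, -1.56387, -0.02575,
          -1.56785, -0.01642, -1.53254, -0.67023, -1.32163, -0.85075,
          -0.60641, -0.08832, -0.00423, -0.09008, -0.68191, -0.16601,
          -0.15247, -0.06087, -0.38883, -2.52734, -0.03363, -1.05277,
          -0.47952, -0.50283, -0.20823, -0.00762, -0.03586, -0.23022,
          -0.00315, -1.07891, -0.56034, -1.18140, -0.42146, -0.56498,
          -0.06203, -0.00001]

diffs = [abs(b - m) for b, m in zip(bergamot, marian)]
worst = max(range(len(diffs)), key=diffs.__getitem__)
print(worst, round(diffs[worst], 5))  # position and size of largest gap
print(round(diffs[6], 5))             # the "er" token discussed above
```

Note the ~0.7 gap at the "er" token is not even the largest one in this sentence, so any acceptance threshold should be set against the worst-case per-token gap, not a single example.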
@abarbosa94 are the model and vocab publicly available somewhere?
@graemenail we used the models and vocabs available here: http://data.statmt.org/bergamot/models/eten/
More specifically, we used the English-Estonian one: http://data.statmt.org/bergamot/models/eten/enet.student.tiny11.tar.gz
My understanding is that this is sorted sometime alongside #251. Closing. Please re-open if there is something unresolved.
We did some analysis comparing bergamot score outputs against the ones returned by marian-scorer, and we noticed that the differences are non-negligible if we increase the batch size. The workflow was the following:

Single-line inference
Batch inference
Configuration files
Bergamot
Marian scorer
Sample sentences
The sentences below are the ones where we noticed the largest differences.