ELITR / SLTev

SLTev is a tool for comprehensive evaluation of (simultaneous) spoken language translation.

What are the different sacrebleu scores? #64

Closed bhaddow closed 3 years ago

bhaddow commented 3 years ago

Hi

There are four different sacrebleu scores in the output:

tot      sacreBLEU     docAsWhole            37.309
avg      sacreBLEU     mwerSegmenter          28.710
detailed sacreBLEU     span-000000-003000     32.876
avg      sacreBLEU     span*                  32.876

I am guessing that (i) the third and fourth are the same: sentence BLEU averaged across the document; (ii) the first is the whole-document BLEU; and (iii) the second is the "standard" sacreBLEU (if the reference segmentation is the same as the SLT segmentation).

Is that correct? It's not really clear from the descriptions. I had to read the code.

Running sacrebleu externally, on complete sentences only, gives 28.6 for this data set.

best Barry

mohammad2928 commented 3 years ago

Hi,

Yes, the third and fourth scores are the time-span scores, which divide the OStt length (end_time - start_time) into time spans (default 3000 ms). If the OStt length is less than one time span, these (third and fourth) scores will be equal.

For the first score, all complete segments in the candidate are concatenated, the same is done for the reference, and sacreBLEU is calculated on the two resulting documents.

And finally, for the second score, we use mwerSegmenter to resegment the candidate file (complete segments only) and then calculate sacreBLEU (the number of segments in the reference and in the mwerSegmenter output is equal).
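
To make the first two scores concrete, here is a minimal sketch using the sacrebleu Python package; the segment lists are illustrative, and the resegmentation step is only assumed (in SLTev it is done by the external mWER segmenter binary), so this is not SLTev's actual code:

import sacrebleu

# Illustrative segments; in SLTev these come from the reference file and the
# complete segments of the candidate file.
reference_segments = ["this is the first sentence .", "and here is the second ."]
candidate_segments = ["this is the first sentence .", "here is second sentence ."]

# docAsWhole: concatenate each side into one "document" and score that single pair.
ref_doc = " ".join(reference_segments)
cand_doc = " ".join(candidate_segments)
print("docAsWhole", sacrebleu.corpus_bleu([cand_doc], [[ref_doc]]).score)

# mwerSegmenter: in SLTev the candidate is first resegmented by the external
# mWER segmenter so that it has exactly as many segments as the reference;
# here we simply assume candidate_segments is already that resegmented output.
print("mwerSegmenter", sacrebleu.corpus_bleu(candidate_segments, [reference_segments]).score)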

Running sacrebleu externally, on complete sentences only, gives 28.6 for this data set.

Yes, it is because of tokenization.
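
For example, with the sacrebleu Python package the score depends on the tokenize= argument, so an externally computed score and SLTev's score can differ slightly even on identical text. Which tokenizer SLTev applies is not shown here; this is only an illustration of the effect:

import sacrebleu

hyp = ["the cat sat on the mat."]
ref = [["the cat sat on the mat ."]]

# With the default "13a" tokenizer the attached period is split off, so both
# sides tokenize the same; with tokenize="none" the text is scored as-is.
print(sacrebleu.corpus_bleu(hyp, ref).score)                   # high (likely 100.0)
print(sacrebleu.corpus_bleu(hyp, ref, tokenize="none").score)  # lower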

Thanks, Mohammad

bhaddow commented 3 years ago

Do you mean that the third score is based on splitting the test set up into 3000 ms segments?

I do not understand the difference between the 1st and 4th scores.

mohammad2928 commented 3 years ago

Do you mean that the third score is based on splitting the test set up into 3000 ms segments?

Yes, in the third one, for each time span (default 3000 ms), the tokens (in the reference and candidate) that fall in that time span are extracted, and a BLEU score is calculated for each span. For example, if the length of the OStt file were 5000 ms, there would be two time spans:

0-3000
3000-5000

so, we have:

detailed sacreBLEU     span-000000-003000     60
detailed sacreBLEU     span-003000-005000     50
avg      sacreBLEU     span*                  55

I do not understand the difference between the 1st and 4th scores.

In the first one, all sentences in the reference are concatenated and treated as one document (the T table tokens in the paper), and the same is done for the complete segments of the candidate.
But in the 4th, the average of the time-span BLEU scores is taken.
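
A rough, purely illustrative sketch of this bucketing (not SLTev's implementation): it assumes we already have (start_time_ms, token) pairs for the reference and the candidate, cuts the OStt length into 3000 ms spans, scores each span with sacreBLEU, and averages the span scores for span*:

import sacrebleu

SPAN = 3000  # ms, the default time-span length

def bucket(timed_tokens, total_length):
    # Group (start_time_ms, token) pairs into consecutive SPAN-long windows.
    spans = []
    for start in range(0, total_length, SPAN):
        end = min(start + SPAN, total_length)
        words = [tok for t, tok in timed_tokens if start <= t < end]
        spans.append(" ".join(words))
    return spans

# Illustrative timed tokens; in SLTev the times come from the OStt/candidate files.
ref_tokens  = [(0, "hello"), (1200, "world"), (3100, "again"), (4200, ".")]
cand_tokens = [(0, "hello"), (1500, "word"),  (3300, "again"), (4600, ".")]

ref_spans  = bucket(ref_tokens, 5000)
cand_spans = bucket(cand_tokens, 5000)

scores = [sacrebleu.corpus_bleu([c], [[r]]).score
          for c, r in zip(cand_spans, ref_spans)]
for start, score in zip(range(0, 5000, SPAN), scores):
    print(f"detailed sacreBLEU span-{start:06d} {score:.3f}")
print("avg sacreBLEU span*", sum(scores) / len(scores))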

bhaddow commented 3 years ago

Ah thanks, that makes sense. None of this is obvious from the output file, so is there a good place to add an explanation? Maybe in the output file itself?

mohammad2928 commented 3 years ago

I think the best place is the header of the output, so there would be a header like the following:

P ... considering Partial segments in delay and quality calculation (in addition to Complete segments)
T ... considering source Timestamps supplied with MT output
W ... segmenting by mWER segmenter (i.e. not segmenting by MT source timestamps)
A ... considering word alignment (by GIZA) to relax word delay (i.e. relaxing more than just linear delay calculation)
docAsWhole ... concatenating all reference segments and candidate complete segments as two documents
mwerSegmenter ... using mWER to resegment complete candidate segments according to the reference segments
span-START-END ... the time span between START and END times (only tokens in the time span are considered)
span* ... average over all time spans
------------------------------------------------------------------------------------------------------------
bhaddow commented 3 years ago

That seems good - but I don't see it. Do I have to add an extra option?

mohammad2928 commented 3 years ago

I have not added it to the code yet. I will add it in the next version (1.2.1).

Do I have to add an extra option?

If you have any suggestions, feel free to update it or add options as you like.

bhaddow commented 3 years ago

I see, I thought you meant that it was already there. Do the last 4 lines just pertain to the bleu score? I would put a line above them to say this.

mohammad2928 commented 3 years ago

I see, I thought you meant that it was already there.

Yes, only the last 4 lines do not exist in the scripts yet.

Do the last 4 lines just pertain to the bleu score?

Yes, the others pertain to the delay scores.

mohammad2928 commented 3 years ago

Hi,

I have updated the header in the new version (1.2.1).

Thanks, Mohammad

bhaddow commented 3 years ago

Thanks!