Hi,
Yes, the third and fourth scores are the time-span scores, which divide the OStt length (end_time - start_time) by a time span (default is 3000 ms); if the OStt length is less than the time span, these two scores (third and fourth) will be equal.
For the first score, all complete segments in the candidate are concatenated, the same is done for the reference, and sacreBLEU is calculated on the two resulting documents.
And finally, for the second score, we use mwerSegmenter to resegment the candidate file (complete segments only) to match the reference, and then sacreBLEU is calculated (the number of segments in the reference and in the mwerSegmenter output are equal).
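To make the first (docAsWhole) score concrete, here is a minimal sketch using the sacrebleu Python API; the segment lists and variable names are made up for illustration and are not SLTev's actual code:

```python
import sacrebleu

# Made-up example segments: only Complete segments of the candidate are used.
complete_candidate_segments = ["this is the first complete segment .",
                               "and here is the second one ."]
reference_segments = ["this is the first complete segment .",
                      "and there is a second one ."]

# docAsWhole: concatenate each side into a single "document" line,
# then score the one-line candidate against the one-line reference.
candidate_doc = " ".join(complete_candidate_segments)
reference_doc = " ".join(reference_segments)

bleu = sacrebleu.corpus_bleu([candidate_doc], [[reference_doc]])
print("docAsWhole sacreBLEU:", round(bleu.score, 1))
```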
Running sacrebleu externally, on complete sentences only, gives 28.6 for this data set.
Yes, that difference is because of tokenization.
Thanks, Mohammad
Do you mean that the third score is based on splitting the test set up into 3000 ms segments?
I do not understand the difference between the 1st and 4th scores.
Do you mean that the third score is based on splitting the test set up into 3000 ms segments?
Yes, in the third one, for each time span (default is 3000 ms), the tokens (in both reference and candidate) that fall inside that span are extracted, and a BLEU score is calculated for each span (see the sketch after the example below). For example, if the length of the OStt file is 5000 ms, there are two time spans:
0-3000
3000-5000
so we have:
detailed sacreBLEU span-000000-003000 60
detailed sacreBLEU span-003000-005000 50
avg sacreBLEU span* 55
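For illustration only (this is not the SLTev implementation), the per-span scores and their average could be computed roughly like this, assuming every token already carries a timestamp in milliseconds:

```python
import sacrebleu

SPAN = 3000        # default time-span length in ms
total_len = 5000   # e.g. OStt end_time - start_time

# Hypothetical (token, timestamp_ms) pairs for candidate and reference.
cand = [("hello", 500), ("world", 1200), ("see", 3400), ("you", 4200)]
ref  = [("hello", 400), ("there", 1100), ("see", 3300), ("you", 4100)]

span_scores = []
start = 0
while start < total_len:
    end = min(start + SPAN, total_len)
    # Extract only the tokens whose timestamps fall inside the current span.
    cand_span = " ".join(tok for tok, t in cand if start <= t < end)
    ref_span  = " ".join(tok for tok, t in ref  if start <= t < end)
    score = sacrebleu.corpus_bleu([cand_span], [[ref_span]]).score
    print(f"detailed sacreBLEU span-{start:06d}-{end:06d} {score:.1f}")
    span_scores.append(score)
    start = end

# The fourth score is the plain average over all spans.
print("avg sacreBLEU span*", round(sum(span_scores) / len(span_scores), 1))
```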
I do not understand the difference between the 1st and 4th scores.
In the first one, all sentences in the reference are concatenated and treated as a single document (the T table tokens in the paper), and the same is done for the complete segments of the candidate.
But in the 4th, the average of the time-span BLEU scores is taken.
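In other words, the 1st score is one corpus-level BLEU over the concatenated documents, while the 4th is a plain average of the per-span BLEU scores; because BLEU does not decompose over segments, the two numbers generally differ even on the same text. A toy comparison (made-up strings, not SLTev code):

```python
import sacrebleu

cand_parts = ["the cat sat on the mat today", "it was a very sunny afternoon"]
ref_parts  = ["the cat sat on a mat today",  "it was quite a sunny afternoon"]

# 1st-score style: one BLEU over the concatenation of all parts.
doc_score = sacrebleu.corpus_bleu([" ".join(cand_parts)], [[" ".join(ref_parts)]]).score

# 4th-score style: average of the BLEU scores of the individual parts.
part_scores = [sacrebleu.corpus_bleu([c], [[r]]).score for c, r in zip(cand_parts, ref_parts)]
avg_score = sum(part_scores) / len(part_scores)

print(round(doc_score, 1), round(avg_score, 1))  # typically not equal
```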
Ah thanks, that makes sense. None of this is obvious from the output file, so is there a good place to add an explanation? Maybe in the output file itself?
I think the best place is in the header of the output, so there would be a header like the following:
P ... considering Partial segments in delay and quality calculation (in addition to Complete segments)
T ... considering source Timestamps supplied with MT output
W ... segmenting by mWER segmenter (i.e. not segmenting by MT source timestamps)
A ... considering word alignment (by GIZA) to relax word delay (i.e. relaxing more than just linear delay calculation)
docAsWhole ... concatenating all reference segments and candidate complete segments as two documents
mwerSegmenter ... using mWER to resegment complete candidate segments according to reference segments
span-START-END ... the time span between START and END times (just tokens in the time-span considered)
span* ... average of all time-spans
------------------------------------------------------------------------------------------------------------
That seems good - but I don't see it. Do I have to add an extra option?
I have not added it to the code yet. I will add it in the next version (1.2.1).
Do I have to add an extra option?
If you have any suggestions, please feel free to update it or add options as you like.
I see, I thought you meant that it was already there. Do the last 4 lines just pertain to the bleu score? I would put a line above them to say this.
I see, I thought you meant that it was already there.
Yes, only the last 4 lines do not exist in the scripts yet.
Do the last 4 lines just pertain to the bleu score?
Yes, the others pertain to the delay scores.
Hi,
I have updated the header in the new version (1.2.1).
Thanks, Mohammad
Thanks!
Hi
There are four different sacrebleu scores in the output:
I am guessing that (i) the third and fourth are the same: sentence BLEU averaged across the document; (ii) the first is the whole-document BLEU; and (iii) the second is the "standard" sacrebleu (if the ref segmentation is the same as the SLT segmentation).
Is that correct? It's not really clear from the descriptions. I had to read the code.
Running sacrebleu externally, on complete sentences only, gives 28.6 for this data set.
Best, Barry