antonio-mastropaolo / LANCE


question about bleu-4 #1

Closed ItsBean closed 1 year ago

ItsBean commented 1 year ago

I have read your paper "Using Deep Learning to Generate Complete Log Statements", and I wonder how you calculated the BLEU-4 metric for the log messages.

1. Do you use the NLTK package for the BLEU-4 calculation?

2. If so, do you compute corpus_bleu or the average of sentence_bleu? There are two ways to do it in the NLTK package: from nltk.translate.bleu_score import corpus_bleu and from nltk.translate.bleu_score import sentence_bleu.

3. How do you calculate BLEU-4? Like corpus_bleu(refs, preds, weights=(0.25, 0.25, 0.25, 0.25)), with NLTK's default weights, or like corpus_bleu(refs, preds, weights=(0, 0, 0, 1)), to get the individual 4-gram overlap score? (A sketch of both calls is below.)
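
For concreteness, here is a minimal NLTK sketch of the two variants I mean (the token lists are made-up examples, not data from the paper):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy, tokenized data: each hypothesis gets a list of reference token lists.
refs = [
    [["failed", "to", "open", "the", "file"]],
    [["unexpected", "value", "for", "parameter", "x"]],
]
preds = [
    ["failed", "to", "open", "the", "log", "file"],
    ["unexpected", "value", "for", "x"],
]
smooth = SmoothingFunction().method1

# Cumulative BLEU-4: geometric mean of the 1- to 4-gram precisions (NLTK's default weights).
cumulative_bleu4 = corpus_bleu(refs, preds, weights=(0.25, 0.25, 0.25, 0.25),
                               smoothing_function=smooth)

# Individual 4-gram score: only the 4-gram overlap contributes.
four_gram_only = corpus_bleu(refs, preds, weights=(0, 0, 0, 1),
                             smoothing_function=smooth)

print(cumulative_bleu4, four_gram_only)
```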

antonio-mastropaolo commented 1 year ago

Hi @ItsBean

You can use the multi-bleu.perl script, which you can find at the following link: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
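
In case it helps, a minimal sketch of how the script is typically invoked (the file names are placeholders, one prediction/reference per line; the script takes the tokenized reference file as an argument and reads the hypotheses from stdin):

```python
import subprocess

# Placeholder file names: predictions.txt and references.txt must use the same tokenization.
with open("predictions.txt") as hypotheses:
    subprocess.run(
        ["perl", "multi-bleu.perl", "references.txt"],
        stdin=hypotheses,
        check=True,
    )
```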

ItsBean commented 1 year ago

Thank you very much!

I have tried the multi-bleu.perl script, and the output is like this:

```
It is not advisable to publish scores from multi-bleu.perl. The scores depend on your tokenizer,
which is unlikely to be reproducible from your paper or consistent across research groups.
Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.
Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.

BLEU = 13.56, 51.6/25.0/14.0/9.3 (BP=0.668, ratio=0.713, hyp_len=107012, ref_len=150121)
```
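
If I read the output correctly, the headline number is the brevity penalty times the geometric mean of the four n-gram precisions; a quick check with the values above:

```python
from math import exp, log

# Values copied from the multi-bleu.perl output above.
precisions = [51.6, 25.0, 14.0, 9.3]   # 1- to 4-gram precisions (in percent)
bp = 0.668                             # brevity penalty

geo_mean = exp(sum(log(p) for p in precisions) / len(precisions))
print(round(bp * geo_mean, 2))  # ~13.5, i.e. the reported BLEU = 13.56 up to rounding
```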

I am not sure what "BLEU-4" means in the paper: do you report the value 13.56 or the value 9.3?

I found that lots of papers report the value 13.56.

Also, do you apply any tokenization preprocessing before computing BLEU with the multi-bleu.perl script?

antonio-mastropaolo commented 1 year ago

Hi @ItsBean

Could you please be more specific about the hypothesis and reference texts you are using? As for which value is BLEU-4: it is the 4-gram score (9.3), as you correctly pointed out. If you want another tool to compute it, I would advise sacrebleu, which seems to be the most reliable tool for this score.

I'm posting the link here: https://github.com/mjpost/sacrebleu

ItsBean commented 1 year ago

Hello, I have checked sacrebleu and its output is similar to multi-bleu.perl's.

```python
In [1]: from sacrebleu.metrics import BLEU, CHRF, TER
   ...:
   ...: refs = [  # First set of references
   ...:     ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
   ...:     # Second set of references
   ...:     ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
   ...: ]
   ...: sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

In [2]: bleu = BLEU()

In [3]: bleu.corpus_score(sys, refs)
Out[3]: BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)
```

From this discussion ([Question] How to interpret the evaluation metric for translation? · Issue #80 · mjpost/sacrebleu), I think the BLEU-4 value here is 48.53, and 37.5 is just the 4-gram precision score. In the paper "Using Deep Learning to Generate Complete Log Statements", do you report the value corresponding to 37.5 or the one corresponding to 48.53? There are different explanations of "BLEU-4" on the web, and I really wonder which value is reported in the paper. Thank you very much!

antonio-mastropaolo commented 1 year ago

Hey @ItsBean, in our paper we report BLEU-4; from the output you posted, that is 37.5. The score you are referring to (48.53) is called BLEU-A, which is the geometric mean of the different n-gram scores.
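
To make the distinction concrete, here is a small sketch reusing the refs/sys from your snippet above; sacrebleu's result object exposes both numbers:

```python
from sacrebleu.metrics import BLEU

refs = [
    ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
    ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
]
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

result = BLEU().corpus_score(sys, refs)
print(result.score)          # 48.53 -> the overall geometric-mean score ("BLEU-A" above)
print(result.precisions[3])  # 37.5  -> the 4-gram precision, i.e. BLEU-4
```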

ItsBean commented 1 year ago

Thank you!

zhipeng-cai commented 1 year ago

Hello, dear authors,

I found that the calculation of BLEU-4 is described in your paper as follows:

The analysis of the “wrong” predictions is difficult to perform quantitatively for log messages as done for the level and the position. One possibility is to compute the BLEU (Bilingual Evaluation Understudy) score [37] between the generated and the reference messages. BLEU is used to assess the quality of an automatically generated text. Such a score ranges between 0.0 and 1.0, with 1.0 indicating that the generated and the reference message are identical. We adopt the BLEU-4 variant, which computes the overlap in terms of 4-grams between the generated and the reference messages.

Concerning the best-performing model (similar findings hold for the other models), we obtained an average BLEU-4 of 0.15.

I wonder whether you calculate the BLEU-4 metric only for the log message predictions that are "wrong", meaning that the calculation excludes the perfectly correct predictions.

I'm not sure whether my understanding is correct.

antonio-mastropaolo commented 1 year ago

Hi zhipeng-cai,

We have used both (i.e., perfect and wrong predictions).

Hope this helps.