JuliaText / TextAnalysis.jl

Julia package for text analysis

ROUGE: fixed sentences calculation and some minor refactoring #272

Closed by rssdev10 8 months ago

rssdev10 commented 9 months ago

Hi, I started looking at the ROUGE-N/L evaluation components and found that the code was inefficient in places and in others looked outright buggy. This is a first commit with some improvements and fixes.

First of all, I opened the original article https://aclanthology.org/W04-1013.pdf and found an incorrect mention of the "jackknifing procedure". The code at https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/utils.jl#L7 contains an implementation named jackknife_avg, but it is not an average. What is implemented looks like the argmax literally mentioned in the paper, yet it carries the wrong description. See https://en.wikipedia.org/wiki/Jackknife_resampling: the jackknife average of a plain mean is identical to the plain average, so it is never used in this role.
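
To see why a true jackknife average adds nothing here, a minimal plain-Julia sketch (standalone toy values, not package code): the mean of the leave-one-out means equals the plain mean, so only a best-score/argmax reading makes such a function do anything.

    using Statistics

    scores = [0.4, 0.5, 0.9]
    # leave-one-out means, as jackknife resampling prescribes
    loo = [mean(deleteat!(copy(scores), i)) for i in eachindex(scores)]
    mean(loo) ≈ mean(scores)   # true: the jackknife average is just the mean
    maximum(loo)               # what a "best leave-one-out score" reading computes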

Moreover, I found the original implementation, ROUGE-1.5.5.pl, and in its changelog I see:

#              1.4
#              (1) Remove internal Jackknifing procedure:
#                  Now the ROUGE script will use all the references listed in the
#                  <MODEL></MODEL> section in each <EVAL></EVAL> section and no
#                  automatic Jackknifing is performed. Please see RELEASE-NOTE.txt
#                  for more details.

In the Perl implementation I see a very simple solution:

        if ( $scoreMode eq "A" ) {
            $totalLCSHit    += $lcsHit;
            $totalLCSCount  += $lcsBase;
            $totalLCSCountP += $peer_1grams{"_cn_"};
        }
        elsif ( $scoreMode eq "B" ) {
            if ( $lcsScore > $lcsScoreBest ) {

                # only take a better score (i.e. better match)
                $lcsScoreBest   = $lcsScore;
                $totalLCSHit    = $lcsHit;
                $totalLCSCount  = $lcsBase;
                $totalLCSCountP = $peer_1grams{"_cn_"};
            }
        }
        else {
            # use average mode
            $totalLCSHit    += $lcsHit;
            $totalLCSCount  += $lcsBase;
            $totalLCSCountP += $peer_1grams{"_cn_"};
        }

If there are no objections, my proposal is to remove the function named jackknife_avg and implement the score combination in a similar way.
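
For concreteness, a rough Julia sketch of those Perl score modes (hypothetical names, not the final PR code), where each reference contributes a (hit, base, peer, score) tuple:

    # :best mirrors scoreMode "B": keep only the counts of the best match;
    # anything else mirrors "A"/average mode: accumulate over all references.
    function combine_counts(refs; mode::Symbol = :average)
        if mode == :best
            best = refs[argmax([r.score for r in refs])]
            return (best.hit, best.base, best.peer)
        else
            return (sum(r -> r.hit, refs),
                    sum(r -> r.base, refs),
                    sum(r -> r.peer, refs))
        end
    end

    refs = [(hit = 3, base = 5, peer = 6, score = 0.55),
            (hit = 4, base = 5, peer = 7, score = 0.67)]
    combine_counts(refs; mode = :best)   # (4, 5, 7)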

I also found a very suspicious place in the method rouge_l_summary. There is a loop `for ref_sent in ref_sent_list` over the sentences of the reference text, but `ref_sent` is never used in the inner loop. OK, I fixed it, but the unit tests now fail because the final score comes out lower than before.

@djokester and @Ayushk4 how did you calculate the values for the unit test?

Next, I found an inefficient implementation of the method weighted_lcs, which creates an array of arrays of Float64 instead of a single matrix of Int32. I also split this method in two to provide type stability: the original weighted_lcs returns either Int64 or String depending on its arguments, which is definitely another performance issue.
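
A minimal sketch of that split (hypothetical names, plain unweighted LCS for brevity): one flat Int32 matrix for the DP table, plus two entry points with concrete return types.

    function lcs_table(x::Vector{String}, y::Vector{String})
        m, n = length(x), length(y)
        L = zeros(Int32, m + 1, n + 1)   # one flat matrix, no Vector{Vector{Float64}}
        for i in 1:m, j in 1:n
            L[i+1, j+1] = x[i] == y[j] ? L[i, j] + 1 : max(L[i, j+1], L[i+1, j])
        end
        return L
    end

    # always returns an Int
    lcs_length(x, y) = Int(lcs_table(x, y)[end, end])

    # always returns a String, reconstructed by backtracking through the table
    function lcs_string(x::Vector{String}, y::Vector{String})
        L = lcs_table(x, y)
        i, j, out = length(x), length(y), String[]
        while i > 0 && j > 0
            if x[i] == y[j]
                pushfirst!(out, x[i]); i -= 1; j -= 1
            elseif L[i, j+1] >= L[i+1, j]
                i -= 1
            else
                j -= 1
            end
        end
        return join(out, " ")
    end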

At the same time, all of the methods rouge_n, rouge_l_sentence, and rouge_l_summary are type unstable and return either a single score or an array of scores depending on their arguments. We can annotate the type as Union{Float64, Vector{Float64}}, but do we really need that? For now it is the only non-API-breaking way to fix them.

An alternative is to always return an array and provide an external average/argmax function. Or to return a named tuple with the single score, the array, and perhaps separate recall, precision, and F values.
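
If the named-tuple route wins, it could look like this sketch (hypothetical shape, using the standard ROUGE F-measure with parameter β):

    # F_β = (1 + β²) · P · R / (R + β² · P), the F-score ROUGE reports
    function fmeasure(precision::Float64, recall::Float64; β::Float64 = 1.0)
        denom = recall + β^2 * precision
        f = denom == 0 ? 0.0 : (1 + β^2) * precision * recall / denom
        return (precision = precision, recall = recall, fmeasure = f)
    end

    fmeasure(0.5, 0.4)   # (precision = 0.5, recall = 0.4, fmeasure ≈ 0.444)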

Any thoughts on this?

aviks commented 9 months ago

If the best way to fix the rouge_n methods is to break API compatibility, I think that should be fine. We can tag a new minor release with that, or even a 1.0.

rssdev10 commented 8 months ago

Added the initial changes. For reference, here is the inverse weighting function from ROUGE-1.5.5.pl:

    sub wlcsWeightInverse {
        my $r     = shift;
        my $power = shift;

        return $r**( 1 / $power );
    }

But Google's implementation doesn't use weights - https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L210

The method `rouge_l_sentence(reference_summaries, candidate_summary, 1)` still gives a different result than Google's implementation. But there is a difference in the tokenizers (the dot symbol is a token in our case).
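
A toy illustration of that tokenizer mismatch (made-up sentence, not actual package output):

    ours   = ["the", "cat", "sat", "."]   # our tokenizer keeps the dot as a token
    google = ["the", "cat", "sat"]        # Google's scorer drops the punctuation
    length(ours), length(google)          # (4, 3): different unigram denominators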

I see one failing test with Julia 1.3 because I used `@kwdef`, purely for declaration convenience. Do we really need to support Julia 1.3?

Any feedback?
aviks commented 8 months ago

We don't need 1.3. Can do 1.6 minimum.

rssdev10 commented 8 months ago

rouge_l_sentence now gives results similar to Google's implementation. But that is because the previous weighting has been turned off.

And the last open question: do we need sqrt as the weighting function, or should we follow ROUGE-1.5.5.pl and use a power function with argument 0.5 for this case?
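
Note that the two options coincide numerically; the question is only about the API shape. With ROUGE-1.5.5.pl's weight f(r) = r^power:

    weight(r, power) = r^power
    weight(4.0, 0.5) ≈ sqrt(4.0)   # true: power 0.5 is exactly sqrt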

rssdev10 commented 8 months ago

Added a minor change: the weighting function can now be passed as an argument when necessary. At least both cases are covered now: Google's without weighting, and the previous implementation with a fixed sqrt. If there are no other ideas, I'm willing to merge it.
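
As a sketch of the idea (hypothetical names, not the exact PR signature), the caller supplies the weighting applied to the lengths of consecutive-match runs:

    # WLCS-style score over the lengths of consecutive-match runs
    weighted_run_score(run_lengths; weighting = identity) = sum(weighting, run_lengths)

    weighted_run_score([1, 4, 9])                    # 14, unweighted (Google-style)
    weighted_run_score([1, 4, 9]; weighting = sqrt)  # 6.0, the fixed-sqrt behaviour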

aviks commented 8 months ago

> If there are no other ideas, I'm willing to merge it.

I'm OK with that plan.