If the best way to fix the `rouge_n` methods is to break API compatibility, I think that should be fine. We can tag a new minor release with that. Or even a 1.0.
Added initial changes:
Found an undocumented "novelty" in the initial code at https://github.com/JuliaText/TextAnalysis.jl/blob/eb6b9aa8d02823f98a045b873b54a3fed1356efb/src/utils.jl#L51: the weights are calculated with the `sqrt` function. The original Perl-based implementation, ROUGE-1.5.5.pl, uses a power function instead:
```perl
sub wlcsWeight {
    my $r = shift;
    my $power = shift;
    return $r**$power;
}

sub wlcsWeightInverse {
    my $r = shift;
    my $power = shift;
    return $r**( 1 / $power );
}
```
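For comparison, here is a rough Julia transcription of those two weight functions (the names are mine, purely for illustration):

```julia
# Rough Julia transcription of the Perl weight functions (hypothetical names).
wlcs_weight(r, power) = r^power
wlcs_weight_inverse(r, power) = r^(1 / power)

# With power = 2, the inverse weight is just sqrt, which is what the current
# Julia code hard-codes:
@assert wlcs_weight_inverse(9.0, 2) ≈ sqrt(9.0)
```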
Google's implementation, however, doesn't use weights at all: https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L210
The method `rouge_l_sentence(reference_summaries, candidate_summary, 1)` still gives a different result than Google's implementation, but there is also a difference in the tokenizers (the dot symbol is a separate token in our case).
One test fails on Julia 1.3 because I used `@kwdef` purely for convenience. Do we really need to support Julia 1.3?
Any feedback?
We don't need 1.3. Can do 1.6 minimum.
`rouge_l_sentence` now gives a result similar to Google's implementation, but this is because the previous weighting has been turned off.
And the last open question: do we need `sqrt` as the weighting function? Or should we match ROUGE-1.5.5.pl and use the power function with the argument 0.5 for this case?
Added a minor change: the weighting function can now be passed as an argument if necessary. At minimum, both cases are covered: Google's version without weighting and the previous implementation with a fixed `sqrt`.
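For illustration, this is roughly the shape of the change; the function name and formula below are my own sketch of a ROUGE-W-style recall term, not the exact code:

```julia
# Sketch only: a ROUGE-W-style recall term where the weighting function and
# its inverse are passed in instead of being hard-coded. Names are hypothetical.
wlcs_recall(wlcs, m; f = identity, f_inv = identity) = f_inv(wlcs / f(m))

wlcs_recall(5.0, 10.0)                              # Google-style, no weighting
wlcs_recall(5.0, 10.0; f = x -> x^2, f_inv = sqrt)  # ROUGE-1.5.5.pl with power = 2
```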
If there are no other ideas, I'm willing to merge it.
I'm OK with that plan.
Hi, I started looking at the ROUGE-N/L evaluation components and found that the code was sometimes inefficient and sometimes looked buggy. This is a first commit with some improvements and fixes.
First of all, I opened the original article https://aclanthology.org/W04-1013.pdf and found an incorrect mention of the "jackknifing procedure". The code at https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/utils.jl#L7 contains an implementation named `jackknife_avg`, but it is not an average. What is actually implemented looks like the `argmax` literally mentioned in the paper, only with the wrong description. See https://en.wikipedia.org/wiki/Jackknife_resampling: the jackknife average is the same as the ordinary average and is never used in this role.

Moreover, I found the original implementation, ROUGE-1.5.5.pl, and in its changelog I see:
In the Perl implementation I see a very simple solution:
If there are no objections, my proposal is to remove the `jackknife_avg` function and implement this in a similar way.
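As a hypothetical sketch of that argmax-style aggregation in Julia (the toy scorer below is mine, purely to make the call concrete):

```julia
# Hypothetical sketch: score the candidate against every reference with a
# pairwise scorer and keep the best score, instead of the misnamed jackknife_avg.
best_reference_score(score_fn, references, candidate) =
    maximum(score_fn(ref, candidate) for ref in references)

# Toy pairwise scorer (token overlap ratio), for illustration only:
toy_score(ref, cand) = length(intersect(split(ref), split(cand))) / length(split(ref))

best_reference_score(toy_score, ["the cat sat", "a dog ran"], "the cat ran")  # 2/3
```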
I found a very suspicious place in the method `rouge_l_summary`. There is a loop `for ref_sent in ref_sent_list` over the sentences of the reference text, but `ref_sent` is not used anywhere in the inner loop. OK, I fixed it, but the unit tests now fail because the final score is lower than the previous one. @djokester and @Ayushk4, how did you calculate the values for the unit tests?
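Roughly, the pattern looks like this (a simplified stand-in, not the literal source code):

```julia
# Simplified stand-in for the suspicious loop (not the literal source):
# before the fix, `ref_sent` was never referenced inside the inner loop.
overlap(a, b) = length(intersect(split(a), split(b)))  # stand-in for the LCS step

function summary_overlap(ref_sent_list, cand_sent_list)
    total = 0
    for ref_sent in ref_sent_list
        for cand_sent in cand_sent_list
            total += overlap(ref_sent, cand_sent)  # the fix: actually use ref_sent
        end
    end
    return total
end
```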
Next, I found an inefficient implementation of the method `weighted_lcs`: it builds an array of arrays of `Float64` where a single matrix of `Int32` would do. I also split this method in two to provide type stability, because the initial `weighted_lcs` returned either an `Int64` or a `String`. This is definitely another performance issue; see the first sketch below.

At the same time, all the methods `rouge_n`, `rouge_l_sentence`, and `rouge_l_summary` are type unstable and return either a single score or an array of scores depending on the arguments. We could declare the return type as `Union{Float64, Vector{Float64}}`, but do we really need that? For now it is the only non-API-breaking way to fix this. An alternative is to always return an array and provide an external average/argmax function, or to return a named tuple with the single score, the array, and maybe separate recall, precision, and F values; see the second sketch below.
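First sketch: what I mean by a single matrix. This is a classic LCS dynamic-programming table in one `Matrix{Int32}` (an assumed shape, not the final code):

```julia
# Assumed shape, not the final code: a classic LCS dynamic-programming table
# stored in a single Matrix{Int32} instead of a Vector of Vector{Float64}.
function lcs_length(a::AbstractVector, b::AbstractVector)
    m, n = length(a), length(b)
    table = zeros(Int32, m + 1, n + 1)
    for i in 1:m, j in 1:n
        if a[i] == b[j]
            table[i + 1, j + 1] = table[i, j] + Int32(1)
        else
            table[i + 1, j + 1] = max(table[i, j + 1], table[i + 1, j])
        end
    end
    return table[m + 1, n + 1]  # always an Int32, so the result is type stable
end

lcs_length(split("the cat sat here"), split("the cat ran here"))  # Int32(3)
```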
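Second sketch: the named-tuple option (purely illustrative, not a proposed final API):

```julia
# Purely illustrative: always compute per-reference scores, then return a
# named tuple so the caller chooses the aggregate, avoiding a
# Union{Float64, Vector{Float64}} return type.
function rouge_scores(per_reference::Vector{Float64})
    return (best = maximum(per_reference),
            average = sum(per_reference) / length(per_reference),
            all = per_reference)
end

rouge_scores([0.4, 0.6, 0.5]).best  # 0.6
```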
Any thoughts on this?