boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

Problem in KP-Miner candidate weighting #117

Closed HamidHabibzadeh closed 4 years ago

HamidHabibzadeh commented 4 years ago

In the KP-Miner method, when candidate_weighting is computed, every candidate is multiplied by the same fixed number (the boosting factor B). Shouldn't this factor be calculated for unigram phrases only, and df computed for all candidates?

ygorg commented 4 years ago

Hi, are you saying this because of this sentence in the paper (2.2, p. 191)?

So, a boosting factor is needed for compound terms in order to balance this bias towards single terms.

Are you saying that because the "boosting factor is needed for compound terms", it should be applied only to compound terms and not to single words?

HamidHabibzadeh commented 4 years ago

Yes, exactly, because in your KP-Miner implementation the boosting factor (B) is multiplied into every candidate phrase, so it has no meaningful effect on the candidate weighting:

    # compute the boosting factor
    B = min(N_d / (P_d * alpha), sigma)

    # loop through the candidates
    for k, v in self.candidates.items():

        # get candidate document frequency
        candidate_df = 1

        # get the df for unigram only
        if len(v.lexical_form) == 1:
            candidate_df += df.get(k, 0)

        # compute the idf score
        idf = math.log(N / candidate_df, 2)

        # every candidate, unigram or compound, is multiplied by B here
        self.weights[k] = len(v.surface_forms) * B * idf

ygorg commented 4 years ago

Yes, I agree, but the article also states that:

the following equation is used to calculate the weight of candidate keyphrases whether single or compound: $w_{ij} = tf_{ij} \times idf \times B_i \times P_f$

This contradicts the previous statement:

So, a boosting factor is needed for compound terms in order to balance this bias towards single terms.

I evaluated the actual implementation and a modified implementation (see below).

    if len(v.lexical_form) == 1:
        self.weights[k] = len(v.surface_forms) * idf
    else:
        self.weights[k] = len(v.surface_forms) * B * idf
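
To make the difference between the two variants concrete, here is a minimal, self-contained sketch. It is not the pke code: the candidate table, the tf and idf values, and the helper names (boosting_factor, weights, ranking) are all made up for illustration, and the alpha and sigma values are just example settings. It shows that multiplying every candidate by the same B leaves the ranking untouched, while boosting only compound candidates can reorder it.

```python
def boosting_factor(n_d, p_d, alpha=2.3, sigma=3.0):
    """B = min(N_d / (P_d * alpha), sigma), as in the snippet quoted earlier."""
    return min(n_d / (p_d * alpha), sigma)


def weights(candidates, B, boost_unigrams):
    """Score each candidate as tf * boost * idf.

    boost_unigrams=True  -> B applied to every candidate
    boost_unigrams=False -> B applied to compound candidates only
    """
    out = {}
    for phrase, (n_words, tf, idf) in candidates.items():
        boost = B if (n_words > 1 or boost_unigrams) else 1.0
        out[phrase] = tf * boost * idf
    return out


def ranking(w):
    """Candidates sorted by decreasing weight."""
    return sorted(w, key=w.get, reverse=True)


# Hypothetical candidates: phrase -> (number of words, tf, idf)
candidates = {
    "model": (1, 6, 1.0),
    "keyphrase extraction": (2, 3, 1.5),
    "graph": (1, 2, 1.2),
    "boosting factor": (2, 1, 2.0),
}

# Toy counts: all candidate occurrences, and compound occurrences only
n_d = sum(tf for _, tf, _ in candidates.values())
p_d = sum(tf for n_words, tf, _ in candidates.values() if n_words > 1)
B = boosting_factor(n_d, p_d)

unboosted = ranking(weights(candidates, 1.0, boost_unigrams=True))
uniform = ranking(weights(candidates, B, boost_unigrams=True))
compound_only = ranking(weights(candidates, B, boost_unigrams=False))

print(uniform == unboosted)  # a constant factor cannot change the order
print(compound_only)
```

With these toy numbers, "boosting factor" overtakes "graph" only in the compound-only variant; the uniform variant ranks exactly like no boost at all.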

The evaluation is performed on the SemEval-2010 test set (100 documents) against the combined (reader + author) reference. Every keyphrase is stemmed for evaluation.

| Method   | P@15 | R@15 | F@15 |
|----------|------|------|------|
| actual   | 21.1 | 22.0 | 21.4 |
| modified | 23.3 | 24.0 | 23.4 |

The modified implementation (not applying boosting factor to single word keyphrases) yields better results. I'll make a commit to change that. Thanks for your input.
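
For readers who want to reproduce this kind of comparison, here is a rough sketch of how P@15, R@15 and F@15 can be computed for one document and averaged over a collection. It is not the evaluation script used for the numbers above: the function names are hypothetical, keyphrases are assumed to be stemmed already, precision is assumed to be taken over the candidates actually returned, and per-document averaging is an assumption too.

```python
def prf_at_k(predicted, reference, k=15):
    """Precision, recall and F1 of the top-k predictions for one document.

    Both lists are assumed to contain already-stemmed keyphrases.
    """
    top_k = predicted[:k]
    ref = set(reference)
    n_correct = len(set(top_k) & ref)
    precision = n_correct / len(top_k) if top_k else 0.0
    recall = n_correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_average(per_document_scores):
    """Average (P, R, F) tuples over documents, one component at a time."""
    n = len(per_document_scores)
    return tuple(sum(s[i] for s in per_document_scores) / n for i in range(3))


# Toy data for two documents (made up for illustration)
doc1 = prf_at_k(["a", "b", "c", "d"], ["b", "d", "e"], k=3)
doc2 = prf_at_k(["x", "y"], ["x", "y", "z", "w"], k=15)
print(doc1)
print(doc2)
print(macro_average([doc1, doc2]))
```

When scores are averaged per document like this, the averaged F need not equal the harmonic mean of the averaged P and R.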

ygorg commented 4 years ago

Fixed in #128