SkBlaz / rakun

Rank-based Unsupervised Keyword Extraction via Metavertex Aggregation
GNU General Public License v3.0

"for loop" in "missing connective" part is a suspected case of bug #5

apollllon closed this issue 4 years ago

apollllon commented 4 years ago

Hello, I majored in computer science at Okayama University in Japan.

While using this algorithm, RaKUn, for an NLP task on Japanese text, I found that the for loop in the "missing connectives" part may contain a bug.

It is the following part, in file __init__.py, line 449 ...

for ind in i1_indexes:
    if ind + 2 in i2_indexes_map:
        joint_kw = " ".join([
            p1, self.raw_text[ind + 1],
            self.raw_text[ind + 2]
        ])
        final_keywords.append((joint_kw, kw[1]))
        joint = True

I think that if this for loop is not exited with a break, multiple similar keywords may be added. Is this the intended behavior? Could you please confirm?
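To make the question concrete, here is a minimal sketch of the loop on toy data. The token list, the score, and the shapes of i1_indexes and i2_indexes_map (a list of positions and a set of positions) are assumptions made for illustration; the real structures in rakun's __init__.py may differ.

```python
# Toy stand-ins (assumed for illustration; not taken from rakun itself).
raw_text = "castle for knights and castle for knights or castle of knights".split()
p1, p2 = "castle", "knights"
score = 0.5  # hypothetical keyword score, playing the role of kw[1]

# Positions of the first and second keyword tokens in the text.
i1_indexes = [i for i, tok in enumerate(raw_text) if tok == p1]
i2_indexes_map = {i for i, tok in enumerate(raw_text) if tok == p2}


def join_with_connective(stop_at_first):
    """Join p1 and p2 when exactly one token (the connective) sits between them."""
    final_keywords = []
    for ind in i1_indexes:
        if ind + 2 in i2_indexes_map:
            joint_kw = " ".join([p1, raw_text[ind + 1], raw_text[ind + 2]])
            final_keywords.append((joint_kw, score))
            if stop_at_first:
                break  # the change proposed in this issue
    return final_keywords


without_break = join_with_connective(stop_at_first=False)
with_break = join_with_connective(stop_at_first=True)
print(without_break)
print(with_break)
```

In this toy text, the loop without break adds "castle for knights" twice plus "castle of knights" once, while the loop with break keeps only the first match.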

SkBlaz commented 4 years ago

Hey @apollllon, thanks for the issue. The connectives feature is in beta and could indeed misbehave. However, adding a break in the case you raised would prevent multiple similar phrases from being added, for example "castle for knights" and "castle of knights", which might carry valuable semantic meaning. I think the solution is to include all such keywords but keep only the unique ones. This would solve both problems. You can implement it if you want and open a pull request!

Thanks
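One way the "include all, keep only unique" idea could look is sketched below. This is not code from rakun: the helper name unique_keywords and the sample candidates are hypothetical, and it assumes each entry is a (phrase, score) tuple, as in the final_keywords.append call quoted above.

```python
def unique_keywords(final_keywords):
    """Keep the first occurrence of each phrase, preserving the original order."""
    seen = set()
    unique = []
    for phrase, score in final_keywords:
        if phrase not in seen:
            seen.add(phrase)
            unique.append((phrase, score))
    return unique


# Hypothetical output of the connective loop when run without a break.
candidates = [
    ("castle for knights", 0.5),
    ("castle for knights", 0.5),
    ("castle of knights", 0.5),
]

deduplicated = unique_keywords(candidates)
print(deduplicated)
```

Filtering on the phrase alone, rather than on the whole tuple, is a design choice of this sketch: it drops exact repeats while keeping distinct connective variants such as "castle for knights" and "castle of knights".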

apollllon commented 4 years ago

@SkBlaz, thank you for the kind and quick answer!

I understand your idea. Thank you.