boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

The format of the input file for the supervised algorithms #45

Closed: edoost closed this issue 6 years ago

edoost commented 6 years ago

I wonder what the format of the input file for the supervised algorithms should be. I don't know where to save the gold keyphrases or how to feed them to the model.

corei5 commented 6 years ago

As far as I know, the format of the input file is XML (Stanford CoreNLP); you can also use raw text. I say this because the table of precision (P), recall (R) and F-measure (F) at the top 10 keyphrases is the average over the output for the four XML versions ("lvl-1", "lvl-2", "lvl-3" and "lvl-4") of the test data in the preprocessed SemEval-2010 benchmark dataset. (I actually get similar output.)

Link: https://github.com/boudinfl/semeval-2010-pre
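To make the P/R/F figures above concrete, here is a minimal sketch of how precision, recall and F-measure at the top 10 keyphrases could be computed for one document and then averaged. It is not taken from pke; the function names `prf_at_n` and `macro_average` are hypothetical, and it assumes exact string matching against the gold keyphrases and no duplicate candidates.

```python
def prf_at_n(candidates, gold, n=10):
    """Precision, recall and F-measure of the top-n candidates against a gold list."""
    top = candidates[:n]
    gold_set = set(gold)
    # A candidate counts as correct only if it matches a gold keyphrase exactly.
    correct = sum(1 for c in top if c in gold_set)
    p = correct / len(top) if top else 0.0
    r = correct / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(scores):
    """Average a non-empty list of (P, R, F) tuples component by component."""
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical ranked output and gold keyphrases for two documents.
doc1 = prf_at_n(["a", "b", "c", "d"], ["a", "c", "x"])
doc2 = prf_at_n(["e", "f"], ["e", "f"])
average = macro_average([doc1, doc2])
```

Some evaluation scripts divide precision by the cutoff (10) even when fewer candidates are returned; this sketch divides by the number of candidates actually kept.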

Gold standard keyphrases are the keyphrases assigned by the authors and readers (file name: test.combined.stem.final) [1]. Here I add the gold standard keyphrases file (with author and reader keyphrases). The supervised model is already trained on the training set of the SemEval-2010 benchmark dataset. I found it under "Already trained supervised models" in the README.

  1. https://pdfs.semanticscholar.org/c26d/b4b36419cb0b2cbe42949bc669848a181326.pdf

Cordially, Gollam Rabby Masters Student, FSKKP, UMP, Gambang, Kuantan, Malaysia

edoost commented 6 years ago

Thank you very much, but I'm actually trying to train the model on Persian.

boudinfl commented 6 years ago

Hi @edoost

I've included an example of training and testing procedure in the repo (examples/training_and_testing_a_kea_model), please have a look at it.

Concerning the format of the documents, you can use raw text, but you have to change the format and extension parameters accordingly in train_supervised_model().
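As a rough illustration of the raw-text setup, the sketch below lays out a training directory with one `.txt` file per document and a separate reference file holding the gold keyphrases. The layout is an assumption rather than something confirmed in this thread: it follows the SemEval-2010 convention of one `doc_id : keyphrase1,keyphrase2` line per document, and the helper names `write_training_data` and `read_reference` as well as the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_training_data(root, documents, gold):
    """Write one raw-text file per document plus a single gold keyphrase file."""
    root = Path(root)
    docs_dir = root / "train"
    docs_dir.mkdir(parents=True, exist_ok=True)
    for doc_id, text in documents.items():
        (docs_dir / f"{doc_id}.txt").write_text(text, encoding="utf-8")
    # Assumed reference format: "doc_id : kp1,kp2,..." with one document per line.
    # The file is kept outside docs_dir so it is not mistaken for a document.
    ref_file = root / "train.reference"
    lines = [f"{doc_id} : {','.join(kps)}" for doc_id, kps in gold.items()]
    ref_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return docs_dir, ref_file


def read_reference(ref_file):
    """Parse the reference file back into {doc_id: [keyphrases]}."""
    gold = {}
    for line in Path(ref_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        doc_id, _, kps = line.partition(":")
        gold[doc_id.strip()] = [k.strip() for k in kps.split(",") if k.strip()]
    return gold


# Hypothetical toy data; real documents would be full texts in the target language.
documents = {"doc1": "Keyphrase extraction from text.", "doc2": "Supervised models need gold data."}
gold = {"doc1": ["keyphrase extraction", "text"], "doc2": ["supervised models"]}

with tempfile.TemporaryDirectory() as tmp:
    docs_dir, ref_file = write_training_data(tmp, documents, gold)
    written_files = sorted(p.name for p in docs_dir.iterdir())
    parsed_gold = read_reference(ref_file)
```

The document directory and the reference file would then be handed to train_supervised_model() with the format and extension parameters set for raw text; the exact argument names and separators should be checked against the pke documentation and the example in the repo.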

f.

edoost commented 6 years ago

@boudinfl Another question. How should I pass my stoplist to kea?

boudinfl commented 6 years ago

Hi @edoost,

The stoplist is used to filter out spurious keyphrase candidates (i.e. n-grams beginning or ending with a stopword). You can pass your stoplist to the candidate_selection(stoplist=['a', 'the']) method.
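To show what this filtering amounts to, here is a small standalone sketch of the idea: build n-grams and drop those that begin or end with a stopword. It is not pke's implementation; the function names `ngrams` and `filter_candidates`, the whitespace tokenization, and the maximum length of 3 are assumptions made for illustration.

```python
def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram of the tokens, for n from 1 to max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tokens[i:i + n]


def filter_candidates(tokens, stoplist, max_n=3):
    """Keep n-grams that neither begin nor end with a stopword."""
    stop = {w.lower() for w in stoplist}
    kept = []
    for gram in ngrams(tokens, max_n):
        if gram[0].lower() in stop or gram[-1].lower() in stop:
            continue  # spurious candidate: a stopword at the boundary
        kept.append(" ".join(gram))
    return kept


tokens = "extraction of keyphrases from the text".split()
candidates = filter_candidates(tokens, stoplist=["of", "from", "the"])
```

A stopword inside a candidate is still allowed ("extraction of keyphrases" survives), which is why the stoplist only needs to cover words that should never start or end a keyphrase in your language.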

Please have a look at the documentation for more details: https://boudinfl.github.io/pke/build/html/supervised.html#kea

f.