Closed: edoost closed this issue 6 years ago
As far as I know, the input files are in XML format (Stanford CoreNLP output); you can also use raw text. I say this because the table of precision (P), recall (R) and F-measure (F) at the top 10 keyphrases is the average of the outputs for the XML files in "lvl-1", "lvl-2", "lvl-3" and "lvl-4" of the test data (the preprocessed SemEval-2010 benchmark dataset). I actually get similar output.
Link: https://github.com/boudinfl/semeval-2010-pre
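For reference, here is a minimal sketch of how P, R and F at the top 10 keyphrases can be computed and then averaged over several runs (for example one run per preprocessing level). It assumes exact string matching between already normalized keyphrases; the function names `evaluate_top_n` and `average_scores` are made up for illustration and are not part of pke or of the dataset's evaluation script.

```python
def evaluate_top_n(predicted, gold, n=10):
    """Return (precision, recall, f_measure) for the top-n predicted keyphrases.

    `predicted` is a ranked list of keyphrases, `gold` a list of reference
    keyphrases. Matching is exact string equality (an assumption).
    """
    top = predicted[:n]
    gold_set = set(gold)
    matches = sum(1 for phrase in top if phrase in gold_set)
    precision = matches / len(top) if top else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def average_scores(scores):
    """Average a non-empty list of (P, R, F) tuples component-wise."""
    count = len(scores)
    return tuple(sum(score[i] for score in scores) / count for i in range(3))
```

Calling `evaluate_top_n` once per document (or per level) and passing the resulting tuples to `average_scores` gives one averaged P/R/F row.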
Gold standard keyphrases are the keyphrases assigned by the authors and readers (file name: test.combined.stem.final) [1]. I am attaching the gold standard keyphrases file here (it includes both author and reader keyphrases). The supervised model is already trained on the training set of the SemEval-2010 benchmark dataset. I found it under "Already trained supervised models" in the README.
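To give an idea of what such a gold standard file looks like when read from Python, here is a small sketch. It assumes each line has the form `doc_id : keyphrase1,keyphrase2+variant,...`, with `,` separating keyphrases and `+` separating accepted variants of one keyphrase; please check this against your own file. The helper names and the sample line are hypothetical and not taken from pke.

```python
def parse_reference_line(line):
    """Split one reference line into (doc_id, list of keyphrase variant lists)."""
    doc_id, _, phrases = line.partition(" : ")
    # Each keyphrase may list several accepted variants separated by "+".
    keyphrases = [p.split("+") for p in phrases.strip().split(",") if p]
    return doc_id.strip(), keyphrases


def load_references(lines):
    """Build a {doc_id: keyphrases} dictionary from an iterable of lines."""
    references = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc_id, keyphrases = parse_reference_line(line)
        references[doc_id] = keyphrases
    return references


# Illustrative sample line (invented, stemmed-looking keyphrases).
sample = ["C-41 : grid servic,uddi+univers descript"]
references = load_references(sample)
```

With a real file you would pass the open file object instead of the `sample` list.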
Cordially, Gollam Rabby, Master's Student, FSKKP, UMP, Gambang, Kuantan, Malaysia
Thank you very much, but I'm actually trying to train the model on Persian.
Hi @edoost
I've included an example of the training and testing procedure in the repo (examples/training_and_testing_a_kea_model), please have a look at it.
Concerning the format of the document, you can use raw data, but you have to change the `format` and `extension` parameters accordingly in `train_supervised_model()`.
f.
@boudinfl Another question. How should I pass my stoplist to kea?
Hi @edoost,
The stoplist is used to filter out spurious keyphrase candidates (i.e. n-grams beginning/ending with a stopword). You can pass your stoplist to the `candidate_selection(stoplist=['a', 'the'])` method.
Please have a look at the documentation for more details: https://boudinfl.github.io/pke/build/html/supervised.html#kea
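To illustrate the filtering idea only, here is a standalone sketch that drops n-grams beginning or ending with a stopword. It is not pke's actual implementation; `ngrams`, `select_candidates` and the `max_n` parameter are names made up for this example.

```python
def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram of 1 to max_n tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tokens[i:i + n]


def select_candidates(tokens, stoplist, max_n=3):
    """Keep n-grams that neither begin nor end with a stopword."""
    stopwords = {word.lower() for word in stoplist}
    candidates = []
    for gram in ngrams(tokens, max_n):
        if gram[0].lower() in stopwords or gram[-1].lower() in stopwords:
            continue  # spurious candidate: starts or ends with a stopword
        candidates.append(" ".join(gram))
    return candidates


tokens = ["the", "extraction", "of", "keyphrases"]
candidates = select_candidates(tokens, stoplist=["the", "of"])
```

A stopword inside a candidate ("extraction of keyphrases") is still allowed; only the first and last tokens are checked. For another language, the stoplist would simply be a list of that language's stopwords.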
f.
I wonder what the format of the input file for the supervised algorithms should be. I mean, I don't know where to save the gold keyphrases or how to feed them to the model.