Closed: Uotas closed this issue 6 years ago
Hi @Uotas
If I understand correctly, the problem seems to be that there are no gold references within the keyphrase candidates. The issue may come from the candidate selection method that you use; you can try extracting n-grams instead. You should also check whether the references actually occur in the document.
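A quick way to do that check, as a minimal sketch outside pke (the document path and the short reference list below are placeholders; stemming follows the Porter stemmer used elsewhere in this thread):

```python
# Minimal sketch (not pke code): check how many stemmed gold references
# actually occur in a raw document. The path 'C-41.txt' and the reference
# list are placeholders for illustration only.
import codecs
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

with codecs.open('C-41.txt', 'r', 'utf-8') as f:   # placeholder document
    words = re.findall(r'\w+', f.read().lower())
stemmed_text = ' '.join(stemmer.stem(w) for w in words)

references = ['adapt resourc manag', 'hybrid control techniqu']  # placeholder gold keyphrases
missing = [ref for ref in references if ref not in stemmed_text]
print('{} of {} references do not occur in the document'.format(len(missing), len(references)))
```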
Florian
Hi @boudinfl
Yes, that's my problem. I tried extracting n-grams, but it doesn't work, and I learned from the paper 'SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles' that about 15% of the gold keyphrases don't appear in the text. The readme.md shows that the F1-score of WINGNUS is about 0.2, so which candidate selection method is used to achieve this result?
Yuxia
The candidate selection approaches in Kea and WINGNUS are different: in Kea, all n-grams are selected as candidates, while in WINGNUS only simplex noun phrases are selected. This is what is performed if you use the default candidate_selection() method for both models, and this is what I used for benchmarking the two methods.
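To make the difference concrete, here is a small self-contained sketch (illustrative only, not the actual pke implementation) contrasting plain n-gram selection with a simplex noun phrase filter like the NP pattern used in WINGNUS:

```python
# Illustrative sketch only (not pke code): Kea-style selection keeps n-grams
# regardless of their POS tags, while WINGNUS-style selection keeps only those
# whose POS sequence matches a simplex noun phrase pattern such as
# '^((JJ|NN) ){,2}NN$'.
import re

NP = '^((JJ|NN) ){,2}NN$'
tagged = [('adaptive', 'JJ'), ('resource', 'NN'), ('management', 'NN'),
          ('of', 'IN'), ('systems', 'NNS')]

def ngrams(tokens, n=3):
    # yield all contiguous n-grams of length 1..n
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + n, len(tokens)) + 1):
            yield tokens[i:j]

kea_like = [' '.join(w for w, _ in g) for g in ngrams(tagged)]
wingnus_like = [' '.join(w for w, _ in g) for g in ngrams(tagged)
                if re.search(NP, ' '.join(t[:2] for _, t in g))]

print(kea_like)      # all n-grams, including 'of systems', 'management of', ...
print(wingnus_like)  # only noun-phrase-shaped candidates, e.g. 'adaptive resource management'
```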
I will commit an example of training/testing procedure in the repo asap.
f.
Hi @boudinfl
I used the example code in the issue 'How to set up a training set in a supervised method?' to test the provided models kea-semeval2010.pickle and WINGNUS-semeval2010.pickle, and I get the correct results. Now I have a question: does the data format have to be 'corenlp' when training the model? Maybe that's the answer, because I used 'raw' data (.txt) to train my model.
yuxia
I've included an example of the training and testing procedure in the repo (examples/training_and_testing_a_kea_model); please have a look at it.
Also, please update to the latest version of pke by running:
pip install -U git+https://github.com/boudinfl/pke.git
Concerning the format of the document, you can use raw data, but you have to change the format and extension parameters accordingly in train_supervised_model().
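For example, a hedged sketch of how the call could be parameterized for preprocessed CoreNLP XML input (the argument names match the call shown later in this thread; the file paths are placeholders):

```python
# Sketch only: paths are placeholders, not files from this thread. The point
# is that the format and extension arguments must change together with the
# type of input documents.
import pke

df_counts = pke.load_document_frequency_file(input_file='df.tsv', delimiter='\t')

pke.train_supervised_model(input_dir='train/',                 # folder of input documents
                           reference_file='references.stem',   # gold keyphrases
                           model_file='kea.pickle',
                           df=df_counts,
                           model=pke.supervised.Kea(),
                           format='corenlp',  # 'raw' when the inputs are plain text
                           extension='xml',   # 'txt' when the inputs are plain text
                           use_lemmas=False,
                           stemmer='porter',
                           language='english')
```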
f.
Hi,
It is very kind of you to provide the training and testing examples. However, when I use the training example to train a model with the default code, it still has the same problem: no gold references among the candidate keyphrases.
Here is my training procedure:
```python
import os
import sys
import codecs
import logging
import pke

logging.basicConfig(level=logging.INFO)

# Raw strings avoid backslash escapes (e.g. '\t') in the Windows paths.
input_dir = r'D:\program\pke-master\examples\semeval2010\train'
reference_file = r"D:\program\pke-master\examples\semeval2010\trainanswer\train.combined.stem.final"

# Document frequency counts used for the TF-IDF feature.
df_file = r"D:\program\pke-master\pke\models\train-lvl-1-df.tsv"
logging.info('Loading df counts from {}'.format(df_file))
df_counts = pke.load_document_frequency_file(input_file=df_file, delimiter='\t')

output_mdl = "Kea_result.pickle"

pke.train_supervised_model(input_dir=input_dir, reference_file=reference_file,
                           model_file=output_mdl, df=df_counts, format="raw",
                           use_lemmas=False, stemmer="porter",
                           model=pke.supervised.Kea(), language='english',
                           extension="txt")
```
After training finishes, there is a warning:
C:\Users\User0\Anaconda3\envs\python27\lib\site-packages\sklearn\naive_bayes.py:461: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = (np.log(self.class_count_) -
I don't know what is wrong.
Yuxia
The error comes from scikit-learn. Can you send me an archive with your code/data so I can try to reproduce your issue?
Thanks
Hi,
I have sent it to your email, but I don't know whether it was received by your mail server.
Yuxia
I tried to use WINGNUS to train a model on SemEval-2010; however, none of the candidates selected by candidate_selection() appear in the reference file. For example, the reference keyphrases for C-41.txt are:
C-41 : adapt resourc manag,distribut real-time embed system,end-to-end qualiti of servic+servic end-to-end qualiti,hybrid adapt resourcemanag middlewar,hybrid control techniqu,real-time video distribut system,real-time corba specif,video encod/decod,resourc reserv mechan,dynam environ,stream servic,distribut real-time emb system,hybrid system,qualiti of servic+servic qualiti
but the candidates selected by candidate_selection() are: differ execut, variou class of applic, qo of best-effort, rate with hyarm without, system architectur wireless, network bandwidth, endto-end real-tim qo, end-to-end qualiti of servic, dynam resourc, middlewar support in wide-area, system util, hyarm without, hyarm uav1 qo, receiv uav camera, system case, function of hyarm, natarajan, charter, system architectur, network with limit bandwidth, mpeg1 mpeg4 real, period of time... and so on.
As a result, when it labels whether the candidates are true keyphrases, they are all labeled 0:

```python
for candidate in model.instances:
    if candidate in references[doc_id]:
        training_classes.append(1)
    else:
        training_classes.append(0)
    training_instances.append(model.instances[candidate])
```
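This all-zero labelling would also explain the divide-by-zero warning above: with only the negative class present, scikit-learn's naive Bayes takes log(0) for the empirical prior of the positive class. A minimal standalone sketch (not pke's exact call, just the mechanism) that reproduces the warning:

```python
# Standalone sketch (not pke code): training a naive Bayes classifier that
# expects two classes on labels that are all 0 gives class_count_ = [N, 0],
# so np.log(class_count_) hits log(0) and emits the RuntimeWarning above.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.random.randint(0, 5, size=(10, 4)).astype(float)  # 10 candidates, 4 features
y = np.zeros(10)                                          # every candidate labelled 0

clf = MultinomialNB()
clf.partial_fit(X, y, classes=[0, 1])  # "divide by zero encountered in log"
```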
I have not modified the code of candidate_selection(), and the same problem appears when I use Kea. Could you please tell me why? Here is candidate_selection():

```python
def candidate_selection(self, NP='^((JJ|NN) ){,2}NN$',
                        NP_IN_NP='^((JJ|NN) )?NN IN ((JJ|NN) )?NN$'):
    self.ngram_selection(n=4)
    self.candidate_filtering(stoplist=list(string.punctuation) +
                             ['-lrb-', '-rrb-', '-lcb-', '-rcb-',
                              '-lsb-', '-rsb-'])
    for k, v in self.candidates.items():
        valid_surface_forms = []
        for i in range(len(v.pos_patterns)):
            pattern = ' '.join([u[:2] for u in v.pos_patterns[i]])
            if re.search(NP, pattern) or re.search(NP_IN_NP, pattern):
                valid_surface_forms.append(i)
        if not valid_surface_forms:
            del self.candidates[k]
        else:
            self.candidates[k].surface_forms = [v.surface_forms[i]
                                                for i in valid_surface_forms]
            self.candidates[k].offsets = [v.offsets[i]
                                          for i in valid_surface_forms]
            self.candidates[k].pos_patterns = [v.pos_patterns[i]
                                               for i in valid_surface_forms]
```
Thanks a lot!!