boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

issue about WINGNUS and Kea #38

Closed Uotas closed 6 years ago

Uotas commented 6 years ago

I tried to use WINGNUS to train a model on SemEval-2010; however, none of the candidates selected by candidate_selection() appear in the reference file. For example, the reference keyphrases for C-41.txt are:

    C-41 : adapt resourc manag,distribut real-time embed system,end-to-end qualiti of servic+servic end-to-end qualiti,hybrid adapt resourcemanag middlewar,hybrid control techniqu,real-time video distribut system,real-time corba specif,video encod/decod,resourc reserv mechan,dynam environ,stream servic,distribut real-time emb system,hybrid system,qualiti of servic+servic qualiti

but the candidates selected by candidate_selection() are:

    differ execut, variou class of applic, qo of best-effort, rate with hyarm without, system architectur wireless, network bandwidth, endto-end real-tim qo, end-to-end qualiti of servic, dynam resourc, middlewar support in wide-area, system util, hyarm without, hyarm uav1 qo, receiv uav camera, system case, function of hyarm, natarajan, charter, system architectur, network with limit bandwidth, mpeg1 mpeg4 real, period of time, ... and so on

As a result, when the code labels whether the candidates are true keyphrases, they are all labeled 0:

    for candidate in model.instances:
        if candidate in references[doc_id]:
            training_classes.append(1)
        else:
            training_classes.append(0)
        training_instances.append(model.instances[candidate])

I have not modified the code of candidate_selection(), and the same problem appears when I use Kea. Could you please tell me why?

candidate_selection():

    def candidate_selection(self,
                            NP='^((JJ|NN) ){,2}NN$',
                            NP_IN_NP='^((JJ|NN) )?NN IN ((JJ|NN) )?NN$'):
        self.ngram_selection(n=4)
        self.candidate_filtering(stoplist=list(string.punctuation) +
                                 ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-'])
        for k, v in self.candidates.items():
            valid_surface_forms = []
            for i in range(len(v.pos_patterns)):
                pattern = ' '.join([u[:2] for u in v.pos_patterns[i]])
                if re.search(NP, pattern) or re.search(NP_IN_NP, pattern):
                    valid_surface_forms.append(i)
            if not valid_surface_forms:
                del self.candidates[k]
            else:
                self.candidates[k].surface_forms = [v.surface_forms[i] for i in valid_surface_forms]
                self.candidates[k].offsets = [v.offsets[i] for i in valid_surface_forms]
                self.candidates[k].pos_patterns = [v.pos_patterns[i] for i in valid_surface_forms]

Thanks a lot!!

boudinfl commented 6 years ago

Hi @Uotas

If I understand correctly, the problem seems to be that there are no gold references among the keyphrase candidates. The issue may come from the candidate selection method that you use; you can try extracting n-grams instead. You should also check whether the references actually occur in the document.
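One way to run that check is sketched below. It parses a reference line in the `doc_id : kp1,kp2,...` layout quoted above and intersects the gold keyphrases with a candidate list. The helper names `parse_reference_line` and `reference_coverage` are hypothetical (they are not part of pke), and treating `+` as a separator between alternative forms of the same keyphrase is an assumption about the reference file.

```python
def parse_reference_line(line):
    """Split 'doc_id : kp1,kp2+variant,...' into (doc_id, set of keyphrases)."""
    doc_id, _, keyphrases = line.partition(" : ")
    gold = set()
    for entry in keyphrases.split(","):
        # Assumption: '+' separates alternative forms of one keyphrase.
        for variant in entry.split("+"):
            variant = variant.strip()
            if variant:
                gold.add(variant)
    return doc_id.strip(), gold


def reference_coverage(gold, candidates):
    """Return the gold keyphrases that also appear among the candidates."""
    return gold & set(candidates)


# Shortened reference line and candidate list taken from the example above.
line = ("C-41 : adapt resourc manag,"
        "end-to-end qualiti of servic+servic end-to-end qualiti,"
        "hybrid system")
doc_id, gold = parse_reference_line(line)
candidates = ["network bandwidth", "end-to-end qualiti of servic", "system util"]
matched = reference_coverage(gold, candidates)

print(doc_id, len(gold), sorted(matched))
```

If the intersection is empty for every document, it may be worth comparing how the references and the candidates are stemmed and tokenized before looking at the classifier.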

Florian

Uotas commented 6 years ago

Hi @boudinfl

Yes, that's my problem. I tried extracting n-grams, but it doesn't work. I know that about 15% of the gold keyphrases do not appear in the text of the papers (according to the paper 'SemEval-2010 Task 5: Automatic Keyphrase Extraction from scientific articles'). The readme.md reports an F1-score of about 0.2 for WINGNUS, so which candidate selection method was used to achieve this result?

Yuxia

boudinfl commented 6 years ago

The candidate selection approaches in Kea and Wingnus are different. In Kea, all n-grams are selected as candidates while in Wingnus only the simplex noun phrases are selected. This is what is performed if you use the default candidate_selection() method for both models, and this is what I used for benchmarking the two methods.
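The noun-phrase filter can be illustrated in isolation. The sketch below reuses the two regular expressions from the candidate_selection() code quoted earlier in this thread and applies them to hand-written POS tag sequences. The function name `is_wingnus_style_candidate` and the example tags are hypothetical, and this is not the pke implementation itself.

```python
import re

# Patterns copied from the candidate_selection() code quoted in this issue.
NP = r'^((JJ|NN) ){,2}NN$'
NP_IN_NP = r'^((JJ|NN) )?NN IN ((JJ|NN) )?NN$'


def is_wingnus_style_candidate(pos_tags):
    """Return True if the tag sequence matches one of the two NP patterns."""
    # Tags are truncated to two characters, so NNS and NNP collapse to NN.
    pattern = ' '.join(tag[:2] for tag in pos_tags)
    return bool(re.search(NP, pattern) or re.search(NP_IN_NP, pattern))


# Hand-written tag sequences, for illustration only.
examples = {
    "network bandwidth": ["NN", "NN"],
    "quality of service": ["NN", "IN", "NN"],
    "hybrid control techniques": ["JJ", "NN", "NNS"],
    "receive video": ["VB", "NN"],
}

for phrase, tags in examples.items():
    print(phrase, is_wingnus_style_candidate(tags))
```

An n-gram selector in the style of Kea would keep all four phrases, whereas this filter rejects the one that starts with a verb.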

I will commit an example of training/testing procedure in the repo asap.

f.

Uotas commented 6 years ago

Hi @boudinfl

I used the example code from the issue 'How to set up a training set in a supervised method?' to test the provided models kea-semeval2010.pickle and WINGNUS-semeval2010.pickle, and I get the correct results. Now I have a question: must the data format be 'corenlp' when training the model? That may be the answer, because I used 'raw' data (.txt) to train my model.

yuxia

boudinfl commented 6 years ago

I've included an example of training and testing procedure in the repo (examples/training_and_testing_a_kea_model), please have a look at it.

Also, please update to the latest version of pke by running: pip install -U git+https://github.com/boudinfl/pke.git

Concerning the format of the document, you can use raw data, but you have to change the format and extension parameters accordingly in train_supervised_model().

f.

Uotas commented 6 years ago

Hi,

It is very kind of you to provide the training and testing examples. However, when I use the training example to train a model with the default code, the problem remains: there are no gold references among the candidate keyphrases.

Here is my training procedure.

```python
import os
import sys
import codecs
import logging

import pke

logging.basicConfig(level=logging.INFO)

# Raw strings keep the Windows backslashes literal; in a normal string,
# the "\t" in "\train" would be read as a tab character.
input_dir = r'D:\program\pke-master\examples\semeval2010\train'

reference_file = r"D:\program\pke-master\examples\semeval2010\trainanswer\train.combined.stem.final"

# Load the document frequency counts.
df_file = r"D:\program\pke-master\pke\models\train-lvl-1-df.tsv"
logging.info('Loading df counts from {}'.format(df_file))
df_counts = pke.load_document_frequency_file(input_file=df_file,
                                             delimiter='\t')

output_mdl = "Kea_result.pickle"

# Train a Kea model on raw .txt documents.
pke.train_supervised_model(input_dir=input_dir,
                           reference_file=reference_file,
                           model_file=output_mdl,
                           df=df_counts,
                           format="raw",
                           use_lemmas=False,
                           stemmer="porter",
                           model=pke.supervised.Kea(),
                           language='english',
                           extension="txt")
```
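A side note on the Windows paths in this script: if they are written as ordinary string literals with single backslashes, Python reads `\t` as a tab, so a path ending in `\train` would not point at the intended directory. Whether that applies here depends on how the paths were typed in the real script, which this thread does not show. The short sketch below, using a shortened hypothetical path, demonstrates the difference.

```python
plain = "D:\train"   # "\t" is interpreted as a TAB character
raw = r"D:\train"    # the raw string keeps the backslash and the "t"

print(repr(plain))
print(repr(raw))
print("\t" in plain, "\t" in raw)
```

Raw strings, doubled backslashes, or forward slashes all avoid the problem.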

After finishing training, there is a warning.

    C:\Users\User0\Anaconda3\envs\python27\lib\site-packages\sklearn\naive_bayes.py:461: RuntimeWarning: divide by zero encountered in log
      self.class_log_prior_ = (np.log(self.class_count_) -

I don't know what is wrong.
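One possible reading of this warning, offered as an assumption rather than a diagnosis: the naive Bayes code takes the logarithm of the per-class counts, so a class with zero training instances leads to a log of zero. That would be consistent with every candidate being labeled 0. The sketch below counts the labels before any model is fitted; `check_label_balance` is a hypothetical helper, and `training_classes` stands in for the label list built in the labeling loop quoted earlier in this thread.

```python
from collections import Counter


def check_label_balance(training_classes):
    """Count the 0/1 labels and list any class that has no instances."""
    counts = Counter(training_classes)
    missing = [label for label in (0, 1) if counts[label] == 0]
    return counts, missing


# Hypothetical labels: no candidate matched a gold keyphrase.
training_classes = [0, 0, 0, 0]
counts, missing = check_label_balance(training_classes)

if missing:
    print("No training instances for class(es):", missing)
```

If class 1 turns out to be empty, the thing to investigate is the matching between candidates and references, not the classifier.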

Yuxia

boudinfl commented 6 years ago

The error comes from scikit-learn. Can you send me an archive with your code/data so I can try to reproduce your issue?

Thanks

Uotas commented 6 years ago

Hi,

I have sent it to your email address, but I don't know whether your mail server received it.

Yuxia