Hi, yes you are right! #105 introduced the `spacy_model` parameter for `load_document`, but it is unusable in `compute_document_frequency`.
If you can, hack the function by adding a `spacy_model` argument at utils.py#72 and utils.py#113 (you can find the path of pke by running `import pke; print(pke.__file__)`).
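For reference, here is a minimal standalone sketch of what that hack amounts to: load the spaCy model once and forward it to every `load_document` call. The helper's name, signature, and defaults below are my own assumptions, not pke's API, but the forwarding pattern is the same one you would apply inside utils.py.

```python
import glob
import os
from collections import defaultdict

import spacy
import pke


def compute_df_with_shared_model(input_dir, spacy_model,
                                 extension="txt", language="en", n=3):
    """Hypothetical helper: n-gram document frequencies, one model load."""
    frequencies = defaultdict(int)
    for path in glob.iglob(os.path.join(input_dir, "*." + extension)):
        doc = pke.base.LoadFile()
        # spacy_model= is the parameter that #105 added to load_document;
        # forwarding a preloaded pipeline skips the per-document spacy.load().
        doc.load_document(input=path, language=language,
                          spacy_model=spacy_model)
        doc.ngram_selection(n=n)
        for candidate in doc.candidates:
            frequencies[candidate] += 1
    return frequencies


# Load the model a single time, then reuse it across all documents.
nlp = spacy.load("de_core_news_sm")
df = compute_df_with_shared_model("/path/to/corpus", spacy_model=nlp)
```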
Otherwise, you can preprocess your files using Stanford CoreNLP.
As a last resort, tell me and I'll add the parameter for good.
Hey ygorg, thank you for your prompt reply. I already had a clone of the repository and was able to implement that.
I had already tried to preprocess my text data with Stanza, the Python wrapper for Stanford CoreNLP. However, I have not found a way to export to the XML format. Do you have a tip?
PS: I am currently working on a master's thesis on KE for the German language. The package you developed is a great help to me, thank you very much!
Great!
The tool we used in ake-dataset was the Java version of CoreNLP, which outputs the XML. I should look into Stanza to see whether its models are the same as the Java version's. But generally, the way we input documents into pke should be revised.
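In the meantime, here is a rough sketch of one way to bridge the gap: annotate with Stanza and serialize to CoreNLP-style XML by hand. The element names below follow CoreNLP's XML output (sentences/sentence/tokens/token with word, lemma, POS, and character offsets), but that schema is my assumption here; double-check it against pke's XML reader before relying on it.

```python
import stanza
from xml.etree import ElementTree as ET

# German pipeline with the annotations pke expects (tokens, POS, lemmas).
nlp = stanza.Pipeline("de", processors="tokenize,pos,lemma")


def to_corenlp_xml(text):
    """Serialize a Stanza annotation as CoreNLP-style XML (assumed schema)."""
    root = ET.Element("root")
    sentences = ET.SubElement(ET.SubElement(root, "document"), "sentences")
    for i, sentence in enumerate(nlp(text).sentences, start=1):
        sent_el = ET.SubElement(sentences, "sentence", id=str(i))
        tokens_el = ET.SubElement(sent_el, "tokens")
        for j, token in enumerate(sentence.tokens, start=1):
            word = token.words[0]  # German rarely splits multi-word tokens
            tok_el = ET.SubElement(tokens_el, "token", id=str(j))
            ET.SubElement(tok_el, "word").text = token.text
            ET.SubElement(tok_el, "lemma").text = word.lemma
            ET.SubElement(tok_el, "POS").text = word.xpos
            ET.SubElement(tok_el, "CharacterOffsetBegin").text = str(token.start_char)
            ET.SubElement(tok_el, "CharacterOffsetEnd").text = str(token.end_char)
    return ET.tostring(root, encoding="unicode")


with open("document.xml", "w", encoding="utf-8") as f:
    f.write(to_corenlp_xml("Das ist ein Beispieltext. Er hat zwei Sätze."))
```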
So nice to hear about people working on KE for non-English languages! Good luck with your master's!!
I wanted to calculate the TF for a dataset of 12,000 documents, which would have taken 12 hours. The weak point is the repeated loading of the spaCy model, so a parameter in the `compute_document_frequency` function would significantly reduce the runtime.
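For concreteness, the per-document cost is easy to measure; a quick back-of-the-envelope sketch (the model name is only an example, and timings vary by machine):

```python
import time
import spacy

# Time a single model load: the cost currently paid once per document.
start = time.perf_counter()
nlp = spacy.load("de_core_news_sm")
load_time = time.perf_counter() - start

print(f"one spacy.load() call: {load_time:.1f}s")
# At ~3.6s per load, 12,000 loads cost 12,000 * 3.6s = 43,200s = 12 hours,
# which matches the reported runtime; loading once removes almost all of it.
print(f"12,000 repeated loads ~ {12000 * load_time / 3600:.1f} hours")
```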