Hi, yes you are right! #105 introduced the `spacy_model` parameter for `load_document`, but it is unusable in `compute_document_frequency`.
If you can, hack the function by adding a `spacy_model` argument at utils.py#72 and utils.py#113 (you can find the path of pke by running `import pke; print(pke.__file__)`).
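For reference, here is a minimal standalone sketch of what that hack amounts to: load the spaCy model once and forward it to every `load_document` call. The helper's name, signature, and defaults below are my own assumptions, not pke's API, but the forwarding pattern is the same one you would apply inside utils.py.

```python
import glob
import os
from collections import defaultdict

import spacy
import pke


def compute_df_with_shared_model(input_dir, spacy_model,
                                 extension="txt", language="en", n=3):
    """Hypothetical helper: n-gram document frequencies, one model load."""
    frequencies = defaultdict(int)
    for path in glob.iglob(os.path.join(input_dir, "*." + extension)):
        doc = pke.base.LoadFile()
        # spacy_model= is the parameter that #105 added to load_document;
        # forwarding a preloaded pipeline skips the per-document spacy.load().
        doc.load_document(input=path, language=language,
                          spacy_model=spacy_model)
        doc.ngram_selection(n=n)
        for candidate in doc.candidates:
            frequencies[candidate] += 1
    return frequencies


# Load the model a single time, then reuse it across all documents.
nlp = spacy.load("de_core_news_sm")
df = compute_df_with_shared_model("/path/to/corpus", spacy_model=nlp)
```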
Otherwise, you can preprocess your files using Stanford CoreNLP.
As a last resort, tell me and I'll add the parameter for good.
Hey ygorg, thank you for your prompt reply. I already had a clone of the repository and was able to implement that.
I had already tried to preprocess my text data with Stanza, the Python wrapper for Stanford CoreNLP. However, I have not found a way to export to the XML format. Do you have a tip?
PS: I am currently working on a master's thesis on KE for the German language. The package you developed is a great help to me, thank you very much!
Great!
The tool we used in ake-dataset was the Java version of CoreNLP, which outputs the XML. I should look into Stanza to see whether its models are the same as the Java version's. But generally, the way we input documents into pke should be revised.
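In the meantime, here is a rough sketch of one way to bridge the gap: annotate with Stanza and serialize to CoreNLP-style XML by hand. The element names below follow CoreNLP's XML output (sentences/sentence/tokens/token with word, lemma, POS, and character offsets), but that schema is my assumption here; double-check it against pke's XML reader before relying on it.

```python
import stanza
from xml.etree import ElementTree as ET

# German pipeline with the annotations pke expects (tokens, POS, lemmas).
nlp = stanza.Pipeline("de", processors="tokenize,pos,lemma")


def to_corenlp_xml(text):
    """Serialize a Stanza annotation as CoreNLP-style XML (assumed schema)."""
    root = ET.Element("root")
    sentences = ET.SubElement(ET.SubElement(root, "document"), "sentences")
    for i, sentence in enumerate(nlp(text).sentences, start=1):
        sent_el = ET.SubElement(sentences, "sentence", id=str(i))
        tokens_el = ET.SubElement(sent_el, "tokens")
        for j, token in enumerate(sentence.tokens, start=1):
            word = token.words[0]  # German rarely splits multi-word tokens
            tok_el = ET.SubElement(tokens_el, "token", id=str(j))
            ET.SubElement(tok_el, "word").text = token.text
            ET.SubElement(tok_el, "lemma").text = word.lemma
            ET.SubElement(tok_el, "POS").text = word.xpos
            ET.SubElement(tok_el, "CharacterOffsetBegin").text = str(token.start_char)
            ET.SubElement(tok_el, "CharacterOffsetEnd").text = str(token.end_char)
    return ET.tostring(root, encoding="unicode")


with open("document.xml", "w", encoding="utf-8") as f:
    f.write(to_corenlp_xml("Das ist ein Beispieltext. Er hat zwei Sätze."))
```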
So nice to hear about people working on KE for non-English languages! Good luck with your master's!!
I wanted to calculate the TF for a dataset of 12,000 documents, which would have taken 12 hours. The weak point is the repeated loading of the spaCy model, so a parameter in the `compute_document_frequency` function would significantly reduce the runtime.
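For concreteness, the per-document cost is easy to measure; a quick back-of-the-envelope sketch (the model name is only an example, and timings vary by machine):

```python
import time
import spacy

# Time a single model load: the cost currently paid once per document.
start = time.perf_counter()
nlp = spacy.load("de_core_news_sm")
load_time = time.perf_counter() - start

print(f"one spacy.load() call: {load_time:.1f}s")
# At ~3.6s per load, 12,000 loads cost 12,000 * 3.6s = 43,200s = 12 hours,
# which matches the reported runtime; loading once removes almost all of it.
print(f"12,000 repeated loads ~ {12000 * load_time / 3600:.1f} hours")
```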