Closed: seansaito closed this issue 2 years ago.
Also, even if I choose the default Japanese spacy model, it fails to load:
[2022-03-08 04:24:44,928] {readers.py:65} ERROR - No spacy model for 'ja_core_news_sm' language.
[2022-03-08 04:24:44,928] {readers.py:66} ERROR - A list of available spacy models is available at https://spacy.io/models.
Sorry, I broke many things; I did a lot of refactoring to simplify further development and ease maintenance.
I think the issue comes from the fact that Japanese is missing from lang.py. I'll run some tests and get back to you.
So it seems the issue was simply the Japanese langcode missing from lang.py. It is now fixed in fede063cf89829108bdc4dd51dcffe5317151baf.
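For context, pke resolves the language code passed to load_document into a default spacy model name via a table in lang.py. The sketch below illustrates that kind of lookup; the dictionary contents and function name are assumptions, not pke's actual internals (only the 'ja' entry is taken from this thread):

```python
# Illustrative mapping from ISO 639-1 language codes to default spaCy
# model names. NOTE: a sketch of what lang.py does, not pke's real table;
# only 'ja' -> 'ja_core_news_sm' comes from this thread.
LANGCODE_TO_MODEL = {
    "en": "en_core_web_sm",
    "fr": "fr_core_news_sm",
    "ja": "ja_core_news_sm",  # the entry that was missing before the fix
}

def default_model_for(langcode):
    """Return the default spacy model name for a language code, or raise."""
    try:
        return LANGCODE_TO_MODEL[langcode]
    except KeyError:
        # Mirrors the "No spacy model for ... language." error reported above
        raise ValueError(f"No spacy model for '{langcode}' language.")
```

With the 'ja' entry absent, the lookup fails exactly the way the error log shows, which is why adding the langcode was enough to fix it.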
To test, I installed the japanese spacy model using:
python -m spacy download ja_core_news_sm
and then ran the following Python code successfully:
import pke
sample = """富士山(、英語: Mount Fuji)は、山梨県(富士吉田市、南都留郡鳴沢村)と、
静岡県(富士宮市、富士市、裾野市、御殿場市、駿東郡小山町)に跨る活火山である[注釈 3]。
標高3776.12 m、日本最高峰(剣ヶ峰)[注釈 4]の独立峰で、
その優美な風貌は日本国外でも日本の象徴として広く知られている。"""
extractor = pke.unsupervised.FirstPhrases()
extractor.load_document(input=sample, language='ja')
extractor.candidate_selection()
extractor.candidate_weighting()
print(extractor.get_n_best(n=10))
which produces
[('富士 山 (', 0), ('英語', -4), ('mount fuji )', -6), ('山梨 県 ( 富士吉田 市', -11), ('南都留 郡 鳴沢村 )', -17), ('静岡 県 ( 富士宮 市', -24), ('富士 市', -30), ('裾野 市', -33), ('御殿場 市', -36), ('駿東 郡 小山 町 )', -39)]
You should also be able to use a custom spacy model via the spacy_model parameter:
import pke
import spacy
nlp = spacy.load("your model")
extractor = pke.unsupervised.FirstPhrases()
extractor.load_document(input="some japanese text", language='ja', spacy_model=nlp)
Please let me know if this feature works (AFAIK it is untested).
Best,
f.
Thanks!
Unfortunately, a custom spacy model can fail when pke tries to add the "sentencizer" to a pipeline that already has one (which is the case for our custom Japanese model):
"""
Traceback (most recent call last):
File "/home/devuser/src/ml/keywords/keyword_extractor.py", line 94, in do_yake
extractor.load_document(input=text, language="ja_ginza", normalization=None, spacy_model=nlp)
File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/base.py", line 93, in load_document
sents = parser.read(text=input, spacy_model=spacy_model)
File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/readers.py", line 70, in read
nlp.add_pipe('sentencizer')
File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/spacy/language.py", line 771, in add_pipe
raise ValueError(Errors.E007.format(name=name, opts=self.component_names))
ValueError: [E007] 'sentencizer' already exists in pipeline. Existing names: ['tok2vec', 'parser', 'attribute_ruler', 'ner', 'morphologizer', 'compound_splitter', 'bunsetu_recognizer', 'sentencizer']
"""
Hmm, I just removed the sentencizer addition for custom models in 3cfe17b4bfb27cd5d74393dce5e0a53583b85f42.
@boudinfl Got it, thanks!
By the way, could you let me know which commit points to version 1.8.1 exactly? Want to keep this commit for the sake of backwards compatibility. Thanks a lot for looking into this!
pke 1.8.1 would be f651015f9c931cf245a753f4457bb49f0befa5fd
Hi, we're using pke for Japanese keyword extraction with a custom library (Ginza) https://megagonlabs.github.io/ginza/
Until version 1.8.1, pke worked fine. However, with the recent major release (literally hours ago), we're unable to load models or extract keywords:
Is it possible for you to provide a link to the pke 1.8.1 release? Seems like you have deleted it from this repo. Thanks!