Not able to load custom language

seansaito commented 2 years ago

Hi, we're using pke for Japanese keyword extraction with a custom library (Ginza) https://megagonlabs.github.io/ginza/

Until version 1.8.1, pke worked fine. However, with the recent major release (literally hours ago), we're unable to load the model or extract keywords:

[2022-03-08 04:02:13,409] {readers.py:65} ERROR - No spacy model for 'ja_ginza' language.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/base.py", line 117, in load_document
    for i, sentence in enumerate(self.sentences):
TypeError: 'NoneType' object is not iterable
"""
[2022-03-08 04:02:13,408] {readers.py:65} ERROR - No spacy model for 'ja_ginza' language.
[2022-03-08 04:02:13,408] {readers.py:65} ERROR - No spacy model for 'ja_ginza' language.

Is it possible for you to provide a link to the pke 1.8.1 release? Seems like you have deleted it from this repo. Thanks!

seansaito commented 2 years ago

Also, even if I choose the default Japanese spaCy model, it fails to load:

[2022-03-08 04:24:44,928] {readers.py:65} ERROR - No spacy model for 'ja_core_news_sm' language.
[2022-03-08 04:24:44,928] {readers.py:66} ERROR - A list of available spacy models is available at https://spacy.io/models.

boudinfl commented 2 years ago

Sorry, I broke many things while doing a lot of refactoring to simplify further development and ease maintenance.

I think the issue comes from the fact that Japanese is missing from lang.py. I'll run some tests and get back to you.
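
The failure in the logs above appears to follow a common pattern: a lookup finds no model, an error is logged, and None flows on until something tries to iterate over it. As a minimal sketch (not pke's actual code), assuming a hypothetical LANGCODE_TO_MODEL table and resolve_model_name helper, a lookup that raises immediately makes a missing language code easier to spot:

```python
# Hypothetical mapping from language codes to spaCy model names.
# The real contents of pke's lang.py may differ; this is an illustration only.
LANGCODE_TO_MODEL = {
    "en": "en_core_web_sm",
    "fr": "fr_core_news_sm",
}


def resolve_model_name(language, mapping=LANGCODE_TO_MODEL):
    """Return the model name for `language`, raising if it is not registered."""
    try:
        return mapping[language]
    except KeyError:
        known = ", ".join(sorted(mapping))
        # Fail at the lookup instead of returning None to the caller.
        raise ValueError(
            f"No spaCy model registered for language {language!r} (known: {known})"
        ) from None


# Registering the missing code makes the lookup succeed.
LANGCODE_TO_MODEL["ja"] = "ja_core_news_sm"
print(resolve_model_name("ja"))
```

With this shape, an unregistered code such as 'ja_ginza' would surface as a ValueError at the lookup rather than as a TypeError further down the call chain.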

boudinfl commented 2 years ago

So it seems that the issue was simply the Japanese language code missing from lang.py. It is now fixed in fede063cf89829108bdc4dd51dcffe5317151baf

To test, I installed the Japanese spaCy model using:

python -m spacy download ja_core_news_sm

and then ran the following Python code successfully:

import pke

sample = """富士山（、英語: Mount Fuji）は、山梨県（富士吉田市、南都留郡鳴沢村）と、
静岡県（富士宮市、富士市、裾野市、御殿場市、駿東郡小山町）に跨る活火山である[注釈 3]。
標高3776.12 m、日本最高峰（剣ヶ峰）[注釈 4]の独立峰で、
その優美な風貌は日本国外でも日本の象徴として広く知られている。"""

extractor = pke.unsupervised.FirstPhrases()
extractor.load_document(input=sample, language='ja')
extractor.candidate_selection()
extractor.candidate_weighting()
print(extractor.get_n_best(n=10))

which produces:

[('富士 山 （', 0), ('英語', -4), ('mount fuji ）', -6), ('山梨 県 （ 富士吉田 市', -11), ('南都留 郡 鳴沢村 ）', -17), ('静岡 県 （ 富士宮 市', -24), ('富士 市', -30), ('裾野 市', -33), ('御殿場 市', -36), ('駿東 郡 小山 町 ）', -39)]

You should also be able to use a custom spaCy model via the spacy_model parameter:

import pke
import spacy

nlp = spacy.load("your model")
extractor = pke.unsupervised.FirstPhrases()
extractor.load_document(input="some japanese text", language='ja', spacy_model=nlp)

Please let me know if this feature works (AFAIK it is untested).

Best,

f.

seansaito commented 2 years ago

Thanks!

Unfortunately, custom spaCy models can fail when pke tries to add the "sentencizer" to the pipeline of a model that already has one (which is the case for our custom Japanese model):

"""
Traceback (most recent call last):
  File "/home/devuser/src/ml/keywords/keyword_extractor.py", line 94, in do_yake
    extractor.load_document(input=text, language="ja_ginza", normalization=None, spacy_model=nlp)
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/base.py", line 93, in load_document
    sents = parser.read(text=input, spacy_model=spacy_model)
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/readers.py", line 70, in read
    nlp.add_pipe('sentencizer')
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/spacy/language.py", line 771, in add_pipe
    raise ValueError(Errors.E007.format(name=name, opts=self.component_names))
ValueError: [E007] 'sentencizer' already exists in pipeline. Existing names: ['tok2vec', 'parser', 'attribute_ruler', 'ner', 'morphologizer', 'compound_splitter', 'bunsetu_recognizer', 'sentencizer']
"""

boudinfl commented 2 years ago

Hum, I just removed the sentencizer for custom models in 3cfe17b4bfb27cd5d74393dce5e0a53583b85f42
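
For callers who prepare their own pipeline, one alternative to removing the sentencizer outright is to add it only when it is missing. The sketch below is not the change made in the commit above; ensure_sentencizer is a hypothetical helper, and it assumes a spaCy v3 Language object, whose component_names list is what the E007 check in the traceback reports.

```python
def ensure_sentencizer(nlp):
    """Add spaCy's rule-based 'sentencizer' only if the pipeline lacks one.

    `nlp` is assumed to be a spaCy v3 Language object (for example the result
    of spacy.load(...)); only its `component_names` and `add_pipe` members
    are used here.
    """
    # component_names also lists disabled components, which add_pipe would
    # reject as duplicates with E007, so it is the safer list to check.
    if "sentencizer" not in nlp.component_names:
        nlp.add_pipe("sentencizer")
    return nlp


# Usage (assumes spaCy v3 and the custom model are installed):
#   import spacy
#   nlp = ensure_sentencizer(spacy.load("ja_ginza"))
```

Calling the helper twice on the same pipeline is harmless, since the second call finds the component already registered and does nothing.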

seansaito commented 2 years ago

@boudinfl Got it, thanks!

By the way, could you let me know exactly which commit corresponds to version 1.8.1? I'd like to pin that commit for backwards compatibility. Thanks a lot for looking into this!

boudinfl commented 2 years ago

pke 1.8.1 would be f651015f9c931cf245a753f4457bb49f0befa5fd

boudinfl / pke

Not able to load custom language #184