DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

ValueError: Cannot get dimension 'nO' for model 'sparse_linear': value unset #165

Closed: kaiyungtan closed this issue 1 year ago

kaiyungtan commented 3 years ago

Hi, I am trying out pytextrank for extractive summarization. I used the example code provided, but it didn't work.

The error comes from this code:

# add PyTextRank to the spaCy pipeline
nlp.add_pipe('textrank', last=True)

ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, textcat_multilabel, en.lemmatizer

So I edited the code as follows:

# add PyTextRank to the spaCy pipeline
nlp.add_pipe('textcat','textrank', last=True)

ValueError: Cannot get dimension 'nO' for model 'sparse_linear': value unset

I checked nlp.pipe_names:

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'textrank']

and my spacy version and details:

spaCy version 3.0.0rc5
Platform Linux-4.14.225-121.362.amzn1.x86_64-x86_64-with-glibc2.9 Python version 3.6.13
Pipelines en_core_web_sm (3.0.0)

Do you know how I could solve this issue?

Thanks

louisguitton commented 3 years ago

Hi @kaiyungtan, thanks for checking out pytextrank.

From the error message you posted, I can tell that textrank is missing from the "Available factories: ..." list. This means you're probably missing an import. In particular, don't forget this line:

import pytextrank

In the docs https://derwen.ai/docs/ptr/explain_summ/ you can find the reproducible snippet of code, which I will add here for reference:

import spacy
nlp = spacy.load("en_core_web_sm")

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

import pytextrank

nlp.add_pipe("textrank", last=True)
doc = nlp(text)

# I add this line to show how to get the summary
summary = list(doc._.textrank.summary(limit_phrases=3, limit_sentences=4, preserve_order=False))

This should solve your issue.

Regarding your second try with nlp.add_pipe('textcat','textrank', last=True): the call is accepted, but it is not doing what you expect. According to the spaCy docs, Language.add_pipe takes two positional arguments: factory_name and name. You are effectively calling nlp.add_pipe(factory_name='textcat', name='textrank', last=True), which adds a text classification component (built by the factory named "textcat", see docs) and then labels it "textrank". That is why 'textrank' shows up in nlp.pipe_names, but the component behind that name is not the one from the pytextrank library that does the extractive summarisation you're after.
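To make the factory_name versus name distinction above concrete, here is a toy model in plain Python. It is not spaCy's implementation: FACTORIES, register_factory, ToyPipeline and simulate_import_pytextrank are all hypothetical names invented for this sketch. It only illustrates two ideas, under that assumption: a factory becomes available when the code that registers it runs (which is what importing pytextrank is expected to trigger), and the second argument to add_pipe is just a label.

```python
# Toy model only: none of this is spaCy's code, and every name here is hypothetical.
FACTORIES = {}  # factory name -> zero-argument builder


def register_factory(factory_name):
    """Decorator: record a builder under `factory_name` at definition time."""
    def decorator(builder):
        FACTORIES[factory_name] = builder
        return builder
    return decorator


class ToyPipeline:
    """Holds (label, component) pairs, loosely like a pipeline object."""

    def __init__(self):
        self.components = []

    def add_pipe(self, factory_name, name=None):
        # The first argument selects WHAT gets built ...
        if factory_name not in FACTORIES:
            known = ", ".join(sorted(FACTORIES))
            raise ValueError(
                f"Can't find factory for '{factory_name}'. Available: {known}"
            )
        component = FACTORIES[factory_name]()
        # ... the second argument is only the label it is stored under.
        self.components.append((name or factory_name, component))
        return component

    @property
    def pipe_names(self):
        return [label for label, _ in self.components]


@register_factory("textcat")
def build_text_classifier():
    return "text classifier"


def simulate_import_pytextrank():
    """Stand-in for the registration side effect of `import pytextrank`."""
    @register_factory("textrank")
    def build_textrank():
        return "phrase ranker"


nlp = ToyPipeline()
mislabeled = nlp.add_pipe("textcat", "textrank")
# The label says "textrank", but the component is still the classifier.
print(nlp.pipe_names, "->", mislabeled)
```

In this toy, asking for the "textrank" factory before simulate_import_pytextrank() has run raises a ValueError that lists the available factories, which mirrors the shape of the E002 message in the original report.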

kaiyungtan commented 3 years ago

Hi @louisguitton, thanks for the quick response and for the explanation about 'textcat'.

I actually did import pytextrank. As you can see from the screenshot below:

[Screenshot 2021-05-17 at 12 11 48]

I tried it on Google Colab and in a Jupyter notebook on an Amazon SageMaker instance. I am still getting the same error.

louisguitton commented 3 years ago

Ah, I see, @kaiyungtan. Your issue description shows:

spaCy version 3.0.0rc5
Platform Linux-4.14.225-121.362.amzn1.x86_64-x86_64-with-glibc2.9
Python version 3.6.13
Pipelines en_core_web_sm (3.0.0)

Can you check your pytextrank version on that SageMaker instance like so?

In [1]: import pytextrank

In [2]: pytextrank.__version__
Out[2]: '3.1.2'

It may be that you're using an older pytextrank version without knowing it (because of which pip dependencies are cached in the SageMaker environment you're using). v3 and later versions of pytextrank introduced breaking changes for spaCy v3 compatibility. See https://github.com/DerwenAI/pytextrank/releases

So if you run the above check and see a 2.x.x version, please run in a cell:

!pip install -U pytextrank
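The same check can be scripted so it runs at the top of a notebook. This is a minimal sketch, not part of pytextrank: the helper names major_version, needs_upgrade and installed_version are hypothetical, and the threshold of 3 is taken from the statement above that v3 and later target spaCy v3. It relies on importlib.metadata, which is in the standard library from Python 3.8; on an older interpreter such as the Python 3.6.13 in the report, the importlib_metadata backport would presumably be needed instead.

```python
import re
from importlib import metadata  # standard library in Python 3.8+


def major_version(version):
    """Return the leading integer of a version string such as '3.1.2'."""
    match = re.match(r"\d+", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group())


def needs_upgrade(version, minimum_major=3):
    """True when the release predates the spaCy v3-compatible line."""
    return major_version(version) < minimum_major


def installed_version(package="pytextrank"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pytextrank is not installed in this environment")
    elif needs_upgrade(found):
        print(f"pytextrank {found} predates v3; run: pip install -U pytextrank")
    else:
        print(f"pytextrank {found} is a v3+ release")
```

After upgrading inside a notebook, the kernel usually has to be restarted before the new version is picked up by import pytextrank.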