DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

ZeroDivisionError: division by zero in _calc_discounted_normalised_rank #213

Open sumitkumarjethani opened 2 years ago

sumitkumarjethani commented 2 years ago

Hi,

I use this library together with spaCy to extract the most important words. However, when using the Catalan spaCy model, the algorithm raises the following error:

`File "/code/app.py", line 20, in getNlpEntities

entities = runTextRankEntities(hl, contents['contents'], algorithm, num)

File "/code/nlp/textRankEntities.py", line 51, in runTextRankEntities

doc = nlp(joined_content)

File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1022, in call

error_handler(name, proc, [doc], e)

File "/usr/local/lib/python3.9/site-packages/spacy/util.py", line 1617, in raise_error

raise e

File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1017, in call

doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]

File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 253, in call

doc._.phrases = doc._.textrank.calc_textrank()

File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 363, in calc_textrank

nc_phrases = self._collect_phrases(self.doc.noun_chunks, self.ranks)

File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 548, in _collect_phrases

return {

File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 549, in

span: self._calc_discounted_normalised_rank(span, sum_rank)

File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 592, in _calc_discounted_normalised_rank

phrase_rank = math.sqrt(sum_rank / (len(span) + non_lemma))

ZeroDivisionError: division by zero`
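
The failing expression divides by `len(span) + non_lemma`, so the error means a noun chunk reached `_calc_discounted_normalised_rank` with a zero denominator (if `non_lemma` is a non-negative count, that implies an empty span). Below is a minimal standalone sketch of that arithmetic with a defensive guard; the helper name `safe_discounted_rank` is only for illustration and is not part of pytextrank:

```python
import math

def safe_discounted_rank(span_len: int, non_lemma: int, sum_rank: float) -> float:
    """Mirror the failing expression, but return 0.0 when the
    denominator is empty instead of raising ZeroDivisionError."""
    denom = span_len + non_lemma
    if denom == 0:
        # an empty noun chunk contributes nothing to the ranking
        return 0.0
    return math.sqrt(sum_rank / denom)

# the reported crash corresponds to a zero denominator
print(safe_discounted_rank(0, 0, 0.0))   # 0.0 instead of ZeroDivisionError
print(safe_discounted_rank(3, 1, 0.5))   # ~0.3536
```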

ceteri commented 2 years ago

Hi @sumitkumarjethani, thank you for this report. Let's get it fixed!

Could you please provide:

Many thanks! Paco

sumitkumarjethani commented 2 years ago

Yeah sure!

  1. Code used for execution: the original code is fairly modular, so I am providing a simplified but equivalent version that can be run locally (please bear with me if it does not run as-is, since I wrote it directly on GitHub). A small driver that exercises it appears after this list.

""" Returns text rank entites """

def getTextRankEntities(doc):

entities = []

for phrase in doc._.phrases:
    phrase_dict = {}

    phrase_dict['entitie'] = phrase.text
    phrase_dict['score'] = phrase.rank
    phrase_dict['n_gram'] = len(phrase.text.split())
    phrase_dict['count'] = phrase.count

    entities.append(phrase_dict)
return entities

""" Main function to run text rank entites """

def runTextRankEntities(content):

entities = []

nlp = spacy.load("models/ca_core_news_lg-3.2.0/ca_core_news_lg/ca_core_news_lg-3.2.0") --> here you have to put the catalan pipeline name
nlp.add_pipe("textrank")

logger.info("Extracting entities with textrank algorithm")
doc = nlp(content)
entities = getTextRankEntities(doc)
logger.info("Entities extracted")
return entities
  2. With regard to the example data where the exception occurs, I am afraid I cannot share it. However, you can create a string with Catalan text and pass it to runTextRankEntities(content).
  3. spaCy was installed with: pip install spacy
  4. The spaCy Catalan model was downloaded with wget from: https://github.com/explosion/spacy-models/releases/download/ca_core_news_lg-3.2.0/ca_core_news_lg-3.2.0.tar.gz
  5. spaCy version: 3.2.3 | spaCy Catalan model version: 3.2.0
  6. OS: Windows 10 Home
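
As referenced in item 1, here is a minimal driver that can be used to try to reproduce the crash locally. The Catalan sample sentence is only a placeholder for the real input, and loading "ca_core_news_lg" by name assumes the model has been installed as a package (otherwise adjust it to the local path used above):

```python
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" component

if __name__ == "__main__":
    # assumes the Catalan model is installed as a package
    nlp = spacy.load("ca_core_news_lg")
    nlp.add_pipe("textrank")

    # placeholder Catalan text -- the real input is not available
    sample = "Barcelona és una ciutat situada a la costa mediterrània."
    doc = nlp(sample)

    for phrase in doc._.phrases:
        print(phrase.text, phrase.rank, phrase.count)
```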

If you need anything else, please let me know and I will try to respond as soon as possible.

Thank you very much