DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

using "noun_chunks" from custom extension #54

Closed fukidzon closed 1 year ago

fukidzon commented 4 years ago

I wanted to use pytextrank together with spacy_udpipe to get keywords from texts in other languages (see https://stackoverflow.com/questions/59824405/spacy-udpipe-with-pytextrank-to-extract-keywords-from-non-english-text), but I realized that spacy_udpipe somehow "overrides" the original spaCy pipeline, so the noun_chunks are not generated. (By the way, noun_chunks are produced by lang/en/syntax_iterators.py, which doesn't exist for all languages, so even when it is called it doesn't work for e.g. Slovak.)

Pytextrank takes its keyword candidates from spaCy's doc.noun_chunks, so if the noun chunks are not generated, pytextrank doesn't work.

Sample code:

import spacy_udpipe, spacy, pytextrank
spacy_udpipe.download("en") # download English model
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# using spacy_udpipe
nlp_udpipe = spacy_udpipe.load("en")
tr = pytextrank.TextRank(logger=None)
nlp_udpipe.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc_udpipe = nlp_udpipe(text)

print("keywords from udpipe processing:")
for phrase in doc_udpipe._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

# loading original spacy model
nlp_spacy = spacy.load("en_core_web_sm")
tr2 = pytextrank.TextRank(logger=None)
nlp_spacy.add_pipe(tr2.PipelineComponent, name="textrank", last=True)
doc_spacy = nlp_spacy(text)

print("keywords from spacy processing:")
for phrase in doc_spacy._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

Would it be possible for pytextrank to take its "noun chunks" (the keyword candidates) from a custom extension, i.e. a function that uses a Matcher and makes the result available e.g. as doc._.custom_noun_chunks (see https://github.com/explosion/spaCy/issues/3856)?
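Something along these lines, a rough, untested sketch of the #3856 approach in spaCy 2.x (the pattern and the custom_noun_chunks name are only illustrative):

```python
import spacy_udpipe
from spacy.matcher import Matcher
from spacy.tokens import Doc

# spacy_udpipe.download("sk")  # first run only
nlp = spacy_udpipe.load("sk")

# match one or more adjectives followed by one or more (proper) nouns
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ", "OP": "+"}, {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"}]
matcher.add("NP", None, pattern)  # spaCy 2.x Matcher.add signature

def custom_noun_chunks(doc):
    # return Matcher hits as Span objects, analogous to doc.noun_chunks
    return [doc[start:end] for _, start, end in matcher(doc)]

Doc.set_extension("custom_noun_chunks", getter=custom_noun_chunks, force=True)

doc = nlp("Wikipédia je webová encyklopédia s otvoreným obsahom.")
print(doc._.custom_noun_chunks)
```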

ceteri commented 4 years ago

Thank you @fukidzon

I've reworked the code that you provided as a Colab notebook: https://gist.github.com/ceteri/f3bfac641cffb61e10af5aae7eefc9dd so people can view and interact with the problem.

The root issue appears to be that the noun chunks produced by the UDpipe extension are much less rich than what spaCy produces? Or is that information available through some other field?

To address your main question:

If you have an example of an extension that has implemented "3856" then yes we could add support for that in the next release.

ceteri commented 4 years ago

Also, there's another implied question:

While that's possible, and somewhat closer to the original algorithm description, it would be a larger job to refactor the code.

I'll take a look, and try to scope it. We may be able to add a use_chunks flag that's True by default.
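Roughly what I have in mind, as a hypothetical sketch only (this helper is not part of the current pytextrank API):

```python
def collect_chunk_candidates(doc, use_chunks=True):
    """Hypothetical helper: choose where phrase candidates come from."""
    if use_chunks:
        # default behaviour: rely on spaCy's built-in noun chunks
        return list(doc.noun_chunks)
    # otherwise fall back to a custom extension, e.g. doc._.custom_noun_chunks
    return list(doc._.custom_noun_chunks)
```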

Back to your original question on StackOverflow: could you provide a brief example text in sk, along with the code you're using to run that pipeline? If you could also list which phrases would be expected, that would help a lot!

ceteri commented 4 years ago

Another issue that was mentioned:

I get tokens with POS and DEP tags, but there is nothing in doc._.phrases (doc.noun_chunks is also empty) and in nlp.pipe_names is just ['textrank']

See the gist: it appears that spacy_udpipe clears nlp.pipe_names.

ceteri commented 4 years ago

@fukidzon @asajatovic

The points above identify two issues in the spacy_udpipe implementation. It might be more efficient to make a pull request on that project to resolve those issues, rather than adapt to them?

asajatovic commented 4 years ago

@ceteri You can find the explanation for why only ['textrank'] shows up in nlp.pipe_names at https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/language.py#L75-L76.

@fukidzon @ceteri Regarding the doc.noun_chunks property, it is built from a dependency-parsed document: https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L577. If you peek a little deeper into the spaCy source code, you'll notice that some languages have a proper syntax iterator implementation (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L206, e.g. English: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py#L7) and others don't. The idea behind spacy-udpipe is to be a lightweight wrapper around the underlying UDPipe models. As the dependency labels used in both spaCy and spacy-udpipe follow the UD scheme for languages other than English and German, I believe the only thing required for doc.noun_chunks to work is a proper syntax iterator implementation. Taking all of this into account, I suggest you either try the approach using doc._.custom_noun_chunks or try implementing the syntax iterator for your language. Hope this helps to solve your issue! :)
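For illustration, a minimal syntax iterator could look roughly like the English one, but keyed on UD dependency labels (an untested sketch; the label set would need tuning per language):

```python
from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    # UD dependency labels that typically mark the head of a noun phrase
    labels = ["nsubj", "obj", "iobj", "obl", "nmod", "appos", "ROOT"]
    doc = doclike.doc
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    np_label = doc.vocab.strings.add("NP")
    seen = set()
    for word in doclike:
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        if word.dep in np_deps and word.i not in seen:
            # expand left to the subtree edge so modifiers are included
            start = word.left_edge.i
            seen.update(range(start, word.i + 1))
            yield start, word.i + 1, np_label

SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
```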

ceteri commented 4 years ago

Thank you kindly @asajatovic, that's good to know and makes a lot of sense to use that approach.

@fukidzon I can help with a syntax iterator implementation. To start, I'd need a language sample and the expected output; the other languages in which I'm conversant already have core models in spaCy.

fukidzon commented 4 years ago

@ceteri @asajatovic thank you for the comments!

I created a colab notebook with custom noun_chunks example for Slovak: https://colab.research.google.com/drive/1tLMUMpFTGvxvp32YQYF5LC-nlTlUdtYz

Creating a syntax_iterators implementation for the Slovak language would be the best solution. I was already looking into it, but I think it needs a deeper look at the language structure to get it right (ideally it would be part of the spaCy codebase, not just a local workaround).

We may be able to add a use_chunks flag that's True by default

I like the idea that it would be possible to provide some other source of "noun_chunks".

fukidzon commented 4 years ago

I found a solution:

import spacy_udpipe, spacy, pytextrank
from spacy.matcher import Matcher
from spacy.attrs import POS

def get_chunks(doc):
    np_label = doc.vocab.strings.add("NP")
    matcher = Matcher(nlp.vocab)
    pattern = [{POS: 'ADJ', "OP": "+"}, {POS: {"IN": ["NOUN", "PROPN"]}, "OP": "+"}]
    matcher.add("Adjective(s), (p)noun", None, pattern)
    matches = matcher(doc)

    for match_id, start, end in matches:
        yield start, end, np_label

#spacy_udpipe.download("sk") # download model
nlp = spacy_udpipe.load("sk")
nlp.Defaults.syntax_iterators = {"noun_chunks" : get_chunks}  #noun_chunk replacement

tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

text = "Wikipédia je webová encyklopédia s otvoreným obsahom, ktorú možno slobodne čítať aj upravovať. Je sponzorovaná neziskovou organizáciou Wikimedia Foundation. Má 285 nezávislých jazykových vydaní vrátane slovenského a najrozsiahlejšieho anglického. Popri článkoch encyklopedického typu obsahuje, najmä anglická encyklopédia, aj články podobajúce sa almanachu, atlasu či stránky aktuálnych udalostí. Wikipédia je jedným z najpopulárnejších zdrojov informácií na webe s približne 13 miliardami zobrazení mesačne. Jej rast je skoro exponenciálny. Wikipédii (takmer 2 milióny). Wikipédia bola spustená 15. januára 2001 ako doplnok k expertmi písanej Nupedii. So stále rastúcou popularitou sa Wikipédia stala podhubím pre sesterské projekty ako Wikislovník (Wiktionary), Wikiknihy (Wikibooks) a Wikisprávy (Wikinews). Jej články sú upravované dobrovoľníkmi vo wiki štýle, čo znamená, že články môže meniť v podstate hocikto. Wikipediáni presadzujú politiku „nestranný uhol pohľadu“. Podľa nej relevantné názory ľudí sú sumarizované bez ambície určiť objektívnu pravdu. Vzhľadom na to, že Wikipédia presadzuje otvorenú filozofiu, jej najväčším problémom je vandalizmus a nepresnosť. "
doc = nlp(text)

print("Noun chunks:")
for nc in doc.noun_chunks:
    print(nc)

print("\nKeywords:")
for phrase in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)    

I'm not sure how clean this workaround nlp.Defaults.syntax_iterators = {"noun_chunks": get_chunks} is, but it works. (It's based on how noun_chunks are defined in syntax_iterators.py and __init__.py in spaCy/lang/en.)
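For reference, the registration side in spaCy 2.x lang/en/__init__.py boils down to something like this (a sketch only, reusing get_chunks from the snippet above):

```python
from spacy.language import Language

class CustomDefaults(Language.Defaults):
    # hook the iterator into the language defaults, as lang/en/__init__.py
    # does with its SYNTAX_ITERATORS dict
    syntax_iterators = {"noun_chunks": get_chunks}

class CustomLanguage(Language):
    lang = "sk"
    Defaults = CustomDefaults
```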

ghost commented 4 years ago

I have much the same problem using the pt_core_news_sm model: noun_chunks is empty since there is no appropriate syntax iterator. I tested this solution there and it is now able to produce results. However, I noticed that it seems only the noun_chunks can be returned as keywords, and I'm not sure why that is.

I'm not sure how clean this workaround nlp.Defaults.syntax_iterators = {"noun_chunks": get_chunks} is, but it works. (It's based on how noun_chunks are defined in syntax_iterators.py and __init__.py in spaCy/lang/en.)

Adding a syntax_iterator seems like the cleanest thing to do. The only concern I would have with the presented solution is that it requires the parser to be after the tagger in the pipeline.

andremacola commented 3 years ago

Hi, I'm trying to use nlp.Defaults.syntax_iterators with spaCy v3 with no success. My language (pt) does not have a syntax_iterators.py file in the spaCy lang folder.

Does this only work with spacy_udpipe? I'm not using it.

ceteri commented 3 years ago

Hi @andremacola, could you help us by showing some example code for the pipeline you're building with spaCy 3.x? We may be able to help.

The code for syntax_iterators is in https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py

Also, if this is more of a spaCy question, we could move this thread to https://github.com/explosion/spaCy/discussions/ to get more help.
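One possible direction for spaCy 3.x, as an untested sketch (assuming pytextrank 3.x, which registers a "textrank" factory): in v3, Doc.noun_chunks delegates to vocab.get_noun_chunks, so overriding that attribute on the loaded pipeline may be enough.

```python
import spacy
import pytextrank  # registers the "textrank" pipeline factory (pytextrank 3.x)
from spacy.matcher import Matcher

nlp = spacy.load("pt_core_news_sm")

matcher = Matcher(nlp.vocab)
# contiguous runs of nouns / proper nouns / adjectives (crude; tune per language)
matcher.add("NP", [[{"POS": {"IN": ["NOUN", "PROPN", "ADJ"]}, "OP": "+"}]])

def get_chunks(doclike):
    doc = doclike.doc
    np_label = doc.vocab.strings.add("NP")
    for match_id, start, end in matcher(doc):
        yield start, end, np_label

# Doc.noun_chunks in spaCy 3.x calls vocab.get_noun_chunks under the hood
nlp.vocab.get_noun_chunks = get_chunks

nlp.add_pipe("textrank", last=True)
doc = nlp("São Paulo é a maior cidade do Brasil.")
for phrase in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
```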