Thank you @fukidzon
I've reworked the code that you provided as a Colab notebook: https://gist.github.com/ceteri/f3bfac641cffb61e10af5aae7eefc9dd so people can view and interact with the problem.
The root issue appears to be that the noun chunks produced by the UDPipe extension are much less rich than what spaCy produces? Or is that information available through some other field?
To address your main question: can PyTextRank be extended to support doc._.custom_noun_chunks as in https://github.com/explosion/spaCy/issues/3856 ? If you have an example of an extension that has implemented "3856", then yes, we could add support for that in the next release.
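For reference, a minimal sketch of the kind of extension described in #3856 could look like the following (using the spaCy v2 Matcher API that appears later in this thread; the extension name and pattern are illustrative, and PyTextRank does not consume it yet):

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc

def custom_noun_chunks(doc):
    # illustrative pattern: optional adjectives followed by one or more (proper) nouns
    matcher = Matcher(doc.vocab)
    pattern = [{"POS": "ADJ", "OP": "*"}, {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"}]
    matcher.add("NP", None, pattern)  # spaCy v2 signature: (key, on_match, *patterns)
    return [doc[start:end] for _, start, end in matcher(doc)]

# expose the result as doc._.custom_noun_chunks
Doc.set_extension("custom_noun_chunks", getter=custom_noun_chunks, force=True)
```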
Also, there's another implied question: can PyTextRank be modified so that it does not depend on the availability of noun chunks? While that's possible, and somewhat closer to the original algorithm description, it would be a larger job to refactor the code. I'll take a look and try to scope it. We may be able to add a use_chunks flag that's True by default.
Back to your original question on StackOverflow: could you provide a brief example text in sk, along with the code you're using to run that pipeline? If you could also provide the phrases that would be expected, that would help lots!
Another issue that was mentioned:

> I get tokens with POS and DEP tags, but there is nothing in doc._.phrases (doc.noun_chunks is also empty) and in nlp.pipe_names is just ['textrank']

See the gist: it appears that spacy_udpipe clears nlp.pipe_names.
@fukidzon @asajatovic The points above identify two issues in the spacy_udpipe implementation. It might be more efficient to make a pull request on that project to resolve those issues, rather than adapt to them?
@ceteri You can find the explanation for only ['textrank'] showing up in nlp.pipe_names at https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/language.py#L75-L76.
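To see that in practice, a quick check like the following (my own sketch, assuming the "sk" UDPipe model has been downloaded) shows that the wrapper leaves the pipeline itself empty, since the UDPipe model runs inside the tokenizer:

```python
import spacy_udpipe

spacy_udpipe.download("sk")    # one-time model download
nlp = spacy_udpipe.load("sk")
print(nlp.pipe_names)          # expected: [] -- only components added afterwards (e.g. "textrank") show up
```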
@fukidzon @ceteri Regarding the doc.noun_chunks property, it is built from a dependency-parsed document (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L577). If you peek a little deeper into the spaCy source code, you'll notice that some languages have an implementation of a proper syntax iterator (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L206, e.g. English: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py#L7) and others don't. The idea behind spacy-udpipe is to be a lightweight wrapper around the underlying UDPipe models. As the dependency labels used in both spaCy and udpipe-spacy follow the UD scheme for languages other than English and German, I believe the only thing required for doc.noun_chunks to work is a proper syntax iterator implementation. Taking all of this into account, I suggest you try the approach using doc._.custom_noun_chunks or try implementing the syntax iterator for your language. Hope this helps to solve your issue! :)
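To make that suggestion concrete, here is a simplified sketch of what such a syntax iterator could look like, loosely modeled on spaCy's English lang/en/syntax_iterators.py but keyed on UD dependency labels; the label set and the yielded spans are illustrative, not a vetted implementation for any particular language:

```python
from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    # UD dependency labels that may head a noun chunk -- an illustrative, not exhaustive, set
    labels = ["nsubj", "obj", "iobj", "obl", "nmod", "appos", "ROOT"]
    doc = doclike.doc
    np_deps = {doc.vocab.strings.add(label) for label in labels}
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for word in doclike:
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # skip tokens already covered by a chunk yielded earlier
        if word.left_edge.i <= prev_end:
            continue
        if word.dep in np_deps:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label
```

In spaCy v2 this would be registered through the language's Defaults (e.g. Defaults.syntax_iterators = {"noun_chunks": noun_chunks}), which is essentially what the workaround further down in this thread does.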
Thank you kindly @asajatovic, that's good to know and makes a lot of sense to use that approach.
@fukidzon I can help with a syntax iterator implementation. To start, I'd need a language sample and the expected output -- there are core models in spaCy for the other languages in which I'm conversant.
@ceteri @asajatovic thank you for the comments!
I created a colab notebook with custom noun_chunks example for Slovak: https://colab.research.google.com/drive/1tLMUMpFTGvxvp32YQYF5LC-nlTlUdtYz
Creating syntax_iterators for the Slovak language would be the best solution. I was already looking into it, but I think it needs a deeper look at the language structure to do it correctly (it would be best if it were part of the spaCy code, not just a local workaround).

> We may be able to add a use_chunks flag that's True by default

I like the idea that it would be possible to provide some other source of "noun_chunks".
I found a solution:

```python
import spacy_udpipe, spacy, pytextrank
from spacy.matcher import Matcher
from spacy.attrs import POS

def get_chunks(doc):
    # Matcher-based replacement for noun_chunks:
    # one or more adjectives followed by one or more (proper) nouns
    np_label = doc.vocab.strings.add("NP")
    matcher = Matcher(nlp.vocab)
    pattern = [{POS: 'ADJ', "OP": "+"}, {POS: {"IN": ["NOUN", "PROPN"]}, "OP": "+"}]
    matcher.add("Adjective(s), (p)noun", None, pattern)
    matches = matcher(doc)
    for match_id, start, end in matches:
        yield start, end, np_label

#spacy_udpipe.download("sk")  # download model
nlp = spacy_udpipe.load("sk")
nlp.Defaults.syntax_iterators = {"noun_chunks": get_chunks}  # noun_chunks replacement

tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

text = "Wikipédia je webová encyklopédia s otvoreným obsahom, ktorú možno slobodne čítať aj upravovať. Je sponzorovaná neziskovou organizáciou Wikimedia Foundation. Má 285 nezávislých jazykových vydaní vrátane slovenského a najrozsiahlejšieho anglického. Popri článkoch encyklopedického typu obsahuje, najmä anglická encyklopédia, aj články podobajúce sa almanachu, atlasu či stránky aktuálnych udalostí. Wikipédia je jedným z najpopulárnejších zdrojov informácií na webe s približne 13 miliardami zobrazení mesačne. Jej rast je skoro exponenciálny. Wikipédii (takmer 2 milióny). Wikipédia bola spustená 15. januára 2001 ako doplnok k expertmi písanej Nupedii. So stále rastúcou popularitou sa Wikipédia stala podhubím pre sesterské projekty ako Wikislovník (Wiktionary), Wikiknihy (Wikibooks) a Wikisprávy (Wikinews). Jej články sú upravované dobrovoľníkmi vo wiki štýle, čo znamená, že články môže meniť v podstate hocikto. Wikipediáni presadzujú politiku „nestranný uhol pohľadu“. Podľa nej relevantné názory ľudí sú sumarizované bez ambície určiť objektívnu pravdu. Vzhľadom na to, že Wikipédia presadzuje otvorenú filozofiu, jej najväčším problémom je vandalizmus a nepresnosť. "
doc = nlp(text)

print("Noun chunks:")
for nc in doc.noun_chunks:
    print(nc)

print("\nKeywords:")
for phrase in doc._.phrases:
    print("{:.4f} {:5d} {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)
```
I'm not sure how clean this workaround nlp.Defaults.syntax_iterators = {"noun_chunks": get_chunks} is, but it works. (It's based on how the noun_chunks are defined in syntax_iterators.py and __init__.py in spaCy/lang/en.)
I have much the same problem using the pt_core_news_sm model: noun_chunks are empty since there is no appropriate syntax iterator. I tested this solution there and it's also able to produce results now. However, I noticed that only the noun_chunks seem to be returned as keywords; I'm not sure why that is.
> I'm not sure how clean this workaround nlp.Defaults.syntax_iterators = {"noun_chunks": get_chunks} is, but it works (it's based on how the noun_chunks are defined in syntax_iterators.py and __init__.py in spaCy/lang/en).
Adding a syntax_iterator seems like the cleanest thing to do. The only concern I would have with the presented solution is that it requires the parser to be after the tagger in the pipeline.
Hi, I'm trying to use nlp.Defaults.syntax_iterators with spaCy v3, with no success. My language (pt) does not have a syntax_iterators.py file in the spaCy lang folder. Does this only work with spacy_udpipe? I'm not using it.
Hi @andremacola, could you help us by showing some example code for the pipeline you're building with spaCy 3.x? We may be able to help. The code for syntax_iterators is in https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py

Also, if this is more of a spaCy question, we could move this thread to https://github.com/explosion/spaCy/discussions/ to get more help.
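As for spaCy v3: assigning to nlp.Defaults.syntax_iterators after the pipeline has been loaded generally has no effect, because the iterator is picked up from the language defaults when the vocab is created. A rough, unverified sketch of a workaround is below; it assumes v3's Vocab.get_noun_chunks hook (which doc.noun_chunks consults) and uses a deliberately naive pt iterator for illustration:

```python
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" factory for spaCy v3 pipelines

def pt_noun_chunks(doclike):
    # very rough heuristic: each NOUN/PROPN together with its left edge forms a chunk
    doc = doclike.doc
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for word in doclike:
        if word.pos_ in ("NOUN", "PROPN") and word.left_edge.i > prev_end:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label

nlp = spacy.load("pt_core_news_sm")
nlp.vocab.get_noun_chunks = pt_noun_chunks   # assumption: v3 reads the noun_chunks iterator from the vocab
nlp.add_pipe("textrank", last=True)
```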
I wanted to use pytextrank together with spacy_udpipe to get keywords from texts in other languages (see https://stackoverflow.com/questions/59824405/spacy-udpipe-with-pytextrank-to-extract-keywords-from-non-english-text), but I realized that udpipe-spacy somehow "overrides" the original spaCy pipeline, so the noun_chunks are not generated. (By the way: the noun_chunks are created in lang/en/syntax_iterators.py, but that file doesn't exist for all languages, so even if it is called, it doesn't work, e.g., for the Slovak language.)
Pytextrank keywords are taken from spaCy's doc.noun_chunks, but if the noun_chunks are not generated, pytextrank doesn't work.
Sample code:
Would it be possible for pytextrank to take the "noun_chunks" (candidates for keywords) from a custom extension (a function which uses a Matcher, with the result available e.g. as doc._.custom_noun_chunks; see https://github.com/explosion/spaCy/issues/3856)?