DerwenAI / spaCy_tuTorial

A brief tutorial for spaCy 2.x, which runs on Google Colab
https://derwen.ai/s/d5c7
MIT License

Question on the challenge's extras #2

Closed louisguitton closed 3 years ago

louisguitton commented 4 years ago

Hi, thank you for this awesome resource. Really helpful to implement TextRank step by step.

1. Question on the extra

I was wondering if you could share some pointers regarding the last question:

how could you use noun chunks and entities within the graph?

So far, I've thought of building a bipartite graph with noun_chunks at the top and entities at the bottom. Edges could be weighted by similarity (resulting in a highly connected graph) or based on inclusion (`if ent.text in nc.text`, resulting in a loosely connected graph, as in vanilla TextRank). But then the top-ranked noun chunks and entities come out mixed together, and I'm not sure how useful that is.
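As a rough sketch of the inclusion variant, with toy strings standing in for spaCy's `doc.noun_chunks` and `doc.ents` (all node names invented):

```python
import networkx as nx

# Toy stand-ins for spaCy span texts (nc.text and ent.text)
noun_chunks = ["the city of Paris", "my friend Paris", "a football match"]
entities = ["Paris", "Hobart"]

G = nx.Graph()
G.add_nodes_from(noun_chunks, bipartite=0)  # top: noun chunks
G.add_nodes_from(entities, bipartite=1)     # bottom: entities

# Inclusion edges: link an entity to every noun chunk whose text contains it
for nc in noun_chunks:
    for ent in entities:
        if ent in nc:
            G.add_edge(nc, ent)

# Rank nodes of both kinds together, TextRank-style
ranks = nx.pagerank(G)
top = sorted(ranks, key=ranks.get, reverse=True)
```

Here "Paris" attaches to two different noun chunks, which is exactly the multiple-entities-per-chunk situation a graph represents well.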

2. Question for next steps

In your work on pytextrank, have you looked at (or come across) using other graph algorithms on the text graph? I was trying to find networkx's implementation of the greedy algorithm for finding densest subgraphs, but couldn't find it. I did find networkx.algorithms.approximation.kcomponents.k_components, which looks promising. Also, networkx.algorithms.components.articulation_points, when they exist, seem to carry quite a bit of information.
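For reference, a minimal sketch of those two networkx calls on a toy text graph (all node names invented):

```python
import networkx as nx
from networkx.algorithms import approximation

# Toy text graph: two dense clusters joined through a single "bridge" term
G = nx.Graph()
G.add_edges_from([
    ("goal", "match"), ("match", "team"), ("goal", "team"),          # cluster 1
    ("coach", "club"), ("club", "transfer"), ("coach", "transfer"),  # cluster 2
    ("team", "league"), ("league", "club"),                          # bridge via "league"
])

# Approximate k-components: maximal node sets that remain connected
# after removing any k-1 nodes
k_comps = approximation.k_components(G)

# Articulation points: nodes whose removal disconnects the graph
cut_nodes = set(nx.articulation_points(G))
```

In this example the articulation points are exactly the bridge term and the two cluster nodes it attaches to, which matches the intuition that they "carry" the connectivity.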

ceteri commented 4 years ago

Thank you kindly @louisguitton!

Those questions were added to prompt people to discuss ways to extend TextRank. For the "extras" I was thinking about ways to enrich the graph, so that the algorithm would tend to produce better output. For example, one could extend the use of spacy-wordnet to give TextRank a more "directed search" for a particular domain.

The bipartite graph you've described sounds like a good approach. The entities found by NER tend to be a subset of the noun chunks, and also there can be multiple entities associated with the same noun chunk (Paris is a city, Paris is my friend in Hobart) - so a graph works well to represent those cases.

Mostly I was thinking about problems related to disambiguation within a graph -- which is a problem we had to tackle in an earlier knowledge graph project that used PyTextRank.

ceteri commented 4 years ago

Great question about other graph algorithms! Most definitely!! Some general thoughts about that...

I'm very interested in means for automated or semi-automated construction of knowledge graphs. There was a talk at IIT Patna a couple of weeks ago about PyTextRank, which begins to sketch an approach in the last few slides: https://derwen.ai/s/h88s https://youtu.be/ZwlPsdRDtMI

If you have a large corpus of docs, one approach is to parse them, use this library to extract ranked phrases, then use those results as input for vector embedding. It can work to build a reasonably good first approximation of a KG, although it will be noisy.
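As a rough illustration of that last step, here is a toy similarity graph built from hand-made vectors (standing in for real phrase embeddings, e.g. spaCy span vectors; the phrases and the 0.7 threshold are invented):

```python
import math
import networkx as nx

# Hand-made 3-d vectors standing in for real phrase embeddings
phrases = {
    "premier league": (0.9, 0.1, 0.0),
    "football club":  (0.8, 0.2, 0.1),
    "transfer fee":   (0.1, 0.9, 0.2),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Link phrases whose embeddings are similar; the result is a noisy
# first approximation of a KG that later filtering can prune
G = nx.Graph()
G.add_nodes_from(phrases)
names = list(phrases)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = cosine(phrases[a], phrases[b])
        if sim > 0.7:  # arbitrary threshold
            G.add_edge(a, b, weight=sim)
```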

One of the ways to help filter out the noise in that case is a disparity filter -- which I hadn't seen implemented or used much otherwise.
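A compact sketch of that disparity filter, following the closed-form p-value from Serrano et al. (2009); the `alpha` threshold is a free parameter:

```python
import networkx as nx

def disparity_filter(G, alpha=0.05):
    """Keep edges that are statistically significant for at least one
    endpoint, per the disparity filter (Serrano et al., 2009)."""
    keep = set()
    for node in G:
        k = G.degree(node)
        if k < 2:
            continue  # a single edge carries all the weight by definition
        strength = sum(d.get("weight", 1.0)
                       for _, _, d in G.edges(node, data=True))
        for _, nbr, d in G.edges(node, data=True):
            p = d.get("weight", 1.0) / strength
            # p-value under the null model of a uniformly random weight split
            if (1.0 - p) ** (k - 1) < alpha:
                keep.add(frozenset((node, nbr)))
    return nx.Graph((u, v, G[u][v]) for u, v in G.edges()
                    if frozenset((u, v)) in keep)
```

On a weighted star graph where one edge dominates a node's strength, this keeps only the dominant edge and drops the low-weight "noise" edges.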

Use of k-core could be useful for similar reasons, when "pruning" a generated graph. Perhaps even more general-purpose are persistent homology and topological data analysis -- although those aren't quite available in networkx yet :)
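For the k-core pruning, a toy networkx example (node names invented):

```python
import networkx as nx

# Toy phrase graph: a dense core plus degree-1 peripheral terms
G = nx.Graph([
    ("match", "team"), ("team", "goal"), ("goal", "match"),
    ("match", "league"), ("team", "league"), ("goal", "league"),  # dense core
    ("league", "stray1"), ("goal", "stray2"),                     # periphery
])

# k-core: repeatedly drop nodes of degree < k until none remain
core = nx.k_core(G, k=2)
```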

Use of eccentricity may be helpful too: it gives a node's maximum shortest-path distance to any other node in the graph. In other words, if the parsed and ranked phrases can be linked to entities in a graph, then one use is to find the minimum distance between entities, and so express "where" a document's top-ranked phrases "locate" its content with respect to particular topic regions in the graph. That's sort of like "playing backwards" the extractive summarization.
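A minimal illustration of those distance measures on a toy path graph (node names invented):

```python
import networkx as nx

# Toy entity graph: a document phrase linked into a chain of topic regions
G = nx.path_graph(["doc_phrase", "Paris", "France", "Europe", "geography"])

# Shortest-path distance from the document's top-ranked phrase to each entity
dist = nx.shortest_path_length(G, source="doc_phrase")

# Eccentricity: each node's maximum shortest-path distance to any other node;
# low eccentricity means the node sits near the "center" of the graph
ecc = nx.eccentricity(G)
```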

Those are some ideas. Does that help?

BTW, in terms of graph algorithms, are there particular use cases that you're considering?

louisguitton commented 4 years ago

Those are some ideas. Does that help?

Thanks a lot for this. Yes, this helps me a lot in structuring what I can look into next! There's definitely a lot to unpack, as I hadn't heard of topological data analysis or eccentricity before. I've gone through the presentation once more and will watch the talk.

BTW, in terms of graph algorithms, are there particular use cases that you're considering?

My use case: I have a corpus of football news articles that I need to label with tags from a set of 10k tags. I've explored framing this as multi-label text classification, and it fails because the tag set is too big. I've also explored entity linking (implementing TAGME to disambiguate NER mentions); although it works in terms of information extraction, it fails for us as a business because it doesn't comply with the fair number of business rules that govern how we tag things (e.g. when an article is about a game, we should tag both teams and not miss one, etc.).

Now I'm reframing this as a graph problem, because it provides new pruning options (e.g. run TextRank on the mentions, then select the top k mentions where k = |V| / 3). With NER mentions, the pruning "comes on top" (right now I disambiguate with TAGME and prune by relatedness, measured with the Wikipedia link-based measure), but with a graph approach each graph algorithm gives you a new pruning perspective, and I'm thinking I could combine them:
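For concreteness, the TextRank-plus-top-k pruning I have in mind looks roughly like this (toy mention graph for one article; all mention names and the co-occurrence edges are invented):

```python
import networkx as nx

# Toy mention graph: nodes = disambiguated NER mentions,
# edges = co-occurrence within the same article
G = nx.Graph([
    ("Arsenal", "Chelsea"), ("Arsenal", "Premier League"),
    ("Chelsea", "Premier League"), ("Arsenal", "Emirates Stadium"),
    ("Chelsea", "Stamford Bridge"), ("Premier League", "VAR"),
])

# TextRank-style pruning: rank mentions with PageRank,
# then keep the top k = |V| / 3
ranks = nx.pagerank(G)
k = max(1, len(G) // 3)
top_mentions = sorted(ranks, key=ranks.get, reverse=True)[:k]
```

Each alternative graph algorithm (k-core, articulation points, a disparity filter, ...) would then slot in as a different pruning rule over the same mention graph.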