DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

Adding support for other TextRank flavours, including PositionRank and Biased TextRank #78

Closed · louisguitton closed this issue 2 years ago

louisguitton commented 3 years ago

Papers

COLING'2020 will be happening next week (Dec 8-13, 2020, online), and one of the papers accepted there caught my attention. It's called Biased TextRank: Unsupervised Graph-Based Content Extraction (2020), written under the supervision of Rada Mihalcea. That paper led me to read another paper, A Position-Biased PageRank Algorithm for Keyphrase Extraction (2017), which actually seems to work really well for news data, my use case.

Summary of different TextRank flavours

| model (reference) | TextRank (2004 paper) | PositionRank (2017 paper) | simple Biased TextRank (no ref) | Biased TextRank (2020 paper) |
| --- | --- | --- | --- | --- |
| Document parse | Each token with POS in ["ADJ", "NOUN", "PROPN", "VERB"] | c.f. TextRank | c.f. TextRank | Tokens like TextRank if we're looking for keywords; sentences if we're looking for sentences |
| Nodes for graph construction | (lemma, POS) | c.f. TextRank | c.f. TextRank | Language model embeddings: Word2Vec for tokens, LASER for sentences |
| Edges for graph construction | Co-occurrence counts within a window of contiguous tokens | c.f. TextRank | c.f. TextRank | Cosine similarity between nodes |
| Restart probabilities for pagerank | Uniform distribution | The weight of each candidate word equals its inverse position in the document; if the same word appears multiple times in the target document, we sum all its position weights. Then normalised. | Bespoke weighting in favour of the "task focus", a short input text that allows for topical focus. Example weighting: 1 if the keyword is in the focus, 0 otherwise. | Cosine similarity to the "task focus", a short input text that allows for topical focus. Example "task focus" for an election night news report: "Joe Biden" => will give you the relevant one-sided summary. |
| Candidate generation | Entities and noun chunks are split into tokens, with additive pagerank weight | c.f. TextRank | c.f. TextRank | c.f. TextRank |
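
To make the PositionRank restart probabilities concrete, here's a minimal sketch of the inverse-position weighting described above (the function name and the naive whitespace tokenisation are mine, just for illustration):

```python
from collections import defaultdict

def position_weights(tokens):
    """PositionRank-style restart probabilities: each occurrence of a
    word contributes 1/position (1-indexed), contributions are summed
    per word, then the weights are normalised to sum to 1."""
    weights = defaultdict(float)
    for pos, token in enumerate(tokens, start=1):
        weights[token] += 1.0 / pos
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}

# "rank" appears at positions 1 and 5, so it gets 1/1 + 1/5 before normalising.
print(position_weights("rank the words by rank".split()))
```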

Discussion

Before moving into implementation details, I'd love to get your high level thoughts on a couple points.

  1. Is adding more TextRank flavours to pytextrank of value?

  2. The rows of the above table appear to me as "components" and adding more TextRank flavours might require a small refactoring of the current class. Is that something that is too ambitious given pytextrank is used in a lot of places?

  3. PositionRank and the one I called "simple Biased TextRank" rely on personalised pagerank, which is supported in networkx (a fuller sketch follows after this list):

    nx.pagerank(G, personalization={1:1, 2:0, 3:0, 4:0})
  4. Biased TextRank and "simple Biased TextRank" belong more to a semi-supervised setting, as the user needs to provide a list of tokens (or a doc) as the "task focus".
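
For point 3, a self-contained sketch of what that looks like on a toy lemma graph (the graph and the weights are made up for illustration):

```python
import networkx as nx

# Toy lemma graph; nodes and edges are invented for illustration.
G = nx.Graph()
G.add_edges_from([
    ("biden", "election"), ("election", "night"),
    ("biden", "speech"), ("night", "speech"),
])

# Uniform restart distribution: plain TextRank.
baseline = nx.pagerank(G)

# Personalised restart: all probability mass on the "task focus" node.
personalization = {"biden": 1.0, "election": 0.0, "night": 0.0, "speech": 0.0}
biased = nx.pagerank(G, personalization=personalization)

print(sorted(biased, key=biased.get, reverse=True))
```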

ceteri commented 3 years ago

Wow, thank you kindly @louisguitton for the fantastic overview and comparisons of the relative features and needs!!

  1. Is adding more TextRank flavours to pytextrank of value?

In general, yes, adding more flavours would be super helpful for PyTextRank -- and @shyamcody had also suggested Biased TextRank recently :) https://twitter.com/shyambhumukher1/status/1325260405472600064

FWIW, when I checked the web site for Rada's lab (the paper's link to code) there was a 404 error. I should give her a heads-up about that, although they also have an implementation on GitHub: https://github.com/ashkankzme/biased_textrank/tree/master/data

  2. The rows of the above table appear to me as "components" and adding more TextRank flavours might require a small refactoring of the current class. Is that something that is too ambitious given pytextrank is used in a lot of places?

Well said! We could stand some refactoring here, and it can likely be managed without disrupting or deprecating too much of the existing use cases. Perhaps a 3.x branch might be a good approach? Although a couple of other points below might help in terms of roadmap and feature priorities.

  3. PositionRank and the one I called "simple Biased TextRank" rely on personalised pagerank ...

Yes, that would likely be the simplest to introduce. I'd been hoping to introduce personalised pagerank for a while; plus, it's potentially a step toward some of the simpler entity linking approaches too.

ceteri commented 3 years ago

Some other factors:

First, I'm not sure how much PyTextRank will need to change to adapt to spaCy 3.x features, or should change to leverage them more fully. I've been hoping to carve off lots of time over the winter holidays to test and scope that better. That's on the roadmap horizon for the most immediate PTR releases, but it shouldn't get in the way of contributions! I'll work toward supporting this however needed :)

Also, please bear with me; I'll try to articulate something that I've struggled to say for a while ... Clearly the language embedding models, transformers, etc., of the entire "Sesame Street" lineup since late 2017 have had an enormous impact on natural language work. I like to leverage that, although I also recognize how PTR is part of the counterpoint argument for "not enormous models" and potentially leveraging domain knowledge. So it seems there's a potential trade-off there, in terms of the 2020 paper? In other words, I'd really like to pursue integration of domain knowledge.

On the one hand, there's a lemma graph used inherently in TextRank, and I feel that it's important to distinguish that from a data graph that might be used for entity linking and other KG integration. IMO, Biased TextRank is steering away from that, and potentially mixing the two graphs in a way that won't be as useful long-term.

For example, in my spaCy tutorials I've shown some examples of using spaCy-wordnet along with PTR to bring in domain knowledge and optimize using semantic technologies.

In an earlier TextRank implementation that ShareThis.com used in production beginning in 2009, I'd shown how to enrich the lemma graph within TextRank to optimize results, i.e., linking in hypernyms and hyponyms from an external resource to add more edges into the lemma graph prior to running the centrality metrics. In other words, if you were parsing many documents about sports news, you could potentially use a KG representing sports news topics, their connections, their synonyms, hypernyms, hyponyms, etc., to enrich TextRank. A very nice side benefit is that the entity linking almost becomes a side-effect of the optimization!
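
To illustrate (this is a hypothetical sketch of mine, not the ShareThis code), the enrichment step before the centrality metric might look like:

```python
import networkx as nx

# Hypothetical external resource mapping lemmas to hypernyms;
# in practice this could come from WordNet or a domain-specific KG.
HYPERNYMS = {
    "striker": ["footballer"],
    "goalkeeper": ["footballer"],
    "footballer": ["athlete"],
}

def enrich_lemma_graph(graph: nx.Graph) -> nx.Graph:
    """Link in hypernym edges so that semantically related lemmas
    reinforce each other when the centrality metric runs."""
    for lemma in list(graph.nodes):
        for hypernym in HYPERNYMS.get(lemma, []):
            graph.add_edge(lemma, hypernym)
    return graph

G = nx.Graph()
G.add_edges_from([("striker", "goal"), ("goalkeeper", "save")])
ranks = nx.pagerank(enrich_lemma_graph(G))
```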

In the ways that I've built this previously, it required a compute-intensive random walk across the external resource (KG, thesaurus, etc.), but the results were dramatic. FWIW, I did not pull that code over into PTR, because the integrations with WordNet were originally in Java, although @dvsrepo fixed that with the spaCy extension :)

Working toward entity linking support in PTR was one of my main reasons for starting kglab as a Py abstraction layer for building KGs. The general notion is that we could import a KG from kglab into the pipeline configuration for pytextrank to optimize for a given domain context.

Also, this may be useful in other pipeline integrations, such as biome-text -- this was a topic (NLP + KG) that Daniel and I first began to explore a few years ago when we started teaching and working together.

That said, use of a KG with PTR pipelines could also help with restart probabilities and potentially provide a more generalized approach to embedding. For example, I really like the work in https://arxiv.org/abs/1709.02759, which formally describes a more generalized graph embedding that preserves semantics. I think that would be more in keeping with the "small, fast models as counterpoint to enormous transformers" opinionated aspect of PTR and so many of its use cases. Of course, I may be biased! Much to consider here, though I hope some of these details help.

The kglab library is now at the 0.1.4 release, and with another 2-3 months it would probably be ready for pytextrank integration. I have a hunch this will be faster and more general-purpose than Biased TextRank, although in terms of the math it touches on some of the same approaches? E.g., leveraging node and edge similarity.

In any case, I'd like to be agnostic toward any one approach, and provide multiple options in PTR to support a wider range of applications.

louisguitton commented 3 years ago

First, I'm not sure how much PyTextRank will need to change to adapt to spaCy 3.x features, or should change to leverage them more fully. I've been hoping to carve off lots of time over the winter holidays to test and scope that better.

Same here; I need to make some time to check out the spaCy v3 announcement, docs, and what's new ...

I like to leverage that, although I also recognize how PTR is part of the counterpoint argument for "not enormous models" and potentially leveraging domain knowledge. [...] In other words, I'd really like to pursue integration of domain knowledge.

I'm in team domain knowledge 200%. For my use cases, that trumps the rest. Feels like for practitioners like me, Occam's razor is still winning. In terms of the 2020 paper, I'm not that interested in implementing the full-fledged embedding-based approach (the 4th column of my table). However, it gave me the idea for column 3, which 1) is not in the paper, 2) might have potential (although I haven't experimented yet), and 3) I think belongs to team domain knowledge through the "task focus" input!

Much to consider here, though I hope some of these details help.

Yes, thanks so much for this write-up. It helps me put some of my thoughts in the context of a more general approach towards KGs, as opposed to being influenced solely by my own use case.

The kglab library is now at the 0.1.4 release, and with another 2-3 months it would probably be ready for pytextrank integration. I have a hunch this will be faster and more general-purpose than Biased TextRank, although in terms of the math it touches on some of the same approaches?

Great. I've seen https://github.com/DerwenAI/kglab/issues/33 .

So if I rephrase in practical terms, I see 3 things raised by the present issue on the pytextrank side:

  1. Add the PositionRank flavour by introducing personalised pagerank => simple and "orthogonal" to kglab, spaCy v3, etc.
  2. Add what I called the "simple Biased TextRank" flavour: a TextRank that uses personalised pagerank with a topical focus => we keep that "orthogonal" to word embeddings, and we can potentially reuse some of the PositionRank code since it also uses personalised pagerank (a sketch of the weighting follows after this list)
  3. Refactor PTR slightly, possibly by introducing "components" that reflect the rows of the table, in order to allow for more flavours of TextRank in the future => to be put on hold until we assess the spaCy v3 efforts; doesn't block 1. and 2.
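
For item 2, a sketch of the 0/1 focus weighting from my table (the function name is mine, and this assumes the graph nodes are lowercased lemmas):

```python
import networkx as nx

def focus_personalization(graph: nx.Graph, focus: str) -> dict:
    """Restart weights for "simple Biased TextRank": 1 for nodes that
    appear in the task focus, 0 otherwise. At least one node must
    match, or pagerank cannot normalise the distribution."""
    focus_tokens = set(focus.lower().split())
    return {node: 1.0 if node in focus_tokens else 0.0
            for node in graph.nodes}

# Usage, e.g. for an election night news report:
# ranks = nx.pagerank(G, personalization=focus_personalization(G, "Joe Biden"))
```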

And I see 1 thing raised on the kglab side:

  1. Integrate kglab with pytextrank: "The general notion is that we could import a KG from kglab into the pipeline configuration for pytextrank to optimize for a given domain context."

So, I think starting with a PR for PositionRank would be a great first contribution for me. I'd be quite proud of that, keeping in mind that personalised pagerank builds towards adding the "simple Biased TextRank" flavour too, and also goes in the same "big picture" direction as integrating kglab with pytextrank. Can't make promises on timing with the holidays etc., but I've been wanting to carve out some time to implement PositionRank for a while, so I'm quite happy about how this issue is turning out.

ceteri commented 3 years ago

That looks great @louisguitton! Adding a PR for PositionRank would add so much to PTR.

I'll work on the "Integrate kglab with pytextrank" item. I just finished getting the apidocs/mkdocs working and published on our Nginx/Gunicorn/Flask stack, so that can be reused over here for documenting PTR too.

And FWIW I may have a use case for applying personalised pagerank in kglab later.

Happy New Year, and I wish you all the best in 2021!