DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

Ideas for visualising key phrases together with text, as a modelling aid #194

Closed · DayalStrub closed this 1 year ago

DayalStrub commented 3 years ago

Just wanted to see what people thought about this...

I've been playing around with keyphrase extraction and, as well as looking at the altair plot pyTextRank produces, found it helpful to display the text with the key phrases highlighted. I ended up "hacking" doc.ents and using spaCy's displacy, so it isn't necessarily clean, and I'm not sure how it could be added as is. Still, I thought I would share it, since I think it would make a nice exploratory/modelling feature, similar to the existing extra viz functionality. On the other hand, it might be a common hack that people already know, though I haven't seen it elsewhere.

Here is an example output:

[image: example displacy output with key phrases highlighted]

NOTE: it only displays the top 10 key phrases, as the colours get quite busy otherwise, but one can easily drop the colouring.

And here is the code to reproduce and play with it:

# %%
import en_core_web_sm
import pytextrank
import random
import spacy

# %%
def generate_colour():
    ## random 24-bit colour, zero-padded so small values still yield a full "#rrggbb" code
    return "#{:06x}".format(random.randint(0, 0xFFFFFF))

# %%
def hack_ents(doc, n_phrases=10, precision=5):
    phrases = doc._.phrases

    ## filter to top n_phrases
    if (n_phrases is not None) and len(phrases) > n_phrases:
        phrases = phrases[0:n_phrases]

    keyphrases = []
    for p in phrases:
        if p.rank > 0:
            for chunk in p.chunks:
                chunk.label_ = str(round(p.rank, precision))
                keyphrases.append(chunk)
    ## NOTE filter_spans drops key phrases that overlap
    keyphrases = spacy.util.filter_spans(keyphrases)

    ## replace any existing entities with the key-phrase spans
    doc.ents = keyphrases

    return doc

# %%
nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);

# %%
# from dat/lee.txt
text = """
After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul.  The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Though machines have beaten the best humans at chess, checkers, Othello, Scrabble, Jeopardy!, and so many other games considered tests of human intellect, they have never beaten the very best at Go. Game Three is set for Saturday afternoon inside Seoul's Four Seasons hotel.  The match is a way of judging the suddenly rapid progress of artificial intelligence. One of the machine-learning techniques at the heart of AlphaGo has already reinvented myriad online services inside Google and other big-name Internet companies, helping to identify images, recognize commands spoken into smartphones, improve search engine results, and more. Meanwhile, another AlphaGo technique is now driving experimental robotics at Google and places like the University of California at Berkeley. This week's match can show how far these technologies have come - and perhaps how far they will go.  Created in Asia over 2,500 year ago, Go is exponentially more complex than chess, and at least among humans, it requires an added degree of intuition. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. He is currently ranked number five in the world, and according to Demis Hassabis, who leads DeepMind, the Google AI lab that created AlphaGo, his team chose the Korean for this all-important match because they wanted an opponent who would be remembered as one of history's great players.  
Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict. In his 1996 match with IBM's Deep Blue supercomputer, world chess champion Gary Kasparov lost the first game but then came back to win the second game and, eventually, the match as a whole. It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest. The thing to realize is that, after playing AlphaGo for the first time on Wednesday, Lee Sedol could adjust his style of play - just as Kasparov did back in 1996. But AlphaGo could not. Because this Google creation relies so heavily on machine learning techniques, the DeepMind team needs a good four to six weeks to train a new incarnation of the system. And that means they can't really change things during this eight-day match.  "This is about teaching and learning," Hassabis told us just before Game Two. "One game is not enough data to learn from - for a machine - and training takes an awful lot of time."
"""

# %%
doc = nlp(text)

# %%
doc = hack_ents(doc)

labels = [e.label_ for e in doc.ents]
colours = {label: generate_colour() for label in labels}
options = {"colors": colours}

# options = {}

spacy.displacy.render(
    doc, style="ent", options=options, page=False, jupyter=True, minify=True
)
tomaarsen commented 3 years ago

This seems like a wonderful way to show, quickly and clearly, both what PyTextRank does and how it can be used. The produced image could work great as a graphical elevator pitch, and having another example of how PTR can be used is always welcome.

I'd be in favor of adding this as an example. Perhaps the README image could link directly to the Jupyter notebook; that way, developers can play around with PTR within minutes.

DayalStrub commented 3 years ago

On top of an example/README entry, which you know better than I do, I was wondering whether this could be useful in the same way as plot_keyphrases, added as something like:

tr = doc._.textrank
tr.display_keyphrases()

but it would be similarly exploratory, so I'm not sure whether you think it's worth adding to the codebase. We would also have to do it in a way that doesn't actually affect doc.ents, maybe by creating a copy of doc.

Also, I think the colouring might be better on a scale of white to green, say, so one immediately sees which phrases are more key, though it would possibly make the phrases less distinguishable.
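For what it's worth, displacy's manual mode (manual=True) can render spans from a plain dict, which would sidestep mutating doc.ents entirely, no copy of doc needed. A rough sketch of that approach (keyphrase_payload is a made-up name, not an existing PTR or spaCy function):

```python
import spacy

def keyphrase_payload(doc, n_phrases=10, precision=5):
    """Build a displacy 'manual' payload from doc._.phrases, leaving doc.ents untouched."""
    spans, rank_of = [], {}
    for p in doc._.phrases[:n_phrases]:
        if p.rank > 0:
            for chunk in p.chunks:
                spans.append(chunk)
                rank_of[(chunk.start_char, chunk.end_char)] = round(p.rank, precision)
    ## filter_spans drops overlapping chunks, keeping the longer span
    ents = [
        {"start": s.start_char, "end": s.end_char,
         "label": str(rank_of[(s.start_char, s.end_char)])}
        for s in spacy.util.filter_spans(spans)
    ]
    ## displacy expects the entity dicts sorted by start offset
    return {"text": doc.text, "ents": sorted(ents, key=lambda e: e["start"])}
```

One could then call spacy.displacy.render(keyphrase_payload(doc), style="ent", manual=True, options=options) and the original doc.ents would stay untouched.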

tomaarsen commented 3 years ago

Also, I think the colouring might be better on a scale of white to green, say, so one immediately sees which phrases are more key, though it would possibly make the phrases less distinguishable.

Regarding this, I agree. The current system seems to randomly pick some color, meaning that some borderline-unreadable color can be picked, e.g. a really dark purple or one that's nearly black.
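One stdlib-only way to avoid unreadable picks would be to randomise only the hue and fix the lightness, e.g. via colorsys (generate_readable_colour is just an illustrative name, not part of the code above):

```python
import colorsys
import random

def generate_readable_colour():
    """Random hue, but fixed high lightness so dark text stays legible on top."""
    hue = random.random()
    ## lightness 0.8 / saturation 0.9 keeps every channel well away from black
    r, g, b = colorsys.hls_to_rgb(hue, 0.8, 0.9)
    return "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))
```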

DayalStrub commented 3 years ago

The current system seems to randomly pick some color, meaning that some borderline unreadable color can be picked.

Good point. I experimented with a colour scale based on score, and even with something lazy quickly put together with matplotlib, I get this

[image: output rendered with the matplotlib Oranges colour scale]

which might be better, but is still a bit on the dark side at the top of the scale.

Note: the new generate_colours is

from matplotlib import cm
from matplotlib.colors import rgb2hex

def generate_colours(labels):
    oranges = cm.get_cmap('Oranges')
    values = [float(label) for label in labels]
    ## better to normalise to 1 / len(doc) -> 0, and then use a red-yellow-green (RdYlGn) scale, given TextRank starts with uniform distribution of score?
    span = (max(values) - min(values)) or 1.0  ## guard against all ranks being equal
    return {
        str(value): rgb2hex(oranges((value - min(values)) / span))
        for value in values
    }

and the rest is mostly the same.
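To sketch the RdYlGn idea from the comment in that snippet: treating 1 / len(doc), TextRank's uniform starting score, as the bottom of the scale might look roughly like this (generate_colours_rdylgn and its n_tokens parameter are made up for illustration; the matplotlib.colormaps registry assumes matplotlib >= 3.5):

```python
import matplotlib
from matplotlib.colors import rgb2hex

def generate_colours_rdylgn(labels, n_tokens):
    """Map rank labels onto red-yellow-green, anchoring the low end at 1 / n_tokens."""
    scale = matplotlib.colormaps["RdYlGn"]
    baseline = 1.0 / n_tokens
    top = max(float(label) for label in labels)
    span = max(top - baseline, 1e-9)  ## avoid dividing by zero when all ranks tie
    colours = {}
    for label in labels:
        ## clamp to [0, 1]: ranks at the uniform baseline map to red, the top rank to green
        t = min(max((float(label) - baseline) / span, 0.0), 1.0)
        colours[label] = rgb2hex(scale(t))
    return colours
```

Callers would pass n_tokens as len(doc); everything else stays the same as above.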

ceteri commented 2 years ago

This looks great! @DayalStrub, thank you for all the work and suggestions on PyTextRank. If it'd help, we've got a Slack board for the committers; if you'd like to join and discuss further, email me at paco AT derwen DOT ai for a link.

Hellisotherpeople commented 2 years ago

I am about to experiment with integrating this package into my webapp, https://huggingface.co/spaces/Hellisotherpeople/Unsupervised_Extractive_Summarization, which is a (somewhat incomplete) port of my package CX_DB8: https://github.com/Hellisotherpeople/CX_DB8

I feel compelled to link it here because I independently tackled this problem (visualizations of extractive summaries) at a time when I was unaware of this package (and I don't think it existed quite in its current form either!). It may help, or at least give some inspiration.

@DayalStrub @ceteri thank you both for the hard work on this project and for making my life a LOT easier in the next few weeks. My goal is to have a webapp which hosts basically every technique we can think of for extractive and query-focused extractive summarization. Eventually this will also need to include MMR and related methods (which are implemented in KeyBERT).