Closed DayalStrub closed 1 year ago
This seems like a wonderful way to very quickly and clearly show both what PyTextRank does, and how it can be used. The produced image can work great as a graphical elevator pitch, and having another example of how PTR can be used is always preferred.
I'd be in favor of adding it to the `examples` folder, either as a standalone Jupyter notebook or included in the existing `sample.ipynb`. Perhaps the README image can be a link to the Jupyter notebook directly. That way, developers can play around with PTR within minutes.
On top of example/README, which you know better, I was wondering whether it could be useful to have something similar to `plot_keyphrases`, added as something like:

    tr = doc._.textrank
    tr.display_keyphrases()

but it would be similarly exploratory, so I'm not sure whether you think it's worth adding to the codebase. We would also have to do it in a way that doesn't actually affect `doc.ents`, maybe by creating a copy of `doc`.
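One way such a helper could avoid touching `doc.ents` at all is to skip the `Doc` copy and hand character offsets to displaCy's manual mode instead. The sketch below is only an illustration of that idea: the function names, the `(rank, spans)` input shape, and the `top_n` default are assumptions, not part of PyTextRank.

```python
def build_displacy_input(text, phrases, top_n=10):
    """Build the dict that displaCy's manual "ent" mode expects.

    `phrases` is an iterable of (rank, [(start_char, end_char), ...]) pairs.
    This input shape is an assumption made for the sketch.
    """
    # Keep only the top_n highest-ranked phrases.
    top = sorted(phrases, key=lambda p: p[0], reverse=True)[:top_n]
    ents, taken = [], []
    for rank, spans in top:
        for start, end in spans:
            # Skip a span that overlaps one already kept for a higher-ranked phrase.
            if any(start < e and s < end for s, e in taken):
                continue
            taken.append((start, end))
            # Fixed-point labels keep the score strings free of exponent notation.
            ents.append({"start": start, "end": end, "label": f"{rank:.3f}"})
    # Return the highlighted spans in document order.
    ents.sort(key=lambda e: e["start"])
    return {"text": text, "ents": ents, "title": None}


def display_keyphrases(text, phrases, top_n=10, colours=None):
    """Render the key phrases with displaCy without modifying any Doc."""
    from spacy import displacy  # imported lazily; only needed for rendering

    data = build_displacy_input(text, phrases, top_n)
    options = {"colors": colours} if colours else {}
    return displacy.render(data, style="ent", manual=True, options=options)
```

With PyTextRank, the `phrases` argument could presumably be built from `doc._.phrases`, taking each phrase's `rank` and the `start_char` and `end_char` of each span in its `chunks`.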
Also, I think the colouring might be better on a scale of white to green, say, so one immediately, visually picks up on which phrases are more key, though it would possibly make the phrases less distinguishable.
> Also, I think the colouring might be better on a scale of white to green, say, so one immediately, visually picks up on which phrases are more key, though it would possibly make the phrases less distinguishable.
Regarding this, I agree. The current system seems to randomly pick some color, meaning that some borderline unreadable color can be picked. E.g. really dark purple, or borderline black.
> The current system seems to randomly pick some color, meaning that some borderline unreadable color can be picked.
Good point. I experimented with a colour scale based on score, and even being lazy and quickly putting something together with matplotlib, I get this [image], which might be better, though it is still a bit on the dark side at the top of the scale.
Note: the new `generate_colours` is:

    from matplotlib import colormaps
    from matplotlib.colors import rgb2hex


    def generate_colours(labels):
        """Map each score label to a hex colour on the Oranges scale."""
        # The colormaps registry replaces the deprecated cm.get_cmap.
        oranges = colormaps["Oranges"]
        scores = [float(label) for label in labels]
        low, high = min(scores), max(scores)
        # Avoid dividing by zero when every score is identical.
        span = (high - low) or 1.0
        ## better to normalise to 1 / len(doc) -> 0, and then use a red-yellow-green (RdYlGn) scale, given TextRank starts with uniform distribution of score?
        return {str(score): rgb2hex(oranges((score - low) / span)) for score in scores}

and the rest is mostly the same.
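The darkness at the top of the scale might be tamed by interpolating from white only as far as a mid-tone green, along the lines of the white-to-green idea above. Here is a dependency-free sketch of that. The name `generate_green_colours`, the helper `score_to_hex`, and the particular RGB endpoints are hypothetical choices for illustration.

```python
def score_to_hex(score, low, high, light=(255, 255, 255), dark=(102, 194, 120)):
    """Linearly interpolate between two RGB colours and return a hex string."""
    span = high - low
    # With identical scores there is no scale, so fall back to the midpoint.
    t = 0.5 if span == 0 else (score - low) / span
    # Clamp scores that fall outside [low, high].
    t = min(1.0, max(0.0, t))
    rgb = (round(l + t * (d - l)) for l, d in zip(light, dark))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def generate_green_colours(labels):
    """Map score labels to hex colours, from white (lowest) to mid green (highest)."""
    if not labels:
        return {}
    scores = [float(label) for label in labels]
    low, high = min(scores), max(scores)
    return {str(label): score_to_hex(float(label), low, high) for label in labels}
```

Because the darkest colour is fixed by the `dark` endpoint rather than by the colormap, it can be lightened until the highlighted text stays readable.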
This looks great! @DayalStrub, thank you for all the work and suggestions on PyTextRank. If it'd help, we've got a Slack board for the committers, if you'd like to join and discuss further. Email me at paco AT derwen DOT ai for a link.
I am about to experiment with integrating this package into my webapp - https://huggingface.co/spaces/Hellisotherpeople/Unsupervised_Extractive_Summarization which is a (somewhat incomplete) port of my package CX_DB8 - https://github.com/Hellisotherpeople/CX_DB8
I feel compelled to link it here as I independently tackled this problem (visualizations of extractive summaries), and at the time I was unaware of this package (and I don't think it existed quite in its current form either!). It may help or at least give inspiration.
@DayalStrub @ceteri thank you both for the hard work on this project and making my life a LOT easier in the next few weeks. My goal is to have a webapp which hosts basically every single technique we can think of for extractive and query focused extractive summarization. This will also need to eventually include MMR and related methods (which are implemented in KeyBERT)
Just wanted to see what people thought about this...
I've been playing about with keyphrase extraction and, as well as looking at the altair plot PyTextRank produces, found it helpful to display the text with the key phrases. I ended up "hacking" `doc.ents` and using spaCy's `displacy`, so it's not necessarily clean and I'm not sure how it could be added as is, but I thought I would share it, as I do think it would make a nice exploratory/modelling feature, similar to the extra viz functionality. On the other hand, it might be a common hack that people already know, but I haven't seen it elsewhere.

Here is an example output:
NOTE: It is only displaying the top 10 key phrases as the colours get quite busy, but one can easily drop the colouring.
And here is the code to reproduce and play with it: