Open semmyk-research opened 1 year ago
I believe this relates to a comment I saw here a while ago. My main concern with the area map visualization is that it does not handle a large number of topics well out of the box. This effect is especially strong when there are multiple topics that differ quite significantly across time, which would then muddle the resulting visualization. The image that you give, for example, is the ideal data representation, but that happens quite rarely in practice and requires careful selection of which topics to show. Something to fix this is a stacked area map, but that is typically quite difficult to interpret since there is essentially a separate y-axis for each individual line.
Thanks @MaartenGr for the response and for pointing out @pariskang's earlier suggestion (which I missed). Apologies for only getting back to you now. I've been 'away' on some research work.
I gave your concern and suggestion some thought. The current visualize_topics_over_time 'line' plot works great. No doubt about that. In the information systems domain that I reside in (and to an extent in the social sciences), showing the 'picture', that is, the 'narrative', is often desired. That is where the area map comes in. However, as you rightly pointed out, area maps have their limitations (and flaws, such as the sine illusion).
#Pause: Regarding the extracted image I gave, it was made possible, as you might have guessed, by the ability to select the topics one is interested in. That feature works well for narrative purposes.
Upon reflecting, and noting the aim, my proposition will be
Streamgraph plot (stream graph, River plot)
Plotting streamgraph
Streamgraph with Matplotlib. Typically:
a. stack plot
b. centred baseline, leveraging groupby. See, for instance, Thiago's approach.
c. one can do some NumPy reshaping and smoothing
NB: Fortunately, Matplotlib has recognised the value of streamgraphs and provided a 'native' example, as seen here: by adjusting baseline and 'smoothing' with gaussian.
A Python 'walkthrough' by @holtzy on leveraging baseline (for axis), and smoothing (gaussian, grid, colour blend) Matplotlib's streamgraph
[updated] Oops, I omitted using Python's Altair for streamgraph. See Cole Hagen's writeup here.
Streamgraph with Plotly
Proposed BERTopic approach. I recommend the following:
5.1. [limit choice overload] We extend visualize_topics_over_time with an optional argument, type: str = None. The accepted values would be None (the default, where None masquerades as 'line'), type='area' (for an area plot), and type='stream' (for a streamgraph). We leverage Empet's Plotly attempt for the streamgraph.
5.2. If choice overload is not a concern here, we keep visualize_topics_over_time as-is (for the line plot) and implement another BERTopic class method, visualize_topics_over_time_stream().
I should be able to put together some code for further engagement and pull request.
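As a rough illustration of the centred-baseline idea behind the streamgraph options listed above, here is a minimal sketch in plain Python. It is not code from BERTopic, Matplotlib, or Plotly; the function name stream_layers and the sample data are hypothetical. It computes, for each topic, the lower and upper boundary of its band when the stack is centred around y=0, which is the same idea as Matplotlib's stackplot with baseline="sym".

```python
from typing import List, Tuple

Layer = Tuple[List[float], List[float]]


def stream_layers(series: List[List[float]]) -> List[Layer]:
    """Return (lower, upper) band boundaries per series for a streamgraph
    with a symmetric baseline (the whole stack is centred on y=0)."""
    if not series:
        return []
    n_points = len(series[0])
    if any(len(s) != n_points for s in series):
        raise ValueError("all series must have the same number of time steps")

    # Symmetric baseline: at each time step, start at minus half the total.
    totals = [sum(s[t] for s in series) for t in range(n_points)]
    lower = [-total / 2 for total in totals]

    layers: List[Layer] = []
    for s in series:
        # Each band sits directly on top of the previous one.
        upper = [lo + value for lo, value in zip(lower, s)]
        layers.append((lower, upper))
        lower = upper
    return layers


# Hypothetical topic frequencies over four time steps (illustrative only).
topic_frequencies = [
    [2.0, 4.0, 6.0, 4.0],
    [4.0, 4.0, 2.0, 2.0],
]
layers = stream_layers(topic_frequencies)
for topic_index, (low, high) in enumerate(layers):
    print(topic_index, low, high)
```

Each (lower, upper) pair could then be drawn as a filled band by whichever plotting library is in use. Note that no band has its own zero line, which is the interpretability concern raised about streamgraphs in this thread.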
Thank you for taking the time to write this out and doing the research!
With respect to the streamgraph, I have a few similar concerns. They mostly relate to the interpretability of the y-axis itself. Due to the, somewhat, non-existence of a y-axis in a streamgraph, interpreting lines on an individual level becomes quite difficult. There is the same problem with how busy graphs can be. Also, isn't the size illusion even more pronounced in a streamgraph since it has more sine-like structures?
Having said that, I think your suggestion for having a parameter that changes the basic structure of the visualization is a nice way of making minimal API changes whilst still giving users the option to go for the type of graph that suits their needs best. Something like graph_type: str = "line" would be a nice implementation. In order to keep changes and upkeep of code minimal, I do propose only doing this for graph_type="fill" and graph_type="area" instead of the streamgraph since it is not natively supported by Plotly. As such, the changes would be rather minimal.
Add graph_type: str = "line" to the input parameters
Add fill="tozeroy" if graph_type == "fill" else None to go.Scatter
Add stackgroup="one" if graph_type == "area" else None to go.Scatter
Describe graph_type in the docstrings
This would mean only three lines of code that were changed with some small documentation. What do you think?
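The mapping from graph_type to the two go.Scatter arguments listed above can be sketched as a small helper. This is only an illustration under assumptions: the helper name scatter_style_kwargs and the up-front validation are hypothetical additions that are not proposed in this thread. The validation is there so that a misspelled value fails loudly instead of silently falling back to a line plot.

```python
from typing import Any, Dict

VALID_GRAPH_TYPES = ("line", "fill", "area")


def scatter_style_kwargs(graph_type: str = "line") -> Dict[str, Any]:
    """Map graph_type to the 'fill' and 'stackgroup' keyword arguments
    that would be passed to plotly.graph_objects.Scatter."""
    if graph_type not in VALID_GRAPH_TYPES:
        raise ValueError(
            f"graph_type must be one of {VALID_GRAPH_TYPES}, got {graph_type!r}"
        )
    return {
        # 'fill': shade from the trace down to y=0.
        "fill": "tozeroy" if graph_type == "fill" else None,
        # 'area': traces sharing a stackgroup are stacked on top of each other.
        "stackgroup": "one" if graph_type == "area" else None,
    }


for option in VALID_GRAPH_TYPES:
    print(option, scatter_style_kwargs(option))
```

Inside the plotting function, the result could be unpacked into the trace, for example go.Scatter(x=..., y=..., mode="lines", **scatter_style_kwargs(graph_type)). Plotly is deliberately not imported in the sketch so that it runs on its own.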
Thanks @MaartenGr. Love your suggestion/approach. {I'll love to see streamgraph ;-) I'll work on a hack when opportuned}. I understand the need for #minimal API changes, rather than being left behind with unsupported features!
PS: streamgraph has 'more' Sine illusion. Ironically, it aids the 'stream' flow in streamgraph!
My understanding is we use ternary operators in go.Scatter
We might need some adjustments to the visuals (look and feel) in update_layout
While at it, we can give users 'control' on colours. Though this might come at a 'cost' with some users not mindful of 'contrast'.
Attempt (with comments: to clean up) #https://github.dev/semmyk-research/BERTopic/blob/ab7f3135c5166dbfeb4ba3e49b129fc93b491c86/bertopic/plotting/_topics_over_time.py#L6
def visualize_topics_over_time(topic_model,
                               topics_over_time: pd.DataFrame,
                               top_n_topics: int = None,
                               topics: List[int] = None,
                               normalize_frequency: bool = False,
                               custom_labels: bool = False,
                               width: int = 1250,
                               height: int = 450,
                               graph_type: str = "line",
                               colors: List[str] = None) -> go.Figure:
    """ Visualize topics over time

    Arguments:
        ... ...
        #SemmyK: 17Nov22
        graph_type: The type of graph to visualise. The options are:
                    'line' for default line plot
                    'fill' for filling up (selected) topics to y=0
                    'area' for (stacked) area chart
        colors: List of hex strings or named css color
    """
    ...
    fig.add_trace(go.Scatter(x=trace_data.Timestamp, y=y,
                             mode='lines',
                             marker_color=colors[index % len(colors)],  # SemmyK: was colors[index % 7]; allow any palette size
                             hoverinfo="text",
                             name=topic_name,
                             hovertext=[f'<b>Topic {topic}</b><br>Words: {word}' for word in words if len(words) > 1],  # SemmyK: len(words) > 1 inserted as a safeguard
                             fill='tozeroy' if graph_type == 'fill' else None,  # SemmyK: ternary per issue #813
                             stackgroup='one' if graph_type == 'area' else None  # SemmyK: ternary per issue #813 for (stacked) area plot
                             ))
    fig.update_layout(
        ...
        hovermode='x unified',  # SemmyK: single hover label. Has 'side effect' with large selection.
        hoverlabel=dict(
            bgcolor="rgba(0,0,0,.05)",  # was "white"; SemmyK: adjusted transparency
            font_size=12,  # was 16; SemmyK: 12 works great for me. NB: personal preference
            font_family="Rockwell",
            bordercolor="rgba(0,0,0,0)"  # SemmyK: remove line border. Visually appealing
        ),
    )
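One detail in the attempt above is worth a hedged note: colors defaults to None, so colors[index % len(colors)] would raise a TypeError unless a fallback palette is assigned in the elided part of the function. A minimal sketch of such a fallback follows. The helper name pick_color and the DEFAULT_COLORS placeholder palette are assumptions made for illustration, not taken from the BERTopic code.

```python
from typing import List, Optional

# Placeholder seven-colour palette, used only for this illustration.
DEFAULT_COLORS = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#D55E00", "#0072B2", "#CC79A7",
]


def pick_color(index: int, colors: Optional[List[str]] = None) -> str:
    """Cycle through the user-supplied palette, or through the default
    palette when none (or an empty list) is given."""
    palette = colors if colors else DEFAULT_COLORS
    return palette[index % len(palette)]


# With no user palette, the eighth trace wraps around to the first colour.
print(pick_color(7))
# With a two-colour user palette, the colours alternate.
print([pick_color(i, ["red", "blue"]) for i in range(4)])
```

Cycling with index % len(palette) keeps the call safe for any palette length, which is what the change from the hard-coded % 7 in the attempt is aiming for.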
> {I'll love to see streamgraph ;-) I'll work on a hack when opportuned}.
Before you start working on that, I am not sure if a hack to create such a graph is something that fits within the stability of BERTopic. I want to prevent such hacks as much as possible as I cannot guarantee any support with respect to visualizations that are not natively supported by Plotly.
> PS: streamgraph has 'more' Sine illusion. Ironically, it aids the 'stream' flow in streamgraph!
Yep, that's what I indeed expected.
> My understanding is we use ternary operators in go.Scatter
Yes, only a few lines of code would be needed to be changed in order to implement a selection between line, area, and fill.
> While at it, we can give users 'control' on colours. Though this might come at a 'cost' with some users not mindful of 'contrast'. We might need some adjustments to the visuals (look and feel) in update_layout
I am not so sure about changing the look as of right now. Most visualizations have a similar style and changing one would often result in changing all others.
> Attempt (with comments: to clean up) #https://github.dev/semmyk-research/BERTopic/blob/ab7f3135c5166dbfeb4ba3e49b129fc93b491c86/bertopic/plotting/_topics_over_time.py#L6
I would focus on first creating minimal changes:
graph_type parameter
graph_type documentation
go.Scatter:
fill = 'tozeroy' if graph_type=='fill' else None,
stackgroup = 'one' if graph_type=='area' else None
That way, significant functionality is added with only a few lines of code whilst keeping true to the main concern, the types of graph, instead of things like colors and style, which also touch upon other visualizations.
For my research work, I was looking for ways to 'present' my topics_over_time in a more visually appealing way. The hover-on topics trend over time is great. How about a sort of river-flow or area map plot, so that a reader can quickly see the 'volume' of change over time? PS: I'm open to a more pythonic way of getting this done. PS: I'm open to more of BERTopic's under-the-hood hooks
NB: To reduce choice overload and still keep with BERTopic philosophy of being basic, we can simply implement this with an arg within the existing visualize_topics_over_time.
[My approach]
check_is_fitted(self)
return plotting.visualize_topics_over_time_area(self,