BERTopic: visualize_topics_over_time_area

For my research work, I was looking for ways to present my topics_over_time in a more visually appealing way. The hover-on topic trend over time is great. How about a sort of river-flow or area map plot, so that a reader can quickly see the volume of change over time? PS: I'm open to a more Pythonic way of getting this done, and to using more of BERTopic's under-the-hood hooks.

NB: To reduce choice overload and stay with the BERTopic philosophy of keeping things basic, we could simply implement this as an argument to the existing visualize_topics_over_time.

[My approach]

In plotting/__init__.py, include an entry for visualize_topics_over_time_area

{extracts}
from ._topics_over_time_area import visualize_topics_over_time_area
__all__ = [
    ...,
    "visualize_topics_over_time_area"
]

In plotting, add a new file, _topics_over_time_area.py (matching the import above). The base is _topics_over_time.py.

NB: the _topics_over_time_area.py file has comments that I need to clean up.

{extracts}
marker_color=colors[index % len(colors)], ##SemmyK: [index % 11], :for topics >7
stackgroup='one'     ##SemmyK: Key to area plot
...
hovermode='x unified',             ##SemmyK: single hover label. Perhaps, I should use 'x'.
bgcolor="rgba(0,0,0,.05)",  #"white", ##SemmyK:adjusted transparency
bordercolor = "rgba(0,0,0,0)"  ##SemmyK:remove line border

In my working file, I extend BERTopic to use the extended functionality


{extracts}
class BERTopicAreaPlot(BERTopic):
    def visualize_topics_over_time_area(self, ... ...

        check_is_fitted(self)
        return plotting.visualize_topics_over_time_area(self,


{extracts}
- To visualise, I called model_ngram_dtm_area.visualize_topics_over_time_area(... ...)
- Sample view
![image](https://user-images.githubusercontent.com/113531105/198902518-cef69e19-30fb-4851-90ec-e847f1eb9d98.png)

I'm open to a pull request.

I believe this relates to a comment I saw here a while ago. My main concern with the area map visualization is that it does not handle a large number of topics well out of the box. The effect is especially strong when multiple topics differ quite significantly across time, which muddles the resulting visualization. The image you give, for example, is the ideal case, but that happens quite rarely in practice and requires careful selection of which topics to show. A stacked area map could address this, but it is typically quite difficult to interpret, since each layer is drawn on top of the one below it and so there is essentially a separate y-axis for each individual line.

Thanks @MaartenGr for the response and for pointing out @pariskang's earlier suggestion (which I missed). Apologies for only getting back to you now. I've been 'away' on some research work.

I gave your concern and suggestion some thought. The current visualize_topics_over_time 'line' plot works great. No doubt about that. In the information systems domain that I reside in (and to an extent in the social sciences), showing the 'picture', that is, the 'narrative', is often desired. That is where the area map comes in. However, as you rightly pointed out, area maps have their limitations (and flaws, such as the sine illusion).

#Pause: Regarding the extracted image I gave, it was made possible, as you might have guessed, by the ability to select the topics one is interested in. That feature is great for narrative purposes.

Upon reflection, and noting the aim, my proposal is as follows:

Streamgraph plot (stream graph, River plot)
- A streamgraph comes in handy for visualising the evolution of several groups (multiple variables), though it has its own limit on the number of groups it can display before things get out of hand. To mitigate this, to an extent, one must be very careful with the choice of colour blend (colour range).
- A streamgraph would 'resolve' the 'y-axis' dilemma of the stacked area plot.
- In any case, a streamgraph, as currently implemented, often starts off from a stacked area plot!
Plotting streamgraph
- native support for streamgraph is limited or not pervasive in Python (unlike R and some others).
- Most implementations of (the few) streamgraph in Python are done using Matplotlib
Streamgraph with Matplotlib
Typically:
a. stack plot
b. baseline center leveraging groupby. See, for instance, Thiago's approach.
c. one can do some NumPy reshaping and smoothing

NB: Fortunately, Matplotlib has recognised the value of streamgraphs and provided a 'native' example, as seen here: by adjusting baseline and 'smoothing' with gaussian.
- A Python 'walkthrough' by @holtzy on leveraging baseline (for axis), and smoothing (gaussian, grid, colour blend) Matplotlib's streamgraph
- [updated] Oops, I omitted using Python's Altair for streamgraph. See Cole Hagen's writeup here.
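
To make the 'baseline center' step concrete, here is a minimal sketch in plain Python, with no plotting library involved. It assumes the symmetric baseline, where the stack at each time step starts at minus half of the total so that the layers are centred around zero (as far as I know, this is what `baseline="sym"` does in Matplotlib's `stackplot`). The function name `symmetric_stack` and the sample counts are hypothetical, and smoothing is left out.

```python
from typing import List, Tuple

Band = Tuple[List[float], List[float]]  # (lower edge, upper edge) of one layer


def symmetric_stack(series: List[List[float]]) -> List[Band]:
    """Stack equal-length series around a centred baseline.

    Each inner list holds one topic's frequency per time step.
    Returns, per topic, the lower and upper edge of its band.
    """
    n_steps = len(series[0])
    # Total height of the stack at every time step.
    totals = [sum(layer[t] for layer in series) for t in range(n_steps)]
    # Centred baseline: start half the total below zero.
    lower = [-0.5 * total for total in totals]
    bands: List[Band] = []
    for layer in series:
        upper = [low + value for low, value in zip(lower, layer)]
        bands.append((lower, upper))
        lower = upper  # the next layer sits on top of this one
    return bands


# Hypothetical frequencies for two topics over two time steps.
bands = symmetric_stack([[1, 2], [3, 2]])
for lower_edge, upper_edge in bands:
    print(lower_edge, upper_edge)
```

The bottom edge of the first band and the top edge of the last band mirror each other around zero, which is what gives the 'stream' look; it is also why no single layer can be read off a common y-axis.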
Streamgraph with Plotly
- Fortunately, @empet has attempted Python-based plot of streamgraph in Plotly.
- As I understand it, the trick is in the trace 'name', 'type', 'shape', and the layout
Proposed BERTopic approach
I recommend the following:
5.1 [limit choice overload] We extend visualize_topics_over_time with an argument type: str = None. The accepted values will be None (the default, where None masquerades as 'line'), type='area' (for area plot), and type='stream' (for streamgraph). We leverage Empet's Plotly attempt for the streamgraph.
5.2 If choice overload is not a concern here, we keep visualize_topics_over_time as-is (for line plot), and implement another BERTopic class method, visualize_topics_over_time_stream()
- the default here will be streamgraph, with the option of an area map plot.

I should be able to put together some code for further engagement and pull request.

Thank you for taking the time to write this out and doing the research!

With respect to the streamgraph, I have a few similar concerns. They mostly relate to the interpretability of the y-axis itself. Because a streamgraph has, in effect, no y-axis, interpreting lines on an individual level becomes quite difficult. There is the same problem with how busy graphs can be. Also, isn't the size illusion even more pronounced in a streamgraph, since it has more sine-like structures?

Having said that, I think your suggestion for having a parameter that changes the basic structure of the visualization is a nice way of making minimal API changes whilst still giving users the option to go for the type of graph that suits their needs best. Something like graph_type: str = "line" would be a nice implementation. In order to keep changes and upkeep of code minimal, I do propose only doing this for graph_type="fill" and graph_type="area" instead of the streamgraph since it is not natively supported by Plotly. As such, the changes would be rather minimal.

Add graph_type: str = "line" to the input parameters
Add fill="tozeroy" if graph_type == "fill" else None to go.Scatter
Add stackgroup="one" if graph_type == "area" else None to go.Scatter
Add documentation for graph_type in the docstrings

This would mean only three changed lines of code plus a small amount of documentation. What do you think?
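
As a rough sketch of the change listed above, the two ternaries can be gathered into one small helper that returns the extra keyword arguments for go.Scatter. `fill` and `stackgroup` are existing go.Scatter arguments; the helper name `trace_style`, the `VALID_GRAPH_TYPES` tuple, and the validation step are assumptions added for illustration and are not part of BERTopic.

```python
from typing import Dict, Optional

# Hypothetical constant: the three graph types discussed in this thread.
VALID_GRAPH_TYPES = ("line", "fill", "area")


def trace_style(graph_type: str = "line") -> Dict[str, Optional[str]]:
    """Map graph_type to the go.Scatter keyword arguments it implies."""
    if graph_type not in VALID_GRAPH_TYPES:
        # With bare ternaries a misspelled value would silently
        # fall back to a line plot, so fail loudly instead.
        raise ValueError(
            f"graph_type must be one of {VALID_GRAPH_TYPES}, got {graph_type!r}"
        )
    return {
        "fill": "tozeroy" if graph_type == "fill" else None,
        "stackgroup": "one" if graph_type == "area" else None,
    }


for kind in VALID_GRAPH_TYPES:
    print(kind, trace_style(kind))
```

Inside the plotting function this would be used as `go.Scatter(x=..., y=..., mode="lines", **trace_style(graph_type))`. Passing `None` for either argument leaves Plotly's default behaviour in place.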

Thanks @MaartenGr. Love your suggestion/approach. {I'll love to see streamgraph ;-) I'll work on a hack when opportuned}. I understand the need for minimal API changes, rather than being left behind with unsupported features!
PS: streamgraph has 'more' Sine illusion. Ironically, it aids the 'stream' flow in streamgraph!

My understanding is we use ternary operators in go.Scatter
We might need some adjustments to the visuals (look and feel) in update_layout
While at it, we can give users 'control' on colours. Though this might come at a 'cost' with some users not mindful of 'contrast'.

Attempt (with comments: to clean up) #https://github.dev/semmyk-research/BERTopic/blob/ab7f3135c5166dbfeb4ba3e49b129fc93b491c86/bertopic/plotting/_topics_over_time.py#L6

def visualize_topics_over_time(topic_model,
                               topics_over_time: pd.DataFrame,
                               top_n_topics: int = None,
                               topics: List[int] = None,
                               normalize_frequency: bool = False,
                               custom_labels: bool = False,
                               width: int = 1250,
                               height: int = 450,
                               graph_type: str = "line",
                               colors: List[str] = None) -> go.Figure:
    """ Visualize topics over time

    Arguments:
        ... ...
        #SemmyK: 17Nov22
        graph_type: The type of graph to visualise. The options are:
                        'line' for default line plot
                        'fill' for filling up (selected) topics to y=0
                        'area' for (stacked) area chart
        colors: List of hex strings or named CSS colors
    """
    ...
    fig.add_trace(go.Scatter(x=trace_data.Timestamp, y=y,
                             mode='lines',
                             marker_color=colors[index % len(colors)],  # SemmyK: cycle through however many colors are given (was colors[index % 7])
                             hoverinfo="text",
                             name=topic_name,
                             hovertext=[f'<b>Topic {topic}</b><br>Words: {word}' for word in words if len(words) > 1],  # SemmyK: len(words) > 1 inserted as a safeguard
                             fill='tozeroy' if graph_type == 'fill' else None,  # SemmyK: ternary per issue #813
                             stackgroup='one' if graph_type == 'area' else None  # SemmyK: ternary per issue #813 for (stacked) area plot
                             ))

    fig.update_layout(
        ...
        hovermode='x unified',  # SemmyK: single hover label. Has a 'side effect' with a large selection.
        hoverlabel=dict(
            bgcolor="rgba(0,0,0,.05)",  # was "white"; SemmyK: adjusted transparency
            font_size=12,  # was 16; SemmyK: 12 works great for me. NB: personal preference
            font_family="Rockwell",
            bordercolor="rgba(0,0,0,0)"  # SemmyK: remove line border. Visually appealing
        ),
        ...
    )
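
One detail in the attempt above is worth spelling out: `colors` defaults to None in the signature, so `colors[index % len(colors)]` needs a fallback list before it can be indexed. Below is a minimal sketch of the modulo cycling with such a fallback. The helper name `pick_color` and the `FALLBACK_COLORS` palette are assumptions for illustration, not necessarily what BERTopic ships with.

```python
from typing import List, Optional

# Illustrative seven-colour fallback palette (an assumption for this sketch).
FALLBACK_COLORS = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
                   "#D55E00", "#0072B2", "#CC79A7"]


def pick_color(index: int, colors: Optional[List[str]] = None) -> str:
    """Return the colour for the index-th trace, wrapping around the palette."""
    # Fall back when the caller passes None or an empty list.
    palette = colors if colors else FALLBACK_COLORS
    # The modulo makes the palette repeat once there are more topics than colours.
    return palette[index % len(palette)]


print([pick_color(i, ["red", "blue"]) for i in range(5)])
print(pick_color(7))  # wraps back to the first fallback colour
```

Because the palette repeats, two topics can end up sharing a colour once there are more topics than colours, which ties in with the point about users being mindful of contrast.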

> {I'll love to see streamgraph ;-) I'll work on a hack when opportuned}.

Before you start working on that, I am not sure if a hack to create such a graph is something that fits within the stability of BERTopic. I want to prevent such hacks as much as possible as I cannot guarantee any support with respect to visualizations that are not natively supported by Plotly.

> PS: streamgraph has 'more' Sine illusion. Ironically, it aids the 'stream' flow in streamgraph!

Yep, that's what I indeed expected.

> My understanding is we use ternary operators in go.Scatter

Yes, only a few lines of code would need to change in order to implement a selection between line, area, and fill.

> While at it, we can give users 'control' on colours. Though this might come at a 'cost' with some users not mindful of 'contrast'. We might need some adjustments to the visuals (look and feel) in update_layout

I am not so sure about changing the look as of right now. Most visualizations have a similar style and changing one would often result in changing all others.

> Attempt (with comments: to clean up) #https://github.dev/semmyk-research/BERTopic/blob/ab7f3135c5166dbfeb4ba3e49b129fc93b491c86/bertopic/plotting/_topics_over_time.py#L6

I would focus on first creating minimal changes:

graph_type parameter
graph_type documentation
The suggested changes in go.Scatter
- fill = 'tozeroy' if graph_type=='fill' else None,
- stackgroup = 'one' if graph_type=='area' else None

That way, significant functionality is added with only a few lines of code, whilst staying true to the main concern: the type of graph, rather than things like colors and style, which also touch upon other visualizations.

MaartenGr / BERTopic

BERTopic: visualize_topics_over_time_area #813