MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

BERTopic: visualize_topics_over_time_area #813

Open semmyk-research opened 1 year ago

semmyk-research commented 1 year ago

For my research work, I was looking for ways to present my topics_over_time in a more visually appealing way. The hover-on topics trend over time is great. How about a sort of river-flow or area map plot, so that a reader can quickly see the volume of change over time? PS: I'm open to a more Pythonic way of getting this done, and to more of BERTopic's under-the-hood hooks.

NB: To reduce choice overload and still keep with BERTopic's philosophy of being basic, we can simply implement this with an arg within the existing visualize_topics_over_time.

[My approach]

check_is_fitted(self)
return plotting.visualize_topics_over_time_area(self,


{extracts}
- To visualise, I called model_ngram_dtm_area.visualize_topics_over_time_area(... ...)
- Sample view
![image](https://user-images.githubusercontent.com/113531105/198902518-cef69e19-30fb-4851-90ec-e847f1eb9d98.png)

I'm open to a pull request.

MaartenGr commented 1 year ago

I believe this relates to a comment I saw a while ago here. My main concern with the area map visualization is that it does not handle a large number of topics well out of the box. The effect is especially strong when multiple topics differ significantly across time, which muddles the resulting visualization. The image you give, for example, is the ideal data representation, but that happens quite rarely in practice and requires careful selection of which topics to show. One way to address this is a stacked area map, but that is typically quite difficult to interpret since there is essentially a separate y-axis for each individual line.
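A minimal sketch of the "careful selection of which topics to show" mentioned above, in plain pandas. It assumes a DataFrame shaped like the output of topics_over_time, with `Topic`, `Frequency`, and `Timestamp` columns; the helper name `keep_top_topics` and the toy data are hypothetical and only for illustration.

```python
import pandas as pd


def keep_top_topics(topics_over_time: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Keep only the `top_n` topics with the largest total frequency.

    Hypothetical helper: it only assumes `Topic` and `Frequency` columns.
    """
    totals = topics_over_time.groupby("Topic")["Frequency"].sum()
    selected = totals.nlargest(top_n).index
    return topics_over_time[topics_over_time["Topic"].isin(selected)]


# Toy stand-in for a real topics-over-time DataFrame (values are made up).
toy = pd.DataFrame({
    "Topic": [0, 1, 2, 0, 1, 2],
    "Frequency": [10, 3, 1, 12, 2, 1],
    "Timestamp": pd.to_datetime(["2022-01-01"] * 3 + ["2022-02-01"] * 3),
})

# Only the two most frequent topics remain, which keeps an area plot readable.
subset = keep_top_topics(toy, top_n=2)
```

The function signature quoted later in this thread already exposes `top_n_topics` and `topics` arguments, so in practice the same narrowing can probably be done through those rather than by filtering by hand.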

semmyk-research commented 1 year ago

Thanks @MaartenGr for the response and for pointing out @pariskang's earlier suggestion (which I missed). Apologies for only getting back to you now; I've been 'away' on some research work.

I gave your concern and suggestion some thought. The current visualize_topics_over_time 'line' plot works great. No doubt about that. In the information systems domain that I work in (and to an extent in the social sciences), showing the 'picture', that is, the narrative, is often desired. That is where the area map comes in. However, as you rightly pointed out, area maps have their limitations (and flaws, such as the sine illusion).

#Pause: Regarding the extracted image I gave, it was made possible, as you might have guessed, by the ability to select the topics one is interested in, a feature that is great for narrative purposes.

Upon reflection, and keeping the aim in mind, my proposal is:

  1. Streamgraph plot (stream graph, river plot)

    • A streamgraph comes in handy for visualising the evolution of several groups (multiple variables), though it has its own limit on the number of groups it can display before things get out of hand. To mitigate this, to an extent, one must be very careful with the choice of colour blend (colour range).
    • A streamgraph would 'resolve' the 'y-axis' dilemma of the stacked area chart.
    • In any case, a streamgraph, as currently implemented, often starts off from a stacked area chart!
  2. Plotting streamgraph

    • Native support for streamgraphs is limited or not pervasive in Python (unlike R and some others).
    • Most of the (few) streamgraph implementations in Python are done using Matplotlib.
  3. Streamgraph with Matplotlib

    Typically:
    a. stack plot
    b. baseline centering, leveraging groupby. See, for instance, Thiago's approach.
    c. one can do some NumPy reshaping and smoothing

    NB: Fortunately, Matplotlib has recognised the value of streamgraphs and provided a 'native' example, as seen here, which adjusts the baseline and smooths with a Gaussian.

    • A Python walkthrough by @holtzy on leveraging the baseline (for the axis) and smoothing (Gaussian, grid, colour blend) in Matplotlib's streamgraph.

    • [updated] Oops, I omitted using Python's Altair for streamgraph. See Cole Hagen's writeup here.

  4. Streamgraph with Plotly

    • Fortunately, @empet has attempted a Python-based streamgraph plot in Plotly.
    • As I understand it, the trick is in the trace 'name', 'type', 'shape', and the layout.
  5. Proposed BERTopic approach

    I'd recommend the following:

    5.1. [limit choice overload] We extend visualize_topics_over_time with an arg option of type: str = None. The accepted values will be None (the default, where None masquerades as 'line'), type='area' (for an area plot), and type='stream' (for a streamgraph). We leverage @empet's Plotly attempt for the streamgraph.

    5.2. If choice overload is not a concern here, we keep visualize_topics_over_time as-is (for the line plot) and implement another BERTopic class method, visualize_topics_over_time_stream().

    • The default here will be streamgraph, with the option of an area map plot.
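To make the Matplotlib route in point 3 concrete, here is a rough sketch of a stack plot with a centred ('wiggle') baseline and Gaussian smoothing. It is not taken from any of the linked write-ups; the topic frequencies are randomly generated, and `gaussian_smooth` is a hypothetical helper written only for this illustration.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def gaussian_smooth(x, y, grid, sd):
    """Resample y(x) onto `grid` using normalized Gaussian weights."""
    weights = np.exp(-((grid[:, None] - x[None, :]) ** 2) / (2 * sd ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ y


rng = np.random.default_rng(0)
x = np.arange(12)  # hypothetical time bins
freqs = rng.integers(1, 20, size=(4, 12)).astype(float)  # 4 made-up topics

# Smooth each topic's series onto a finer grid to get the flowing look.
grid = np.linspace(x.min(), x.max(), 200)
smoothed = np.array([gaussian_smooth(x, row, grid, sd=0.8) for row in freqs])

fig, ax = plt.subplots(figsize=(10, 4))
# baseline="wiggle" shifts the stack around a moving centre line instead of y=0.
polys = ax.stackplot(grid, smoothed, baseline="wiggle",
                     labels=[f"Topic {i}" for i in range(len(smoothed))])
ax.legend(loc="upper left")
```

Switching `baseline` to `"zero"` gives the ordinary stacked area chart, which shows how closely the two plot types are related.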

I should be able to put together some code for further engagement and pull request.

MaartenGr commented 1 year ago

Thank you for taking the time to write this out and doing the research!

With respect to the streamgraph, I have a few similar concerns. They mostly relate to the interpretability of the y-axis itself. Because a streamgraph has, in effect, no y-axis, interpreting lines on an individual level becomes quite difficult. There is the same problem with how busy the graphs can be. Also, isn't the size illusion even more pronounced in a streamgraph since it has more sine-like structures?

Having said that, I think your suggestion for having a parameter that changes the basic structure of the visualization is a nice way of making minimal API changes whilst still giving users the option to go for the type of graph that suits their needs best. Something like graph_type: str = "line" would be a nice implementation. In order to keep changes and upkeep of code minimal, I do propose only doing this for graph_type="fill" and graph_type="area" instead of the streamgraph since it is not natively supported by Plotly. As such, the changes would be rather minimal.

This would mean changing only three lines of code, plus some small documentation. What do you think?

semmyk-research commented 1 year ago

Thanks @MaartenGr. Love your suggestion/approach. {I'd love to see streamgraph ;-) I'll work on a hack when the opportunity arises}. I understand the need for #minimal API changes, so as not to be left behind with unsupported features!
PS: streamgraph has 'more' sine illusion. Ironically, it aids the 'stream' flow in streamgraph!

My understanding is we use ternary operators in go.Scatter
We might need some adjustments to the visual (look and feel) in update_layout
While at it, we can give users 'control' over colours. Though this might come at a 'cost', with some users not being mindful of 'contrast'.

Attempt (with comments: to clean up) #https://github.dev/semmyk-research/BERTopic/blob/ab7f3135c5166dbfeb4ba3e49b129fc93b491c86/bertopic/plotting/_topics_over_time.py#L6

from typing import List

import pandas as pd
import plotly.graph_objects as go


def visualize_topics_over_time(topic_model,
                               topics_over_time: pd.DataFrame,
                               top_n_topics: int = None,
                               topics: List[int] = None,
                               normalize_frequency: bool = False,
                               custom_labels: bool = False,
                               width: int = 1250,
                               height: int = 450,
                               graph_type: str = "line",
                               colors: List[str] = None) -> go.Figure:
    """ Visualize topics over time

    Arguments:
        ... ...
        #SemmyK: 17Nov22
        graph_type: The type of graph to visualise. The options are:
                        'line' for default line plot
                        'fill' for filling up (selected) topics to y=0
                        'area' for (stacked) area chart
        colors: List of hex strings or named CSS colors
    """
    ...
        fig.add_trace(go.Scatter(x=trace_data.Timestamp, y=y,
                                 mode='lines',
                                 # SemmyK: cycle through however many colors are given (was: colors[index % 7])
                                 marker_color=colors[index % len(colors)],
                                 hoverinfo="text",
                                 name=topic_name,
                                 # SemmyK: `if len(words) > 1` inserted as a safeguard
                                 hovertext=[f'<b>Topic {topic}</b><br>Words: {word}' for word in words if len(words) > 1],
                                 # SemmyK: ternary per issue #813
                                 fill='tozeroy' if graph_type == 'fill' else None,
                                 # SemmyK: ternary per issue #813 for (stacked) area plot
                                 stackgroup='one' if graph_type == 'area' else None
                                 ))
    fig.update_layout(
        ...
        hovermode='x unified',             # SemmyK: single hover label. Has a 'side effect' with large selections.
        hoverlabel=dict(
            bgcolor="rgba(0,0,0,.05)",     # SemmyK: was "white"; adjusted transparency
            font_size=12,                  # SemmyK: was 16; 12 works great for me. NB: personal preference
            font_family="Rockwell",
            bordercolor="rgba(0,0,0,0)"    # SemmyK: remove line border. Visually appealing
        ),
        ...
    )

MaartenGr commented 1 year ago

> {I'd love to see streamgraph ;-) I'll work on a hack when the opportunity arises}.

Before you start working on that, I am not sure if a hack to create such a graph is something that fits within the stability of BERTopic. I want to prevent such hacks as much as possible as I cannot guarantee any support with respect to visualizations that are not natively supported by Plotly.

> PS: streamgraph has 'more' sine illusion. Ironically, it aids the 'stream' flow in streamgraph!

Yep, that's what I indeed expected.

> My understanding is we use ternary operators in go.Scatter

Yes, only a few lines of code would need to be changed in order to implement a selection between line, area, and fill.

> While at it, we can give users 'control' over colours. Though this might come at a 'cost', with some users not being mindful of 'contrast'. We might need some adjustments to the visual (look and feel) in update_layout

I am not so sure about changing the look as of right now. Most visualizations have a similar style and changing one would often result in changing all others.

> Attempt (with comments: to clean up) #https://github.dev/semmyk-research/BERTopic/blob/ab7f3135c5166dbfeb4ba3e49b129fc93b491c86/bertopic/plotting/_topics_over_time.py#L6

I would focus on first creating minimal changes:

That way, significant functionality is added with only a few lines of code whilst keeping true to the main concern: the types of graph, rather than things like colors and style, which also touch upon other visualizations.