bmabey / pyLDAvis

Python library for interactive topic model visualization. Port of the R LDAvis package.
BSD 3-Clause "New" or "Revised" License

[feature] decouple visualisation UI's topic numbering from their labels #267

Open ed9w2in6 opened 5 months ago

ed9w2in6 commented 5 months ago

We have a whole family of issues that are just about the numbering of topics during visualisation.

They can all be resolved just by decoupling the numbering from the labels, which also removes the need for the sort_topics and start_index options in the Python API.

I am not going into implementation details or a specification of outcomes here, but here are some ideas:

Outline

python API side

We currently generate topic numbers in topic_top_term_df in _prepare.py. The numbering comes from enumerate together with start_index, which the user supplies to the prepare method and which is then passed through the _topic_info method. https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L276
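As a rough illustration of the coupling described above, the sketch below shows how enumerate with a start_index turns row positions into topic names. It is a simplified stand-in and not the code in _prepare.py; the function name number_topics and the sample rows are assumptions made for this example.

```python
def number_topics(topic_rows, start_index=1):
    """Return 'TopicN' names, one per row, numbered from start_index.

    Hypothetical helper: it only mirrors the enumerate-based numbering idea.
    """
    return ["Topic%d" % new_id for new_id, _row in enumerate(topic_rows, start_index)]


# Placeholder rows standing in for the per-topic term data.
rows = ["terms of topic A", "terms of topic B", "terms of topic C"]
categories = number_topics(rows, start_index=1)
print(categories)  # ['Topic1', 'Topic2', 'Topic3']
```

The point is that the displayed name is derived purely from position, so any reordering or change of start_index changes the label as well.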

Sorting is orthogonal to this logic, so we can safely ignore it when changing this code: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L413-L416

The number generated by enumerate is ultimately used to name the topic, and the name is stored as Category: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L265

I believe we should allow the user to supply a list of strings instead.

If we change this we need to change this too: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L443-L449

and make sure none of the names is "Default", since we already use that as the default category: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L237-L242
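A check along these lines could guard a user-supplied list of names. This is a minimal sketch under stated assumptions: validate_topic_names and RESERVED_CATEGORY are hypothetical names, not part of pyLDAvis, and the uniqueness check is an extra assumption that the issue does not ask for.

```python
RESERVED_CATEGORY = "Default"  # the category name the issue says is already in use


def validate_topic_names(topic_names, n_topics):
    """Validate a user-supplied list of topic names (hypothetical helper)."""
    names = list(topic_names)
    if len(names) != n_topics:
        raise ValueError("expected %d topic names, got %d" % (n_topics, len(names)))
    if not all(isinstance(name, str) for name in names):
        raise TypeError("topic names must be strings")
    if RESERVED_CATEGORY in names:
        raise ValueError('"%s" is reserved and cannot be used as a topic name'
                         % RESERVED_CATEGORY)
    # Assumption: names must be unique so that each one identifies a single topic.
    if len(set(names)) != len(names):
        raise ValueError("topic names must be unique")
    return names


print(validate_topic_names(["sports", "politics"], 2))  # ['sports', 'politics']
```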

And that covers the topic_info data only; we would have to do the same for mdsData and token_table too. Clearly a better way is to side-step all of this: accept the desired list of names and store it in the PreparedData namedtuple.
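The side-step could look roughly like the sketch below, where the names travel next to the existing tables instead of being written into each of them. PreparedDataSketch is a simplified stand-in for PreparedData; its fields, the topic_names field, and the sample rows are assumptions for illustration only.

```python
from collections import namedtuple

# Simplified stand-in for PreparedData: only the pieces needed for the example.
PreparedDataSketch = namedtuple(
    "PreparedDataSketch",
    ["topic_info", "mds_data", "token_table", "topic_names"],
)

prepared = PreparedDataSketch(
    topic_info=[{"Category": "Topic1", "Term": "goal"}],
    mds_data=[{"topics": 1}],
    token_table=[{"Topic": 1, "Term": "goal"}],
    topic_names=["sports"],  # hypothetical new field holding the display labels
)

# The tables keep their numeric identifiers; the label is looked up by position.
topic_number = prepared.mds_data[0]["topics"]
label = prepared.topic_names[topic_number - 1]
print(label)  # sports
```

With this shape, topic_info, mdsData and token_table stay untouched and only the consumer of the label needs to change.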

Solution: side-step it on the JS visualisation side

Currently, our visualisation logic makes a hard assumption that Category has the form "TopicN", where N is a number: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/js/ldavis.js#L697-L701
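To see why this assumption makes renaming Category risky, here is a small Python illustration of the kind of lookup it implies. It does not reproduce the code in ldavis.js, which may differ in detail; rows_for_topic and the sample rows are made up for the example.

```python
rows = [
    {"Category": "Topic1", "Term": "goal"},
    {"Category": "Topic2", "Term": "vote"},
]


def rows_for_topic(rows, topic_number):
    """Select rows by rebuilding the category as 'Topic' + number."""
    wanted = "Topic%d" % topic_number
    return [row for row in rows if row["Category"] == wanted]


print(rows_for_topic(rows, 1))  # [{'Category': 'Topic1', 'Term': 'goal'}]

# Writing a custom label into Category breaks the lookup for that topic.
renamed = [
    dict(row, Category="sports") if row["Category"] == "Topic1" else row
    for row in rows
]
print(rows_for_topic(renamed, 1))  # []
```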

Therefore, again, the path of least friction is to side-step it by changing only the visualisation logic:

  1. RHS Table title https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/js/ldavis.js#L982-L987
  2. circle label https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/js/ldavis.js#L388-L393

Of these, 2 is optional. So only 3 changes in total!


Summary, changes needed

  1. new parameter for topic names
  2. store it in PreparedData
  3. change RHS Table title, optionally the circle labels too

msusol commented 5 months ago

Are you creating a matching pull request?

ed9w2in6 commented 5 months ago

@msusol Yes, still WIP though. Ideally we would clean up the code base, but I have no plans to do that. My plan, as mentioned above, is just a quick hack:

  1. add a new parameter to prepare, defaulting to None, with some logic to generate dummy topic names if it is None.
  2. store it in PreparedData
  3. change the visualisation accordingly:
    • RHS Table title
    • the circle labels too, if it looks good.
    • allow selecting a topic by topic name too, if not too difficult
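Step 1 and the select-by-name idea can be sketched as below. This is only an illustration of the plan, not the code of the pull request: resolve_topic_names, topic_number_for_name, and the "TopicN" format of the dummy names are all assumptions.

```python
def resolve_topic_names(n_topics, topic_names=None, start_index=1):
    """Return the display names, generating dummy ones when none are given."""
    if topic_names is None:
        # Assumed dummy format, chosen to match the existing 'TopicN' convention.
        return ["Topic%d" % i for i in range(start_index, start_index + n_topics)]
    names = list(topic_names)
    if len(names) != n_topics:
        raise ValueError("expected %d topic names, got %d" % (n_topics, len(names)))
    return names


def topic_number_for_name(names, name, start_index=1):
    """Map a display name back to its topic number (select by name)."""
    return names.index(name) + start_index


default_names = resolve_topic_names(3)
custom_names = resolve_topic_names(3, ["sports", "politics", "science"])
print(default_names)  # ['Topic1', 'Topic2', 'Topic3']
print(topic_number_for_name(custom_names, "politics"))  # 2
```

Because the numeric identifiers stay as they are, the visualisation would only need the returned list to render the RHS table title and, optionally, the circle labels.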