[feature] decouple visualisation UI's topic numbering from their labels

ed9w2in6 commented 5 months ago

We have a whole family of issues that are just about the numbering of topics during visualisation:

numbering confusions
- 79
- 93
- 127
- 185
- 213
rename topic feature requests
- 92
bug due to incorrect indexing
- 265 (fix at #266)

They can all be resolved just by decoupling the numbering from the labels, which also removes the need for the sort_topics and start_index options in the Python API.

I am not going into implementation details or a specification of outcomes here, but here are some ideas:

Outline

`python` API side

We currently generate topic numbers in topic_top_term_df in _prepare.py. The numbering comes from enumerate and start_index; the latter is supplied by the user through the prepare method and passed down through the _topic_info method. https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L276

Sorting is orthogonal to this logic, hence we can safely ignore it when changing this code: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L413-L416

The number generated from enumerate will ultimately be used to name the topic, stored as Category: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L265
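To make the coupling concrete, here is a simplified illustration, not the actual pyLDAvis code. The helper name `category_labels` is hypothetical; the "TopicN" format is the one the JS side expects (see further down).

```python
def category_labels(n_topics, start_index=1):
    """Hypothetical helper: build "TopicN" labels straight from the numbering.

    The label is derived from the enumerate counter, so changing
    start_index (or the topic order) silently changes every label.
    """
    return ["Topic%d" % number
            for number, _ in enumerate(range(n_topics), start=start_index)]


print(category_labels(3))                 # ['Topic1', 'Topic2', 'Topic3']
print(category_labels(3, start_index=0))  # ['Topic0', 'Topic1', 'Topic2']
```

The same topic ends up with a different label depending on start_index, which is roughly what the numbering-confusion issues are about.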

I believe we should allow the user to supply a list of strings.

If we change this we need to change this too: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L443-L449

and make sure none of them is named "Default", since we use that as the default: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/_prepare.py#L237-L242
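A minimal sketch of what such a check could look like if names were used directly as Category values. `validate_topic_names` and its `reserved` parameter are hypothetical, and the uniqueness check is my own assumption, not something the current code requires.

```python
def validate_topic_names(topic_names, n_topics, reserved=("Default",)):
    """Hypothetical check for user-supplied topic names (not part of pyLDAvis)."""
    names = list(topic_names)
    if len(names) != n_topics:
        raise ValueError(
            "expected %d topic names, got %d" % (n_topics, len(names)))
    for name in names:
        if not isinstance(name, str):
            raise TypeError("topic names must be strings, got %r" % (name,))
        if name in reserved:
            # "Default" is already used as a Category value in topic_info.
            raise ValueError("%r is reserved as the default category" % name)
    # Assumption: names should be unique so they can identify a topic.
    if len(set(names)) != len(names):
        raise ValueError("topic names must be unique")
    return names
```

For example, `validate_topic_names(["Sports", "Default"], 2)` would raise a ValueError, while `validate_topic_names(["Sports", "Politics"], 2)` returns the list unchanged.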

And that covers the topic_info data only; we would have to do the same for mdsData and token_table too. Clearly a better way is to side-step all of this: just accept the desired list of names and store it in the PreparedData namedtuple.
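Roughly what I have in mind for the side-step, as a sketch only. `PreparedDataSketch` is a hypothetical, heavily reduced stand-in (the real PreparedData has more fields), the `topic_names` field does not exist today, and the tables are plain lists of dicts here rather than DataFrames.

```python
from collections import namedtuple

# Hypothetical stand-in for PreparedData with one extra field for names.
PreparedDataSketch = namedtuple(
    "PreparedDataSketch", ["topic_info", "token_table", "topic_names"]
)


def with_topic_names(topic_info, token_table, topic_names):
    """Bundle the untouched tables with a separate list of display names."""
    return PreparedDataSketch(topic_info, token_table, list(topic_names))


data = with_topic_names(
    topic_info=[{"Term": "goal", "Category": "Topic1"}],
    token_table=[{"Term": "goal", "Topic": 1}],
    topic_names=["Sports"],
)
print(data.topic_info[0]["Category"])  # Topic1
print(data.topic_names)                # ['Sports']
```

The Category values stay in the "TopicN" form, so nothing that depends on them has to change; only the display names travel alongside.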

Solution: side-step on the JS visualisation side

Currently, our visualisation logic makes the hard assumption that Category must be of the form "TopicN", where N is a number: https://github.com/bmabey/pyLDAvis/blob/16800f36bc95b4c99d8c26d51daa3485c8cb76da/pyLDAvis/js/ldavis.js#L697-L701

Therefore, again, the path of least friction is to side-step it by changing only the visualisation logic:

Of these, 2 is optional. So only 3 changes in total!

Summary, changes needed

new parameter for topic names
store it in PreparedData
change the RHS table title, and optionally the circle labels too

msusol commented 5 months ago

Are you creating a matching pull request?

ed9w2in6 commented 5 months ago

@msusol Yes, still WIP though. Ideally, cleaning up the code base would be better, but I have no such plans. My plan is just a quick hack, as mentioned above:

add a new param to prepare, defaulting to None, with some logic to generate dummy topic names if it is None.
store it in PreparedData
change the visualisation accordingly:
- RHS table title
- the circle labels too, if it looks good.
- allow selecting a topic by topic name too, if not too difficult
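The default-name logic could look something like the sketch below. Both function names are hypothetical, the "Topic N" dummy format is just a placeholder choice, and the lookup only illustrates what select-by-name would need; the actual selection lives on the JS side.

```python
def resolve_topic_names(n_topics, topic_names=None):
    """Hypothetical: return display names, generating dummy ones if None."""
    if topic_names is None:
        # Dummy names; numbering here is purely cosmetic.
        return ["Topic %d" % i for i in range(1, n_topics + 1)]
    names = [str(name) for name in topic_names]
    if len(names) != n_topics:
        raise ValueError(
            "expected %d topic names, got %d" % (n_topics, len(names)))
    return names


def topic_number_by_name(names):
    """Hypothetical lookup from display name to 1-based topic number."""
    return {name: number for number, name in enumerate(names, start=1)}


names = resolve_topic_names(2, ["Sports", "Politics"])
print(topic_number_by_name(names)["Politics"])  # 2
print(resolve_topic_names(2))                   # ['Topic 1', 'Topic 2']
```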
