chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io
Other
2.21k stars 249 forks source link

Flexible setting for RC_PARAMS and the ability to incorporate other topic models #241

Closed zf109 closed 5 years ago

zf109 commented 5 years ago

context

First of all, thanks very much for the great tool, absolutely love the termite plot.

Currently I am using this for Japanese topic modelling and using GuidedLDA, which allows you to seed topic words and conforms with sklearn's interface (most of it). So it leads to 2 feature requests (currently I just hacked it :p ).

Enable setting RC_PARAMS within textacy/viz/termite.py

when try to display Japanese characters, the term label were not displayed correctly. To solve this, we need to set

plt.rcParams['font.family'] = <one of installed font supports japanese>
# in my case plt.rcParams['font.family'] = 'Arial Unicode MS'

but currently it is not possible with termite plot, because the line in textacy/viz/termite.py

with plt.rc_context(RC_PARAMS):
    ...

will reset the rc_context and the parameter RC_PARAMS is hardcoded in, and I have to get in and change 'font.family': ['Arial Unicode MS'],.

proposed solution

It would be good to enable setting the RC_PARAMS either through environment variables or just allow it to be passed in in some way. If it is required that the style is kept the same for termite plot, then would be good at least enable setting the font so that other languages can be displayed correctly.

Allows the TopicModel class to be constructed with other topic models.

As mentioned above I'm using GuidedLDA, it mostly conform with the interface for sklearn models. (has components_, fit, transform, n_topics etc.) but currently due to the construction method in TopicModel, it is not possible:

    def __init__(self, model, n_topics=10, **kwargs):
        if isinstance(model, (NMF, LatentDirichletAllocation, TruncatedSVD)): # <= this line enforce class type instead of allowing duck typing
            self.model = model
        else:
            self.init_model(model, n_topics=n_topics, **kwargs)

proposed solution

maybe just change the constructor of TopicModel to:

    def __init__(self, model, n_topics=10, **kwargs):
        if isinstance(model, (NMF, LatentDirichletAllocation, TruncatedSVD)):
            self.model = model
        elif all([hasattr(model, required_attr) for required_attr in {"components_", "fit", "n_topics"}]):
            self.model = model
        else:
            self.init_model(model, n_topics=n_topics, **kwargs)

and of course, change the error message when things go wrong :p

Hope the above makes sense. Again, thanks a lot for the great work, really likes the termite plot!

Let me know what do you think!

bdewilde commented 5 years ago

Hey @zf109 , thanks for the feature request(s)! These parts of the code base have been neglected for a while, and clearly they need some work. I can't promise anything in the immediate future, but your two changes are on my radar.

If you're comfortable, please feel free to submit a PR! You can find some helpful documentation here. Otherwise, I'll keep you posted!

zf109 commented 5 years ago

@bdewilde hey, sorry for the late reply, was a bit hectic in the work for last few weeks. Yeah I can try to do a quick PR.

Thanks very much.