StevenReitsma / gensim-sklearn-wrapper

A scikit-learn wrapper for the gensim package for easy usage through scikit-learn's Pipeline and GridSearchCV classes.
MIT License

How to use with gridsearch? #1

Open · hurcy opened 7 years ago

hurcy commented 7 years ago

I got this error while fitting with GridSearchCV.

If no scoring is specified, the estimator passed should have a 'score' method. The estimator LdaTransformer(alpha='symmetric', chunksize=2000, decay=0.5, distributed=False, eta=None, eval_every=10, gamma_threshold=0.001, iterations=50, n_latent_topics=100, passes=1, update_every=1) does not.

So I read the manual (http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator). It says some methods must be implemented for an estimator to work with GridSearchCV.

How did you do it?

StevenReitsma commented 7 years ago

It's been a while since I made this. Back then, I don't think a fit_transform() or a score() method was required. Adding fit_transform() should be trivial:

def fit_transform(self, X, y=None):
    # Fit on X, then return the transformed X; this matches the behavior
    # scikit-learn's TransformerMixin would otherwise provide for free.
    return self.fit(X, y).transform(X)

For your error with the score() method: indeed, the final step of a pipeline passed to GridSearchCV must have a score() method. However, you usually don't have LDA or LSI as the last step in your pipeline, since it's often a preprocessing step for a classifier. In theory you could add a score() method to the LsiTransformer and LdaTransformer classes, but that wouldn't necessarily make sense: it's quite hard to determine the goodness of fit of LDA/LSI, since they just create topic embeddings, which aren't inherently good or bad. I would instead add a classifier to your pipeline and use that, plus your document labels, to determine the goodness of fit of your LDA preprocessing (keeping the classifier parameters constant).
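Something like this could work as a starting point (untested sketch; LdaTransformer's import path, the documents X, the labels y, the choice of LogisticRegression, and the candidate topic counts are all placeholders):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# from ... import LdaTransformer  # import path depends on how you installed this package

# The topic embeddings feed a classifier, and GridSearchCV scores the
# whole pipeline with the classifier's score(), so the transformer
# itself doesn't need one.
pipeline = Pipeline([
    ("lda", LdaTransformer()),
    ("clf", LogisticRegression()),  # classifier parameters kept constant
])

# n_latent_topics is the parameter name from your error message;
# the candidate values here are arbitrary.
param_grid = {"lda__n_latent_topics": [50, 100, 200]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)  # X: your documents, y: your document labels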

Feel free to comment if you have further questions!

hurcy commented 7 years ago

@StevenReitsma Thanks for your answer. Now I understand why you named it LdaTransformer.

I think perplexity and topic coherence can serve as quantitative metrics for goodness of fit. Since we need to choose the number of topics for LDA, a score() function could help pick the best one.

StevenReitsma commented 7 years ago

Thanks for those links. It looks like you can definitely use those metrics to approximate the goodness of fit, and that's fine if your ultimate goal is good topic coherence or good perplexity. However, in a real-world use case your goal is usually not good topic coherence or perplexity but good classification or regression performance, hence my suggestion to add a classifier to your pipeline so you measure performance on your actual problem.

But again, if you're instead working on a research problem where the goal is good topic coherence, good perplexity, or another intrinsic metric, then using those in a grid search should be a perfect solution! Adding that as a score() method to the classes shouldn't be too hard, since the gensim models expose perplexity and topic coherence.
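As a rough sketch for the perplexity variant (untested; it assumes the fitted gensim LdaModel is stored on the transformer as self.model_, which is a guess at the attribute name, and that X is already a bag-of-words corpus):

def score(self, X, y=None):
    # gensim's LdaModel.log_perplexity() returns a per-word variational
    # likelihood bound; higher is better, which matches GridSearchCV's
    # convention of maximizing the score.
    return self.model_.log_perplexity(X)

Topic coherence could be wired up similarly via gensim's CoherenceModel, though that needs the tokenized texts and a dictionary rather than just the bag-of-words corpus.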

hurcy commented 7 years ago

@StevenReitsma Thanks again, I'll take your comments into consideration!