hurcy opened 7 years ago
It's been a while since I made this. Back then, I don't think having a fit_transform() and a score() was required. Adding fit_transform() should be trivial:

def fit_transform(self, X, y=None):
    return self.fit(X, y).transform(X)
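As a minimal sketch (toy class, hypothetical name): scikit-learn's TransformerMixin supplies exactly this fit_transform() for free once fit() and transform() exist, so the wrapper classes may not even need to define it themselves.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class IdentityTransformer(BaseEstimator, TransformerMixin):
    """Toy stand-in for LsiTransformer/LdaTransformer."""

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy example

    def transform(self, X):
        return np.asarray(X)

X = [[1.0, 2.0], [3.0, 4.0]]
# fit_transform() is inherited from TransformerMixin:
Xt = IdentityTransformer().fit_transform(X)
```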
For your error with the score() method: indeed, the last step in a pipeline or GridSearchCV must have a score() method. However, you usually don't have LDA or LSI as the last step in your pipeline, since it's often a preprocessing step for a classifier. In theory you can add a score() method to the LsiTransformer and LdaTransformer classes, but that wouldn't necessarily make sense. It's quite hard to determine the goodness of fit of LDA/LSI, since they just create topic embeddings, which aren't inherently good or bad. I would consider adding a classifier to your pipeline and using that, plus your document labels, to determine the goodness of fit of your LDA preprocessing (keeping the classifier parameters constant).
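The suggestion above can be sketched with stock scikit-learn pieces, using LatentDirichletAllocation as a stand-in for LdaTransformer (the data, parameter grid, and step names are illustrative, not from this repo):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(60, 20))  # toy term counts
y = rng.randint(0, 2, size=60)        # toy document labels

pipe = Pipeline([
    ("lda", LatentDirichletAllocation(max_iter=5, random_state=0)),
    ("clf", LogisticRegression()),    # last step supplies score()
])
# GridSearchCV scores via the classifier, so LDA needs no score() itself.
search = GridSearchCV(pipe, {"lda__n_components": [2, 4]}, cv=3)
search.fit(X, y)
best_n = search.best_params_["lda__n_components"]
```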
Feel free to comment if you have further questions!
@StevenReitsma Thanks for your answer. Now I understand why you named it LdaTransformer.
I think perplexity and topic coherence can be quantitative metrics to determine the goodness of fit. Since we need to choose the number of topics for LDA, I think a score() method could help to find the best number of topics.
Thanks for those links. It looks like you can definitely use those metrics to get an approximation of the goodness of fit, and that should be fine if your ultimate goal is good topic coherence or good perplexity. However, in a real-world use case your goal is usually not good topic coherence or perplexity but good classification or regression performance. Hence my suggestion to add a classifier to your pipeline so you can be sure of the performance on your actual problem.
But again, if you're instead working on a research problem where the goal is to have good topic coherence, perplexity, or another metric, then using those to do a GridSearch should be a perfect solution! Adding that as a score() method to the classes shouldn't be too hard, as the gensim models expose perplexity and topic coherence.
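A hedged sketch of what such a score() could look like on an LdaTransformer-style wrapper. gensim's LdaModel does expose log_perplexity(chunk) (a per-word likelihood bound, higher is better, which is what GridSearchCV maximizes); here a stub stands in for the gensim model so the sketch stays self-contained, and all class names are hypothetical.

```python
class LdaTransformerSketch:
    """Hypothetical wrapper; self.gensim_model would be a fitted LdaModel."""

    def __init__(self, gensim_model):
        self.gensim_model = gensim_model

    def score(self, X, y=None):
        # GridSearchCV maximizes score(); gensim's log_perplexity already
        # returns a higher-is-better bound, so pass it through directly.
        return self.gensim_model.log_perplexity(X)

class _StubLda:
    """Stand-in for gensim.models.LdaModel, just for this sketch."""

    def log_perplexity(self, chunk):
        return -7.5  # fixed value for demonstration only

s = LdaTransformerSketch(_StubLda()).score([[(0, 1)]])
```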
@StevenReitsma Thanks again. I'm considering your comment!
I got this error while fitting with GridSearchCV.
So I read the manual (http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator). It says some methods need to be implemented for an estimator to work with GridSearchCV.
How did you do it?
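For context on what that manual asks for, a hedged sketch of the minimal interface GridSearchCV expects from an intermediate pipeline step: get_params()/set_params() (supplied by BaseEstimator, which is why __init__ should only store its arguments), plus fit() and transform(). The class and parameter names here are illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MinimalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, scale=1.0):
        # Only store parameters here, so clone() and get_params() work.
        self.scale = scale

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) * self.scale

# GridSearchCV uses get_params()/set_params() to clone and configure steps:
params = MinimalTransformer(scale=2.0).get_params()
```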