JuliaAI / MLJModels.jl

Home of the MLJ model registry and tools for model queries and model code loading
MIT License

Scikitlearn clustering methods cleanup #209

Open ablaom opened 4 years ago

ablaom commented 4 years ago

I think these methods could do with a review. Here are a few things that look like issues to me:

cc: @tlienart

tlienart commented 4 years ago
* Some models do not implement a transform method (presumably because Python's scikit-learn does not) but could, no? Example: DBSCAN

Hmm, I think this requires a bit more thought. For instance, what would you suggest the transform of DBSCAN correspond to? DBSCAN doesn't have centroids etc.; you run the algorithm and you get an "answer" in the form of labels. So the only meaningful thing I can see here is to map transform to predict, which just returns those labels. But I think this is confusing and would prefer keeping the distinction. What do you think?
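To make the option concrete, here is a plain-Julia sketch (invented names, not the actual wrapper code) of what mapping transform onto predict would amount to for a labels-only model like DBSCAN:

```julia
# Sketch only: DBSCAN's fit yields labels and nothing else, so a
# transform method would have nothing to do but return what predict
# already returns.
predict_labels(labels_from_fit::Vector{Int}) = labels_from_fit

# The proposed mapping: transform is a pure alias of predict.
transform_labels(labels_from_fit::Vector{Int}) = predict_labels(labels_from_fit)

labels = [1, 1, 2, -1]          # -1 conventionally marks noise points
@assert transform_labels(labels) == predict_labels(labels)
```

The aliasing makes the redundancy, and hence the potential for confusion, explicit.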

* Recall that MLJ makes a distinction between `report` and `fitted_params`: the latter is for the learned parameters (in this case, what is needed to assign a new observation to a class), and everything else goes in `report`. It seems that in the scikit-learn clustering wrappers everything is just lumped into `fitted_params`. In particular, this has led to inconsistency with the Clustering.jl models KMeans and KMedoids (which separate things correctly, as far as I can tell).

Yes, so what I did for all sklearn models is to dump into `fitted_params` the output of the fit method of scikit-learn. I think it makes sense to stay close to scikit-learn so that people know where to find things, but granted this is probably not ideal with MLJ's internal `report`/`fitted_params` split.

I'm not very happy with this separation between `report` and `fitted_params` (sorry for bringing this back up). I understand their purpose, but in practice I'd rather only have to deal with one of them. Can I maybe suggest that `fitted_params` be part of the report? That is, that all of the fields of `fitted_params` also be returned in `report`? The big advantage is that it makes things easier to find: you can just tell people to always use `report`, while internally we use `fitted_params`.

Anyway, back on topic: (1) I need your thoughts first; (2) the refactoring is doable but will be annoying, and I'm wondering whether adding `fitted_params` to the report might not just make our lives generally easier.
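A minimal plain-Julia sketch (toy values, not MLJ's actual internals) of the split being debated, and of the proposed merge of the two:

```julia
# MLJ's convention: learned parameters (what is needed to assign a new
# observation to a cluster) go in fitted_params; training byproducts
# go in report.  Toy KMeans-style values for illustration only.
fitted_params = (centers = [0.1 2.0; 0.2 1.9],)      # learned parameters
report        = (iterations = 7, inertia = 3.4)      # everything else

# The proposal above: also expose the fitted parameters inside the
# report, so users only ever need to look in one place.
merged_report = merge(report, fitted_params)

@assert merged_report.centers === fitted_params.centers
@assert merged_report.iterations == 7
```

Since both are named tuples, the merge is a one-liner; the open question is whether the duplication helps or hurts discoverability.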

ablaom commented 4 years ago

@tlienart You are right. DBSCAN is not like KMeans clustering. I stand corrected. However, I do wonder if the scikit-learn way of conceptualising this class of clustering problems is the most useful. Might it not be better to view DBSCAN as a static transformer that transforms features X into labels y (or perhaps, labelled features (X, y))? So there is no fit method, or rather fit in our implementation would be a no-op. (Perhaps this is what you are also suggesting above?) The problem with this is that there is then no interface point for returning any other information about the transformation procedure apart from the labels. What are your thoughts?
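The "static transformer" view could be sketched like this. This is a hypothetical plain-Julia illustration (the type, fields, and the dummy labelling are all invented here, not MLJ's real `Static` API or a real DBSCAN), just to show the shape of the interface and its limitation:

```julia
# A static transformer: no learned state, so fit is a no-op and
# transform maps features straight to cluster labels.
struct StaticDBSCAN
    eps::Float64        # neighbourhood radius (hyperparameter)
    min_points::Int     # density threshold (hyperparameter)
end

fit(::StaticDBSCAN, X) = nothing   # nothing to learn

function transform(model::StaticDBSCAN, X)
    # Stand-in for the real DBSCAN computation: every point gets a
    # dummy label here, purely to illustrate the calling pattern.
    labels = zeros(Int, size(X, 1))
    return labels   # note: no channel for any extra information
end

X = rand(5, 2)
@assert fit(StaticDBSCAN(0.5, 3), X) === nothing
@assert length(transform(StaticDBSCAN(0.5, 3), X)) == 5
```

The last comment is the sticking point raised above: with transform returning only labels, there is nowhere to surface anything else about the run.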

I'm not very happy with this separation between `report` and `fitted_params` (sorry for bringing this back up). I understand their purpose, but in practice I'd rather only have to deal with one of them. Can I maybe suggest that `fitted_params` be part of the report? That is, that all of the fields of `fitted_params` also be returned in `report`?

There was quite a bit of discussion in deciding upon this design, and in my view good points were made for separating the two interface points. I think there would need to be a pretty strong argument for changing the design now. I do think the name `report` is unfortunate, as it is a bit vague; in retrospect, `extras` would have been better. Then people would be less likely to look for learned parameters there.