Open ablaom opened 4 years ago
* Some models do not implement a transform method (presumably because python scikit-learn does not) but could, no? Example: DBSCAN
Hmm, I think this requires a bit more thinking. For instance, what would you suggest the transformation for DBSCAN should correspond to? DBSCAN doesn't have centroids etc.; you run the algorithm and you get an "answer" in the form of labels. So the only meaningful thing I see here would be to map `transform` to `predict`, which just gets those labels. But I think this is confusing and would prefer to keep the distinction. What do you think?
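To make the distinction concrete, here is a minimal Python sketch (hypothetical, not the actual wrapper code) contrasting a centroid-based model, where `transform` has a natural meaning (distances to the learned centroids), with a DBSCAN-style model whose only output is a label per training point:

```python
# Hypothetical sketch: why `transform` is natural for KMeans but not DBSCAN.
import math

def kmeans_transform(X, centroids):
    """For centroid-based clustering, `transform` maps each point to its
    distances from the learned centroids -- a meaningful new feature space."""
    return [[math.dist(x, c) for c in centroids] for x in X]

# DBSCAN, by contrast, learns no centroids: running the algorithm yields
# only a label per training point (-1 conventionally meaning "noise").
dbscan_result = {"labels_": [0, 0, 1, -1]}  # the whole "answer"

X = [(0.0, 0.0), (3.0, 4.0)]
centroids = [(0.0, 0.0), (3.0, 4.0)]
print(kmeans_transform(X, centroids))  # [[0.0, 5.0], [5.0, 0.0]]
```

Mapping `transform` to the stored `labels_` would work mechanically, but unlike the KMeans case it produces no new feature space for unseen data, which is the confusion flagged above.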
* Recall that MLJ makes a distinction between `report` and `fitted_params`: the latter is for the learned parameters (in this case what is needed to assign a new observation to a class), and everything else goes in `report`. It seems that in the scikit-learn clustering wrappers everything is just lumped into `fitted_params`. In particular this has led to inconsistency with the Clustering.jl models KMeans and KMedoids (which separate things correctly, as far as I can tell).
Yes, so what I did for all sklearn models is to dump into `fitted_params` the output of Sklearn's fit method. I think it makes sense to stay close to scikit-learn so that people know where to find things, but granted this is probably not ideal with MLJ's internal report/fitted_params split.
I'm not very happy with this separation between `report` and `fitted_params` (sorry for bringing this back up). I understand their purpose, but in practice I'd just want to bother with one of them. Can I maybe suggest that `fitted_params` be part of the `report`? i.e. that all of the fields of `fitted_params` are also returned in `report`? The big advantage is that it makes things easier to find: you can just tell people to always use `report`, while internally we use `fitted_params`.

Anyway, back on topic: (1) I need your thoughts first; (2) the refactoring is doable but will be annoying, and I'm wondering whether adding `fitted_params` to the `report` might not just make our life generally easier.
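One low-cost reading of that suggestion, sketched here in Python with plain dicts (hypothetical — MLJ's actual `report`/`fitted_params` are Julia structures), is to keep the two interface points internally but have the user-facing `report` also expose every `fitted_params` field:

```python
# Hypothetical sketch of the proposal: keep `fitted_params` and `report`
# separate internally, but expose a single user-facing report that also
# contains every fitted_params field.
def user_facing_report(fitted_params: dict, report: dict) -> dict:
    merged = dict(report)          # everything else: diagnostics, metrics...
    merged.update(fitted_params)   # learned parameters are findable here too
    return merged

fitted_params = {"centroids": [[0.0, 0.0], [3.0, 4.0]]}
report = {"n_iter": 7, "inertia": 12.5}
merged = user_facing_report(fitted_params, report)
print(sorted(merged))  # ['centroids', 'inertia', 'n_iter']
```

Users then only ever need `report`, while code that cares about the distinction can still consult `fitted_params` directly.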
@tlienart You are right. DBSCAN is not like KMeans clustering; I stand corrected. However, I do wonder if the scikit-learn way of conceptualising this class of clustering problems is the most useful. Might it not be better to view DBSCAN as a static transformer that transforms features `X` into labels `y` (or perhaps labelled features `(X, y)`)? So there is no `fit` method — or rather, `fit` in our implementation would be a no-op. (Perhaps this is what you are also suggesting above?) The problem with this is that there is then no interface point for returning any other information about the transformation procedure, apart from the labels. What are your thoughts?
* I'm not very happy with this separation `report` and `fitted_params` (sorry for bringing this back up), I understand their purpose but in practice I'd just want to bother with one of them. Can I maybe suggest that `fitted_params` be part of the `report`? i.e. that all of the fields of `fitted_params` are also returned in `report`?
There was quite a bit of discussion in deciding upon this design, where good points were made for separating the two interface points, in my view. I think there would need to be a pretty strong argument for changing the design now. I do think the name `report` is unfortunate, as it is a bit vague. In retrospect `extras` would have been better, so that people are less likely to look for learned parameters there.
I think these methods could do with a review. Here are a few things that look like issues to me:

* Some models do not implement a transform method (presumably because python scikit-learn does not) but could, no? Example: DBSCAN
* Recall that MLJ makes a distinction between `report` and `fitted_params`: the latter is for the learned parameters (in this case what is needed to assign a new observation to a class), and everything else goes in `report`. It seems that in the scikit-learn clustering wrappers everything is just lumped into `fitted_params`. In particular this has led to inconsistency with the Clustering.jl models KMeans and KMedoids (which separate things correctly, as far as I can tell).

cc: @tlienart