Closed: KeishiS closed this issue 1 month ago
Thanks @KeishiS for the positive feedback and for posting.
I'm afraid that when MLJTuning (or `evaluate!`) resamples, it has no way of knowing it is supposed to also apply the resampling to some hyperparameter.
It looks like you may have better luck with the LIBSVM version of the model (also provided with an MLJ interface). In that case you can pass a kernel function rather than an explicit matrix, which won't suffer from this issue, right? Would this suit your purpose?
For the record, it is theoretically possible to fix the sk-learn API. The proper interface point for "metadata" that needs to be resampled is to pass it along with the data. So, a corrected workflow would look something like:

```julia
mach = machine(SVC(), X, y, kernel)
evaluate!(mach, resampling=...)
```
To implement this would require also adding a "data front end" to the MLJ interface, to articulate exactly how the resampling is to be done, because the default resampling of arrays (just resample the rows) doesn't work in this case.
Unfortunately, the MLJ sk-learn interfaces are created with a lot of metaprogramming and are therefore difficult to customise. So a fix here would be complicated.
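To illustrate why the default row-only resampling breaks here, a minimal sketch in Python against scikit-learn's own `SVC` (the model underlying this interface; the toy data and variable names are illustrative, not taken from the issue):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = np.array([0, 1] * 5)
K = X @ X.T  # full precomputed Gram matrix, shape (10, 10)

train = np.arange(6)
clf = SVC(kernel="precomputed")

# Resampling only the rows yields a (6, 10) matrix; a precomputed-kernel
# SVC rejects it, because fit expects a square (n_train, n_train) matrix.
rejected = False
try:
    clf.fit(K[train, :], y[train])
except ValueError:
    rejected = True
print("non-square slice rejected:", rejected)

# The correct training slice restricts the columns to train as well:
clf.fit(K[np.ix_(train, train)], y[train])
print("square train-by-train slice fits fine")
```

This is why resampling the rows alone cannot work for this model: the columns must be resampled in lockstep with the training rows.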
cc @tylerjthomas9
Thank you for your reply! :smile:
I wasn't familiar with the concept of a "data front end", so I'll take some time to study the information at the link you provided.
While the example code creates a Gram matrix from simple toy data, I'm currently considering using a graph kernel, where processing multiple graphs in parallel would be more efficient. That's why I was hoping to use it as a precomputed kernel if possible. I appreciate your suggestion of the LIBSVM version; I'll try it.
Based on the information you've provided, I'll think about whether there might be a good alternative approach. For now, I'll close this issue. Thank you very much for taking the time to address my concerns.
First of all, thank you for the great work you're doing in maintaining this project. I encountered what seems to be a bug when attempting to use a support vector classifier with a precomputed Gram matrix while performing hyperparameter tuning using `TunedModel`. I would like to submit a pull request to address the issue, but I'm unsure which part of the codebase needs modification. Any advice would be greatly appreciated.

Describe the bug

When performing parameter search with `TunedModel` on an SVM with a precomputed kernel, the data splitting is not carried out properly.
To Reproduce
Expected behavior
During the process of searching for the best params, the Gram matrix `gmat` is divided into training data and test data. We expect `gmat[train_idx, train_idx]` and `gmat[test_idx, train_idx]` to be created. However, the current code splits it into `gmat[train_idx, :]` and `gmat[test_idx, :]`. This operation is executed in the `fit_and_extract_on_fold` function in `MLJBase.jl/src/resampling.jl`.

Versions
I would be grateful for any advice on how to approach solving this issue. Thank you for taking the time to read and consider this matter!
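For reference, the slicing expected above matches scikit-learn's documented contract for `kernel="precomputed"`: fit on the train-by-train block, predict with the test-by-train block. A minimal Python sketch with illustrative toy data (not taken from the issue):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(12, 4))
y = np.array([0, 1] * 6)
gmat = X @ X.T  # precomputed Gram matrix over all 12 samples

train_idx = np.arange(8)
test_idx = np.arange(8, 12)

clf = SVC(kernel="precomputed")
# fit on the square train-by-train block: gmat[train_idx, train_idx]
clf.fit(gmat[np.ix_(train_idx, train_idx)], y[train_idx])
# predict with the rectangular test-by-train block: gmat[test_idx, train_idx]
preds = clf.predict(gmat[np.ix_(test_idx, train_idx)])
print(preds.shape)  # one prediction per test sample
```

The column index set must equal the training index set in both calls, which is exactly what the current row-only resampling fails to provide.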