cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

Pass custom kernel to SVC #103

Open cljord opened 2 years ago

cljord commented 2 years ago

Not sure if I missed this in the docs or if this can be done with ScikitLearnBase, but in scikit-learn you can easily define a custom kernel function and pass it to the SVC at construction, like this:

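import numpy as np
from sklearn import svm

def my_kernel(X, Y):
    """
    We create a custom kernel:

                 (2  0)
    k(X, Y) = X  (    ) Y.T
                 (0  1)
    """
    M = np.array([[2, 0], [0, 1.0]])
    # Result has shape (n_samples_X, n_samples_Y): one entry per pair of rows.
    return np.dot(np.dot(X, M), Y.T)

# Create an SVM instance that uses the custom kernel.
clf = svm.SVC(kernel=my_kernel)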

(from the scikit-learn "SVM with custom kernel" example)

I haven't been able to figure out how to do this with the Julia package and haven't found anything about it in the docs either. If it isn't possible, this would be a convenient feature (if it is, sorry for opening the issue and would be very thankful if somebody could point me in the right direction).

cstjean commented 2 years ago

What happens if you call SVC(kernel=some_julia_function)?

cljord commented 2 years ago

The custom kernel I wanted to pass was just a dot product, so I used dot(X, Y) from the LinearAlgebra package (that's when I opened the issue). I tried it again just now to recreate the error, and using dot(X, Y) throws an error, but weirdly, using X * Y' works fine (I'm assuming these are equivalent, haven't found any way to confirm that except for a bit of testing on my own).

Here's a minimal program showing the error:

using DataFrames
using RDatasets: dataset
using ScikitLearn
using LinearAlgebra
using ScikitLearn.CrossValidation: train_test_split
@sk_import svm: SVC

iris = dataset("datasets", "iris")

X = convert(Array, select(iris, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]))
y = convert(Array, iris[!, :Species])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

function this_does_not_work(X, Y)
    return dot(X, Y)
end

clf = SVC(kernel=this_does_not_work)

fit!(clf, X_train, y_train)

function this_works(X, Y)
    return X * Y'
end

clf = SVC(kernel=this_works)

fit!(clf, X_train, y_train)

Using dot(X, Y') (or any other combination) didn't work either. The error I got was this:

ERROR: PyError ($(Expr(:escape, :(ccall(#= /Users/cljord/.julia/packages/PyCall/BD546/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'IndexError'>
IndexError('tuple index out of range')
  File "/Users/cljord/.julia/conda/3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 226, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "/Users/cljord/.julia/conda/3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 268, in _dense_fit
    if X.shape[0] != X.shape[1]:
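One possible reading of this traceback, offered as a sketch and not confirmed anywhere in this thread: Julia's `dot` applied to two equally sized matrices returns a single number (the sum of elementwise products), while `X * Y'` returns a matrix with one entry per pair of rows. The check `X.shape[0] != X.shape[1]` in the traceback expects a square kernel matrix, and a scalar has an empty shape tuple, which would explain `tuple index out of range`. The plain-Python sketch below imitates that situation without NumPy; the helper names `shape`, `frobenius_inner`, `gram`, and `is_square` are made up for illustration and are not part of scikit-learn or PyCall.

```python
# Plain-Python sketch (no NumPy): why a scalar-valued kernel would fail a
# square-matrix check like `X.shape[0] != X.shape[1]`.
# All helper names here are hypothetical, chosen for illustration only.

def shape(obj):
    """Return a NumPy-style shape tuple for a scalar or a nested list."""
    if not isinstance(obj, list):
        return ()                      # a scalar has an empty shape
    return (len(obj),) + shape(obj[0])

def frobenius_inner(X, Y):
    """Sum of elementwise products: one number for two same-sized matrices."""
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

def gram(X, Y):
    """Pairwise row dot products, i.e. X * Y' in Julia notation."""
    return [[sum(a * b for a, b in zip(rx, ry)) for ry in Y] for rx in X]

def is_square(K):
    """Mimic the shape check from the traceback."""
    s = shape(K)
    return s[0] == s[1]                # raises IndexError when s == ()

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 samples, 2 features

K_scalar = frobenius_inner(X, X)   # a single float
K_matrix = gram(X, X)              # a 3 x 3 nested list

print(shape(K_scalar))   # empty tuple
print(shape(K_matrix))   # one row and one column per sample
```

Calling `is_square(K_matrix)` succeeds, whereas `is_square(K_scalar)` raises `IndexError`, matching the error class in the traceback. If this reading is right, the two kernel functions are not equivalent, and converting the inputs to Julia arrays before calling `dot` would not change the outcome.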

cstjean commented 2 years ago

Looks to me like you're getting Python objects passed to your function. I would look in the PyCall documentation, but maybe dot(Array(X), Array(Y)) could work? dot does not seem to work with Python arrays.

cljord commented 2 years ago

I tried dot(Array(X), Array(Y)) and that didn't work either. I'll look into it more in the coming weeks.

I saw on another issue that ScikitLearn.jl is currently more of a gateway into the new ecosystem, but if it fits with the current vision for ScikitLearn.jl, I'd like to contribute a passage/page to the documentation about how to use a custom kernel for the SVC (as I would have appreciated it myself).

Basically a short example along the lines of the scikit-learn (Python) example I linked above, plus a note that you'll probably have to figure out how it interacts with PyCall (I might go more in-depth depending on whether I figure it out myself).

If you think it's unnecessary, we can also close the issue.

cstjean commented 2 years ago

Looking at the docs, there should probably be a short page like Relationship to PyCall, that explains how it works. That would be a good place for your example. That might be significant work, though. Hmmm. Maybe you can start it, and it can be expanded later.

Beware that ScikitLearn hasn't been super-well maintained, so making any kind of PR is a journey!

cljord commented 2 years ago

Sounds good. I'm not very familiar with PyCall so I'll have to get a bit more understanding first, but since this isn't urgent, I'll just do that soon-ish.

I was thinking that maybe two pages would be good. One would cover custom kernels, using the example above to show how easy it is (as in sklearn for Python, you just write a function and pass it as a parameter to SVC). The other would be the page you mentioned, Relationship to PyCall, linked from the first, to show that, depending on your kernel function, you'll have to use PyCall to get it working.

cstjean commented 2 years ago

I'd rather see a single Relationship to PyCall page, with a subsection on callback functions. Something like "Callback functions should just work; however, the objects you'll receive might be Python arrays, so you'll need to do XYZ." Then provide the SVC example.
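As a rough illustration of what such a callback subsection might state, here is a stdlib-only Python sketch of the contract a kernel callback is expected to meet: given `X` with n rows and `Y` with m rows, return an n-by-m matrix. It re-implements the `my_kernel` example from the top of the thread with nested lists instead of NumPy. The wrapper `checked_kernel` and the helpers `matmul` and `transpose` are hypothetical names introduced here, not part of scikit-learn, ScikitLearn.jl, or PyCall.

```python
# Sketch of the kernel-callback contract, using nested lists instead of NumPy.
# `matmul`, `transpose`, and `checked_kernel` are illustrative helpers only.

def transpose(A):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Naive matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def my_kernel(X, Y):
    """k(X, Y) = X M Y.T with M = [[2, 0], [0, 1]], as in the thread."""
    M = [[2.0, 0.0], [0.0, 1.0]]
    return matmul(matmul(X, M), transpose(Y))

def checked_kernel(kernel):
    """Wrap a kernel and verify it returns a len(X) x len(Y) matrix."""
    def wrapper(X, Y):
        K = kernel(X, Y)
        ok = (isinstance(K, list) and len(K) == len(X)
              and all(isinstance(r, list) and len(r) == len(Y) for r in K))
        if not ok:
            raise ValueError("kernel must return a len(X) x len(Y) matrix")
        return K
    return wrapper

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 samples
Y = [[1.0, 0.0], [0.0, 1.0]]               # 2 samples

K = checked_kernel(my_kernel)(X, Y)        # 3 x 2 matrix, passes the check
```

A callback that returns a single number instead of a matrix, for example `checked_kernel(lambda X, Y: 0.0)(X, Y)`, would raise `ValueError` at the wrapper, which is an earlier and clearer failure than the `IndexError` raised deep inside the fit call.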