frankiethull / maize

tidymodels extension package; binding specialty kernels for support vector machines to {parsnip} and {recipes}

stringdot text classification input format #11

Open frankiethull opened 2 days ago

frankiethull commented 2 days ago

Based on the documentation, the S4 method for kernlab::ksvm with stringdot requires "list" inputs. The predictors must be supplied as a list, with the labels passed separately, instead of through a formula and data.frame.

Inputs for text classification will require additional steps to remain tidy.

These steps will be tested in the popcorn_garland branch (i.e. string kernel, pun intended).

frankiethull commented 2 days ago

https://github.com/frankiethull/maize/issues/9#issuecomment-2383189414 @simonpcouch - hollering now related to a list method :wink:

This is one of the kernels I had initially skipped. The example below works, but I'm not sure how to bind it to parsnip in a tidy way.

Using the underlying package, I have to supply the text as a list and the labels separately, instead of a data.frame and formula.


library(kernlab)

# Create the descriptions (a list of strings) and the labels (a factor)
descriptions <- list(
  "Yellow kernels on a cob",
  "Grows in tall stalks in fields",
  "Sweet vegetable with husks",
  "Golden corn ready for harvest",
  "Juicy corn kernels on the cob",
  "Corn silk hanging from the husk",
  "Rows of kernels on a green stalk",
  "Corn ears wrapped in leaves",

  "Red apple growing on a tree",
  "Green leaves on a bush",
  "Orange carrot in the ground",
  "Purple grapes on a vine",
  "Brown potato from the soil",
  "Yellow banana in a bunch",
  "Red tomato on the vine",
  "Green broccoli florets"
)

labels <- factor(c(rep("corn", 8), rep("not corn", 8)))

# Train the SVM model using ksvm with the stringdot kernel;
# stringdot goes through ksvm's S4 method for list input
svm_model <- ksvm(descriptions, labels,
                  kernel = "stringdot",
                  kpar = list(length = 4, lambda = 0.5),
                  C = 1)

The issue is that the inputs for text are not tidy: the predictors must be a list, and the labels are passed separately. This kernel doesn't seem to work with data.frames or formulas.

Are there any engines already bound to parsnip that require this type of (x, y) list input? I sifted through the source of a few engines but didn't see any. I was hoping to handle the lists in the model registration, even if this kernel only works with fit_xy. I'd appreciate any guidance before I go in the wrong direction by wrapping ksvm in another function that converts the formula and data.frame into lists for this kernel.
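For illustration, the formula-and-data.frame to list conversion described above could look roughly like the sketch below. It uses base R only. The helper name `formula_to_string_inputs()` and the `corn_df` data are hypothetical, and the sketch assumes exactly one text predictor on the right-hand side of the formula.

```r
# Hypothetical helper (not part of kernlab or parsnip): turn a formula and a
# data.frame into the list of strings plus label vector used with stringdot.
formula_to_string_inputs <- function(formula, data) {
  mf <- stats::model.frame(formula, data = data)

  # Outcome: drop row names and make sure classification gets a factor
  y <- unname(stats::model.response(mf))
  if (is.character(y)) y <- factor(y)

  # Assumption: a single text predictor on the right-hand side
  predictors <- attr(stats::terms(mf), "term.labels")
  if (length(predictors) != 1) {
    stop("Expected exactly one text predictor.", call. = FALSE)
  }

  # as.character() covers both character and factor columns
  x <- as.list(as.character(mf[[predictors]]))

  list(x = x, y = y)
}

# Illustrative data only
corn_df <- data.frame(
  description = c("Yellow kernels on a cob", "Red apple growing on a tree"),
  label = c("corn", "not corn"),
  stringsAsFactors = FALSE
)

inputs <- formula_to_string_inputs(label ~ description, corn_df)
str(inputs)
```

The resulting `inputs$x` and `inputs$y` have the same shape as the `descriptions` and `labels` objects in the example earlier in the thread.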

simonpcouch commented 1 day ago

Ahhh, hm. The interface slot of set_fit(value) is what comes to mind here, where "data.frame" might be able to handle x as a list, but parsnip wasn't designed to handle that input format and may still trip up.
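As a rough illustration of the slot mentioned here, the `value` argument of `parsnip::set_fit()` is a list with `interface`, `protect`, `func`, and `defaults` elements. The sketch below only builds that list; the function name `my_fit_wrapper` and the `"<model>"` placeholder are hypothetical, and the `set_fit()` call is left commented out because it needs a model that has already been registered.

```r
# Shape of the `value` list passed to parsnip::set_fit().
fit_value <- list(
  interface = "data.frame",               # parsnip hands x over as a data.frame, y as a vector
  protect   = c("x", "y"),                # arguments the user cannot override
  func      = c(fun = "my_fit_wrapper"),  # hypothetical fit function name
  defaults  = list()                      # no engine defaults in this sketch
)

# With a model and engine already registered, this would be supplied
# roughly as follows ("<model>" is a placeholder, not a real model name):
# parsnip::set_fit(
#   model = "<model>",
#   eng   = "kernlab",
#   mode  = "classification",
#   value = fit_value
# )

str(fit_value)
```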

What we do in some situations where fit functions don't have an interface that aligns with parsnip's expectations is write our own wrappers that take either formula + data or x + y (where x is a data.frame) and then do minimal conversions to interface with the modeling engine itself. So, in this example, you'd write a wrapper (say, k_svm()) around ksvm() that takes x as a data.frame, extracts the list of interest and passes it to ksvm(), and then register k_svm() with parsnip!
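A minimal sketch of such a wrapper, assuming x arrives as a one-column data.frame of text, might look like the following. The name `ksvm_stringdot_xy()`, its argument names, and the small data set are illustrative only and are not taken from maize or parsnip; the `ksvm()` arguments mirror the example earlier in the thread.

```r
library(kernlab)

# Hypothetical wrapper: accepts x as a data.frame with a single text column
# and y as the outcome, converts x to a list, and calls ksvm()'s list method.
ksvm_stringdot_xy <- function(x, y, substring_length = 4, lambda = 0.5,
                              C = 1, ...) {
  x <- as.data.frame(x)
  if (ncol(x) != 1) {
    stop("Expected a single text column in `x`.", call. = FALSE)
  }

  # as.character() also handles a column that was converted to a factor
  text_list <- as.list(as.character(x[[1]]))

  kernlab::ksvm(
    x = text_list,
    y = y,
    kernel = "stringdot",
    kpar = list(length = substring_length, lambda = lambda),
    C = C,
    ...
  )
}

# Illustrative data only
corn_df <- data.frame(
  description = c(
    "Yellow kernels on a cob",
    "Golden corn ready for harvest",
    "Juicy corn kernels on the cob",
    "Corn ears wrapped in leaves",
    "Red apple growing on a tree",
    "Purple grapes on a vine",
    "Brown potato from the soil",
    "Green broccoli florets"
  ),
  stringsAsFactors = FALSE
)
corn_labels <- factor(rep(c("corn", "not corn"), each = 4))

fit <- ksvm_stringdot_xy(corn_df, corn_labels)
fit
```

Keeping the conversion inside the wrapper means the registered fit function sees ordinary (x, y) inputs, so the list requirement stays hidden from the user.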

frankiethull commented 1 day ago

Thanks for the feedback @simonpcouch! Will do some testing. I was en route to the second option (a minimal wrapper), but glad you shared the lightgbm example to reference!