Closed flip111 closed 4 years ago
Hello,
`0`/`L1`/`L2`/`EN` refer to the penalty used: `0` is "just" logistic regression, `L1` is a sparsity penalty (l1 norm; read up on Lasso regression, sparsity penalties, etc.), `L2` is a stability penalty (l2 norm; read up on Tikhonov regularization, etc.), and `EN` is elastic net, which is a combination of `L1` and `L2`. If you don't know what to do, use `L2`.
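To make the distinction concrete, here is a minimal sketch in plain Julia (no packages; the helper names are mine, not part of MLJLinearModels) of the penalty term each option adds to the logistic loss:

```julia
# Penalty terms added to the logistic loss for a coefficient vector θ.
# These helper names are illustrative only, not MLJLinearModels API.

l1_penalty(θ, λ) = λ * sum(abs, θ)        # "L1": promotes sparsity (Lasso)
l2_penalty(θ, λ) = λ * sum(abs2, θ) / 2   # "L2": promotes stability (ridge / Tikhonov)

# "EN" (elastic net) mixes the two, with separate factors for each part:
en_penalty(θ, λ, γ) = l2_penalty(θ, λ) + l1_penalty(θ, γ)

θ = [0.0, 3.0, -4.0]
l1_penalty(θ, 1.0)       # 7.0
l2_penalty(θ, 1.0)       # 12.5
en_penalty(θ, 1.0, 1.0)  # 19.5
```

The `0` option corresponds to dropping the penalty term entirely and minimizing the logistic loss alone.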
You may benefit from using MLJ if you're fairly new to ML and/or Julia; see for instance this tutorial: https://alan-turing-institute.github.io/MLJTutorials/pub/isl/lab-4.html#logistic_regression
Otherwise here's a quick example if you only have floating point stuff:

```julia
using MLJLinearModels
λ  = 5.0
lr = LogisticRegression(λ)  # by default, applies an l2 penalty with factor lambda
sol = fit(lr, X, y)
```
assuming `X` is an `n x p` matrix with floating point entries and `y` is a vector of length `n` with entries ±1.
Let me know if you got that working! You can look at the tests for more examples, but I understand that can be a bit daunting; I very much need to work on getting some docs going...
Those tutorials look great, I will give them a try.
Do you accept a PR about `0`/`L1`/`L2`/`EN` and a link to the tutorials in the readme?
Do I understand correctly that the outcome (yes or no) needs to be in the form -1 and 1, and not 0 and 1?
I have a feeling that to get the same logistic regression as in sklearn.linear_model.LogisticRegression I should use `0`, "just logistic regression". Why do you advise `L2`?
I don't understand what the lambda is for in your example; in scikit there is no need to pass it.
I don't know if you are familiar with GLM.jl but if you are could you say anything about how this library compares to that one?
Thank you for answering :)
`-1, 1` if you use the barebones package; if you use MLJ, it can be whatever you want. If you have a vector of 0, 1, just do `ynew = 2y .- 1`.
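In case the one-liner is unclear, here is the conversion spelled out on a small vector (plain Julia, no packages):

```julia
# Convert 0/1 labels to the -1/+1 encoding the barebones API expects.
y01  = [0, 1, 1, 0, 1]
ynew = 2y01 .- 1
@assert ynew == [-1, 1, 1, -1, 1]

# And back to 0/1, if you ever need it:
@assert (ynew .+ 1) .÷ 2 == y01
```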
In scikit, the lambda corresponds to the `C` parameter. L2 prevents numerical instabilities which would otherwise be quite common with logreg.

Main differences with GLM: this package is geared for performance and flexibility, allowing multiple solvers, multiple penalties, etc.; for entry-level users with small data this is more or less irrelevant. GLM supports things that are useful from a statistical perspective of model fitting, such as "significance" of coefficients; MLJLM doesn't do that (and will not, this is deliberate).
As far as I know GLM does not support multiclass logreg. GLM supports some regression models that are not (currently) supported here, for instance count regression.
PS: forgive me if I'm incorrect, but you seem quite new to ML and Julia; if that's the case, I recommend you look at the MLJ tutorials in order. First start with the data stuff, then the "Introduction to Stat Learning" labs (which you can do alongside having a look at the book). Help improving those tutorials for entry-level users would be very welcome (feel free to open issues on the mlj tutorials repository and I'll help you there too).
Hi @tlienart, thanks for your answers :) You are right that I'm new to ML and Julia. At the moment I have the following code:
```julia
import Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")

using CSV
using DataFrames

cols = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
train_df = DataFrame(CSV.File("../../data/train.csv"))
train_df = train_df[:, filter(x -> string(x) in cols, names(train_df))]
train_df = dropmissing(train_df)

X_train = train_df[:, filter(x -> x != :Survived, names(train_df))]
Y_train = train_df[:, :Survived]   # was train_df[:Survived], deprecated indexing

categorical!(train_df, :Sex)       # was `df`, which is undefined here
categorical!(train_df, :Embarked)
```
This is from the titanic dataset on kaggle.
The DataFrames manual states the following:

> Using categorical arrays is important for working with the GLM package. When fitting regression models, CategoricalVector columns in the input are translated into 0/1 indicator columns in the ModelMatrix, with one column for each of the levels of the CategoricalVector. This allows one to analyze categorical data efficiently.
In scikit this can be done with the OneHotEncoder, where the parameter `drop` can be set to `"first"` so that the dummy variable trap is avoided.
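For intuition, here is a hand-rolled sketch of drop-first one-hot encoding in plain Julia (the function name is mine; in practice you would use MLJ's OneHotEncoder or DataFrames' categorical handling rather than this):

```julia
# One-hot encode a vector of categories, dropping the first level
# (the dropped level becomes the baseline, avoiding the dummy variable trap).
# Illustrative only, not library API.
function onehot_dropfirst(v::AbstractVector)
    levels = sort(unique(v))
    kept   = levels[2:end]                       # drop the first level
    M = [x == lvl ? 1 : 0 for x in v, lvl in kept]
    return kept, M
end

sex = ["male", "female", "male", "male"]
kept, M = onehot_dropfirst(sex)
# kept == ["male"]   ("female" is the dropped baseline)
# vec(M) == [1, 0, 1, 1]  -- a single 0/1 indicator column
```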
Do you have any plans to include some encoders and integration with DataFrames? At the moment it seems a bit rough for me because I'm new to Julia as well.
By the way, I checked scikit and found the `L2` there. I was trying to find the default value of lambda, but I suppose it's called differently in the scikit code.
It's called `C`, which is roughly 1/lambda (roughly because they scale it by dataset size, if I'm not mistaken).
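A tiny sketch of that correspondence (hedged: the exact convention may differ by a factor of the sample size `n`, so treat these helpers, which are mine, as approximate):

```julia
# Rough correspondence between scikit-learn's C and this package's λ.
# scikit minimizes roughly  C * loss + ‖θ‖²/2 , so a larger C means
# weaker regularization, i.e. C behaves like 1/λ (up to dataset-size scaling).
c_to_lambda(C; n=1) = 1 / (C * n)
lambda_to_c(λ; n=1) = 1 / (λ * n)

# scikit's default C = 1.0 with, say, n = 100 samples would correspond
# roughly to:
c_to_lambda(1.0; n=100)  # 0.01
```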
All of the data pre-processing is meant to be handled by MLJ; this includes imputation of missing values, one-hot encoding, etc. Please go through the tutorials on MLJTutorials, then possibly have a look at the end-to-end examples (also there).
Note that if you have no background in Stats/ML, you should probably start with one of the standard books like Intro to Stats Learning. Starting with the code is probably not the right way to go.
PS: your input is very welcome on MLJTutorials, as we need to help users like you get started there. The repo here is not really appropriate, however; hope that makes sense!

I'll close the issue here, but feel free to re-open issues if you have specific problems, or ask questions on MLJTutorials.
Hey, looks like a nice library. I would appreciate an example of a logistic regression. Also I was wondering what `0`/`L2` and `L1`/`EN` mean; I don't know which one to choose. Thank you