JuliaAI / MLJLinearModels.jl

Generalized Linear Regression Models (penalized regressions, robust regressions, ...)
MIT License

Example of logistic regression #46

Closed · flip111 closed this 4 years ago

flip111 commented 4 years ago

Hey, looks like a nice library. I would appreciate an example of a logistic regression. Also, I was wondering what 0/L2 and L1/EN mean; I don't know which one to choose. Thank you

tlienart commented 4 years ago

Hello,

Those refer to the penalty applied to the coefficients: 0 means no penalty, L2 is the ridge penalty, L1 the lasso penalty, and EN the elastic net (a combination of L1 and L2). If you don't know what to do, use L2.

You may benefit from using MLJ if you're fairly new to ML and/or Julia; see for instance this tutorial: https://alan-turing-institute.github.io/MLJTutorials/pub/isl/lab-4.html#logistic_regression

Otherwise here's a quick example if you only have floating point stuff:

using MLJLinearModels

λ = 5.0
lr = LogisticRegression(λ)  # by default, applies an L2 penalty with factor λ
sol = fit(lr, X, y)         # returns the fitted coefficient vector

assuming X is an n × p matrix of floating-point entries and y is a length-n vector with entries ±1.
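
For completeness, here's how the 0/L1/L2/EN options from the README map onto the constructor. This is a sketch based on my reading of the current constructor (λ and γ as positional arguments, `penalty` as keyword), so double-check against your installed version:

using MLJLinearModels

lr_plain = LogisticRegression(penalty=:none)          # "0": no penalty, plain logistic regression
lr_l2    = LogisticRegression(5.0)                    # "L2": ridge penalty with factor λ (the default)
lr_l1    = LogisticRegression(5.0, penalty=:l1)       # "L1": lasso penalty
lr_en    = LogisticRegression(5.0, 1.0, penalty=:en)  # "EN": elastic net, L2 with factor λ plus L1 with factor γ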

Let me know if you get that working! You can look at the tests for more examples, though I understand that can be a bit daunting; I very much need to work on getting some docs going...

flip111 commented 4 years ago

Those tutorials look great, I will give them a try.

Would you accept a PR adding an explanation of 0/L1/L2/EN and a link to the tutorials to the README?

Do I understand correctly that the outcome (yes or no) needs to be in the form -1 and 1, and not 0 and 1?

I have a feeling that to get the same logistic regression as sklearn.linear_model.LogisticRegression I should use 0, i.e. "just logistic regression". Why do you advise L2?

I don't understand what the lambda is for in your example; in scikit-learn there is no need to pass it.

I don't know if you are familiar with GLM.jl, but if you are, could you say anything about how this library compares to that one?

Thank you for answering :)

tlienart commented 4 years ago

Main differences with GLM: this package is geared towards performance and flexibility, allowing multiple solvers, multiple penalties, etc. For entry-level users with small data this is more or less irrelevant. GLM supports things that are useful from a statistical perspective on model fitting, such as "significance" of coefficients; MLJLM doesn't do that (and will not, this is deliberate).

As far as I know GLM does not support multiclass logreg. GLM supports some regression models that are not (currently) supported here, for instance count regression.
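
In case it's useful, a quick sketch of the multiclass case; this assumes `MultinomialRegression` behaves as I describe (y as integer class labels, a flat coefficient vector back), so check the tests for the exact conventions:

using MLJLinearModels

λ = 1.0
mnr = MultinomialRegression(λ)  # multiclass logistic regression, L2 penalty by default
θ = fit(mnr, X, y)              # y with integer entries in 1..c; θ has one block of coefficients per class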

PS: forgive me if I'm incorrect, but you seem quite new to ML and Julia. If that's the case, I recommend you go through the MLJ tutorials in order: first the data-specific ones, then the "Introduction to Statistical Learning" labs (which you can do alongside the book). Help improving those tutorials for entry-level users would be very welcome (feel free to open issues on the MLJTutorials repository and I'll help you there too).

flip111 commented 4 years ago

Hi @tlienart, thanks for your answers :) You are right that I'm new to ML and Julia. At the moment I have the following code:

import Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")

using CSV
using DataFrames

cols = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

train_df = DataFrame(CSV.File("../../data/train.csv"))
train_df = train_df[:, filter(x -> string(x) in cols, names(train_df))]
train_df = dropmissing(train_df)

# mark the string columns as categorical before splitting off the target
categorical!(train_df, :Sex)
categorical!(train_df, :Embarked)

X_train = train_df[:, filter(x -> string(x) != "Survived", names(train_df))]
Y_train = train_df[!, :Survived]

This is from the titanic dataset on kaggle.

The DataFrames manual states the following:

Using categorical arrays is important for working with the GLM package. When fitting regression models, CategoricalVector columns in the input are translated into 0/1 indicator columns in the ModelMatrix with one column for each of the levels of the CategoricalVector. This allows one to analyze categorical data efficiently.

In scikit-learn this can be done with the OneHotEncoder, where the parameter drop can be set to "first" so that the dummy variable trap is avoided.

Do you have any plans to include some encoders and integration with DataFrames? At the moment things seem a bit rough to me, because I'm new to Julia as well.

By the way, I checked scikit-learn and found the L2 penalty back there. I was trying to find the default value of lambda, but I suppose it's called something different in the scikit-learn code.

tlienart commented 4 years ago

It's called C, which is roughly 1/lambda (roughly, because they scale it by the dataset size if I'm not mistaken).
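
For the record, here's a rough sketch of the correspondence (up to constant factors; the exact objectives differ between the two libraries, so treat this as an approximation):

using MLJLinearModels

# scikit-learn minimises (roughly)  (1/2)‖θ‖² + C · Σᵢ logloss(yᵢ, xᵢᵀθ),
# i.e. Σᵢ logloss + (1/(2C))‖θ‖², so λ ≈ 1/C up to a constant factor
C = 1.0                     # scikit-learn's default
λ = 1 / C                   # approximate equivalent penalty factor here
lr = LogisticRegression(λ)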

All of the data pre-processing is meant to be handled by MLJ; this includes imputation of missing values, one-hot encoding, etc. (see the sketch after the list below). Please go through

  1. the few data-specific tutorials
  2. the ISL tutorials

on MLJTutorials; then possibly have a look at the end-to-end examples (also there).
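
For the one-hot encoding specifically, here's a minimal sketch using MLJ's built-in transformer (`OneHotEncoder` and its `drop_last` option are my reading of the current MLJ API, so double-check against the docs for your version):

using MLJ

hot  = OneHotEncoder(drop_last=true)     # drop_last plays the role of scikit-learn's drop="first"
mach = machine(hot, X_train)
fit!(mach)                               # learns the levels of the categorical columns
X_encoded = MLJ.transform(mach, X_train)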

Note that if you have no background in Stats/ML, you should probably start with one of the standard books, like "An Introduction to Statistical Learning". Starting with the code is probably not the right way to go.

tlienart commented 4 years ago

PS: your input is very welcome on MLJTutorials, as we need to help users like you get started there. The repo here is not really the appropriate place for that, however; hope that makes sense!

I'll close the issue here, but feel free to re-open it if you have specific problems, or to ask questions on MLJTutorials.