JuliaAI / DataScienceTutorials.jl

A set of tutorials to show how to use Julia for data science (DataFrames, MLJ, ...)
https://juliaai.github.io/DataScienceTutorials.jl/
MIT License

ISL labs #17

Closed tlienart closed 4 years ago

tlienart commented 4 years ago

Additional todos

@nignatiadis I'm tagging you here as you offered to review, but I'll only ping you again once I've gone through all of them. Thanks!

Things missing in MLJ (will add as I go through the labs)


tlienart commented 4 years ago

Right, the bulk of them are now translated; it would definitely benefit from a second pair of eyes (@vollmersj, @nignatiadis).

Some stuff to note:

Basically now that the base is there, we can add whatever we want to make these tutorials better and more interesting.

ablaom commented 4 years ago

Regarding bootstrapping, can the tutorials not use Bootstrapping.jl?

nignatiadis commented 4 years ago

Hi Thibaut,

I will start with Lab 2, although I guess there is not too much to be said about it. The main thing missing compared to the R labs is an intro to plotting in Julia (though this is already one of your bullets). Two minor remarks

tlienart commented 4 years ago

Thanks!! I think for the moment the tutorials assume at least basic knowledge of Julia and plotting, though I'll add a link to further resources and specify that backends other than PyPlot can be used. Maybe in the future there can be more hand-holding for users who are really new, but for now it seems premature.

What would definitely help, though, is just having more plots; it'll make the tutorials sexier and may give people examples of how to do things (effectively with matplotlib 😝).

@ablaom re Bootstrapping.jl: maybe, I'll see what can be done on that side.

tlienart commented 4 years ago

PS: thanks for the heads-up about the text and code alignment; this is actually not meant to be that way, will fix.

Update:

I stumbled upon a very time-consuming issue with ST, but now that I know what caused it, I'll get back to the tutorials, in particular using the new confusion matrix and ROC stuff.

tlienart commented 4 years ago

Did another full pass on ISL tutorials today and added lots of plots + goodies from MLJBase 0.8 (confusion matrix). It should be a fair bit better.
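
For reference, the sort of usage this refers to looks roughly like the following (a minimal sketch, not taken from the tutorials; it assumes a fitted probabilistic binary classifier machine `mach`, test row indices `test`, and ground-truth labels `y`):

```julia
using MLJ

ŷ_prob = predict(mach, rows=test)            # probabilistic predictions
ŷ      = mode.(ŷ_prob)                       # point predictions
cm     = confusion_matrix(ŷ, y[test])        # confusion matrix
fprs, tprs, ts = roc_curve(ŷ_prob, y[test])  # points for a ROC curve
```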

The most obvious missing piece that I can see is a tool to get the decision boundary for models where it's easy to obtain (e.g. SVM, DTC) and plot it. Apart from that, not much more than the comments already made, but of course it'd be great to have someone else's perspective 😄
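
In the meantime, a decision boundary can be drawn by hand by predicting over a grid; here is a minimal sketch (assuming a fitted MLJ classifier machine `mach` trained on two features named `:x1` and `:x2`, and Plots.jl; the feature names and plotting ranges are placeholders):

```julia
using MLJ, DataFrames, Plots

xs = range(-3, 3, length=150)
ys = range(-3, 3, length=150)

# one row per grid point, x2 varying fastest
grid = DataFrame(x1 = repeat(xs, inner=length(ys)),
                 x2 = repeat(ys, outer=length(xs)))

preds = predict_mode(mach, grid)                   # predicted class at each grid point
Z = reshape(int.(preds), length(ys), length(xs))   # integer class codes, ny × nx

contourf(xs, ys, Z, alpha=0.3, legend=false)       # filled regions = decision regions
```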

nignatiadis commented 4 years ago

I worked through Lab 3 and liked it! Here are some minor suggestions:

* Perhaps the lab could start with a univariate regression `lm(medv ~ lstat)` as in the ISLR lab? The univariate regression is also easier to visualize.
* In `round.(fp.coefs[1:3], sigdigits=3)` I would prefer to show all coefficients, or at least make it clear there is 1 coefficient per variable.
* For the polynomial example (and for the interaction example), maybe follow ISLR and use only LStat, LStat^2 (instead of adding LStat^2 to the full design matrix)? The polynomial fit here could also be visualized.

I also want to mention some other thoughts/ideas related to this lab that would, however, require more work [I am just mentioning them in case somebody --myself included-- gets interested in implementing them].

tlienart commented 4 years ago

Thanks for the feedback! I've added some opinionated comments below.

> In `round.(fp.coefs[1:3], sigdigits=3)` I would prefer to show all coefficients, or at least make it clear there is 1 coefficient per variable.

I'm not sure what you're suggesting here, would you want to show more explicitly what coefficient goes with what variable?

> For the polynomial example (and for interaction example) maybe follow ISLR and use only LStat, LStat^2 (instead of adding LStat^2 to the full design matrix)? The polynomial fit here could also be visualized.

I disagree with this: a polynomial regression is usually done the way I did it, though this may be hidden from the user. There could be internal support for polynomial regression, but it's not very interesting, because the case where you only have one explanatory variable is very toyish, and it's more difficult to come up with a simple API for a multivariate polynomial regression. Just adding transformed columns corresponds to what we're encouraging people to do (broaden/transform your data, then apply multiple models to it and compare/compose).

More interesting would be a way to support adding polynomially-transformed features that's a bit better than my home-made thing here.
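
Something along these lines, say (a rough sketch of a hypothetical helper, not an existing MLJ API):

```julia
using DataFrames

# Add columns col^2, ..., col^degree to `df` (hypothetical helper).
function add_poly!(df::DataFrame, col::Symbol, degree::Integer)
    for d in 2:degree
        df[!, Symbol(col, d)] = df[!, col] .^ d
    end
    return df
end

# e.g. add_poly!(X, :LStat, 3) adds columns :LStat2 and :LStat3
```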

> * Is there an interface point in MLJ for providing e.g., prediction intervals as in R's `predict(lm.fit, newdata=newdata, interval="prediction")`?

Well, you can make probabilistic predictions, in which case the output at every point is a Normal distribution, and you could show this, yes. I'll think about a visualisation for this.
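
Concretely, something like this would give interval-style output (a minimal sketch, assuming a fitted probabilistic regressor machine `mach` whose predictions are Normal distributions, and new data `Xnew`):

```julia
using MLJ, Distributions

preds = predict(mach, Xnew)      # vector of Normal distributions
μ  = mean.(preds)                # point predictions
lo = quantile.(preds, 0.025)     # lower bounds of ~95% prediction intervals
hi = quantile.(preds, 0.975)     # upper bounds
```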

> * I think standard errors of coefficients, p-values, etc. are important when introducing linear regression and both ESL/ISLR place a lot of emphasis on them (compared to other machine learning textbooks).

I'm strongly against this as I think they really encourage bad practices. But that's my opinion, of course. As an aside, that's also why I don't like ISL. A bit more reasonably: MLJ (for the moment) focuses on the ML-style fit/predict mechanism, not unlike sklearn, and so tries not to encode too many assumptions about your data/noise model etc. (which is what you need if you want p-values).

> * Are there any thoughts on integrating [StatsModels.jl](https://github.com/JuliaStats/StatsModels.jl) with MLJ? The ISLR formula syntax such as `lm(medv~lstat,data=Boston)` or `lm(medv~poly(lstat,5))` is a lot more convenient than having to set up the table yourself.

Both Anthony and I really dislike this syntax (but that's somewhat irrelevant). I do think, however, that you should see MLJ in a broader context than just linear models. A user who strongly wants this would likely be better off not using MLJ and instead focusing on GLM or, indeed, StatsModels, which provide a well-developed environment for this. This is not meant as a criticism, by the way! It's just that the purpose of MLJ is not to do it all but rather to focus on the fit/predict/transform and composition mechanisms.

Thanks again for the feedback!

tlienart commented 4 years ago

PS: it's funny, I spent a fair bit of time adding visualisations everywhere but seem to have forgotten ISL3 😢 Thanks for pointing it out, though!

nignatiadis commented 4 years ago

> I'm not sure what you're suggesting here, would you want to show more explicitly what coefficient goes with what variable?

Yes! Instead of only showing the first three coefficients.

Regarding the rest of the discussion, I thought of the tutorial as "enabling someone reading through ISL to do it in Julia (mostly through MLJ) instead of R", rather than "MLJ through ISL". I am also biased; The Elements of Statistical Learning is possibly my favorite textbook. Perhaps two more comments: I think the formula system could potentially be used as an alternative "syntax" to `MLJModels.FeatureSelector`, for those of us who like that syntax. Also, I do not think of one-dimensional regression problems as toy problems; e.g., only recently did we figure out how to adapt to local smoothness in a computationally efficient way (well, if we ignore wavelets), but I digress...
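
For context, the two styles being contrasted look roughly like this (a hedged sketch, only showing the feature-selection step; the Boston data is loaded via RDatasets and the column names follow its conventions):

```julia
using MLJ, GLM, RDatasets

boston = dataset("MASS", "Boston")

# MLJ style: select the relevant columns with a FeatureSelector
selector = FeatureSelector(features=[:LStat])
mach = machine(selector, boston)
MLJ.fit!(mach, verbosity=0)
X = MLJ.transform(mach, boston)

# StatsModels/GLM formula style: the formula does the selection
ols = lm(@formula(MedV ~ LStat), boston)
```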

In any case your remarks make a lot of sense, and I appreciate how well-designed MLJ is and that this would not be possible if MLJ tried to do "everything"!

nignatiadis commented 4 years ago

Just went over Lab 5; it looks good to me!

I also tried it, and it seems leave-one-out cross-validation works out of the box:

```julia
tm_loo = TunedModel(model=lrm, ranges=r, resampling=CV(nfolds=392), measure=rms)
```

A question I have: For OLS/Ridge one can do leave-one-out efficiently without recomputing the fit. What would be the suggested MLJ interface point for that? Would I define my own LOOCVTunedRidgeRegression model and handle things internally?
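
For the record, the closed-form shortcut itself is easy to write outside MLJ; how best to hook it into MLJ (e.g. as a custom model, as you suggest) is a separate question. A minimal sketch for ridge with a fixed penalty λ and no special intercept handling, where `X` is the design matrix and `y` the response:

```julia
using LinearAlgebra, Statistics

# Leave-one-out residuals without refitting n times:
# e₋ᵢ = (yᵢ - ŷᵢ) / (1 - Hᵢᵢ),  with  H = X (XᵀX + λI)⁻¹ Xᵀ
function ridge_loo_rmse(X::AbstractMatrix, y::AbstractVector, λ::Real)
    H = X * ((X'X + λ * I) \ X')
    ŷ = H * y
    e = (y .- ŷ) ./ (1 .- diag(H))
    return sqrt(mean(abs2, e))
end
```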

Bootstrapping, e.g., the polynomial regression coefficients seems to work without much trouble (not sure this is the most elegant way, but it reproduces the ISLR results):

```julia
using Bootstrap

Xhp = DataFrame(hp1=hp, hp2=hp.^2, hp3=hp.^3);
lrm.fs.features = [:hp1, :hp2] # poly of degree 2
lr2 = machine(lrm, Xhp, y)

n_boot = 1000
bs_res = bootstrap(rows -> fitted_params(fit!(lr2, rows=rows, verbosity=0)).fitted_params[1].coefs,
                   collect(1:392),
                   BasicSampling(n_boot))
```
tlienart commented 4 years ago

Thanks a lot for this super useful continued feedback!!

A few comments

Lab5