ISL labs - Githubissues

tlienart commented 4 years ago

[x] lab 2
[x] lab 3
[x] lab 4
[x] lab 5
~~lab 6a (requires forward/backward)~~
[x] lab 6b
~~lab 6c (requires PLS, PCR)~~
~~lab 7 (requires GAMs)~~
[x] lab 8 DT
[x] lab 9 SVM
[x] lab 10 PCA

Additional todos

[x] use confmat and similar tools from MLJBase 0.8.1
[x] draw ROC curves
[ ] other plots (?)

@nignatiadis am tagging you here as you offered to review but will only ping you again when I've gone through all of them. Thanks!

things missing in MLJ

will add as I go through the labs

conf mat + associated metrics (ongoing, see https://github.com/alan-turing-institute/MLJBase.jl/pull/88)
stuff to draw ROC curve
hierarchical clustering
stratified sampling (see issue https://github.com/alan-turing-institute/MLJ.jl/issues/108)
would be good to have mse on top of rms
bootstrapping
direct command for LOOCV ?
stuff to draw a nice graph of MSE vs Lambda for ridge

Rplot

tlienart commented 4 years ago

Right, the bulk of them are now translated, it would definitely benefit from a second pair of eyes (@vollmersj , @nignatiadis)

Some stuff to note:

Some of the things they call in R are just not available in the Julia world (yet), some are listed above, but feel free to add comments, I may also have overlooked stuff
The original R scripts have zero comments in case you were wondering why it's so terse on comments, but I'm keen to add more based on suggestions

Basically now that the base is there, we can add whatever we want to make these tutorials better and more interesting.

ablaom commented 4 years ago

Regarding, bootstrapping, can the tutorials not use Bootstrapping.jl?

nignatiadis commented 4 years ago

Hi Thibaut,

I will start with Lab 2, although I guess there is not too much to be said about it. The main thing missing compared to the R labs is an intro to plotting in Julia (though this is already one of your bullets). Two minor remarks

It seems throughout the Lab, first a code chunk is shown and then your text about it shows up below (see e.g., screenshot below). I would prefer it the other way around
"x = randn(1_000) # 500 points iid from a N(0, 1)": Typo, 1000 points

tlienart commented 4 years ago

Thanks!! I think for the moment the tutorials assume at least basic knowledge in Julia and plotting, though I'll add a link to further resources + specify that other backends than PyPlot can be used. Maybe in the future there can be more hand holding for users who are really new etc but for now it seems premature.

Though what definitely would help is to just have more plots, it'll make tutorials sexier and may help people get examples on how to do stuff (effectively with matplotlib 😝 )

@ablaom re Bootstrapping.jl , maybe, I'll try to see what can be done on that side

tlienart commented 4 years ago

PS: thanks for the heads up with respect to the text and the code alignment, this is actually not meant to be that way, will fix

Update:

now fixed the alignment issue
added some information about plotting packages at the end of LAB2
added some more plots
fixed the 1000/500 typo

I stumbled upon a very time consuming issue with ST but now that I know what caused it, I'll get back to the tutorials, in particular using the new confusion matrix and ROC stuff.

tlienart commented 4 years ago

Did another full pass on ISL tutorials today and added lots of plots + goodies from MLJBase 0.8 (confusion matrix). It should be a fair bit better.

The most obviously missing stuff that I can see is a tool to get the decision boundary for models for which it's easy to get it (e.g. SVM, DTC) and plot it. Apart from that, not much more than comments already made but of course it'd be great to have someone else's perspective 😄

nignatiadis commented 4 years ago

I worked through Lab 3 and liked it! Here are some minor suggestions: Perhaps the lab could start with a univariate regression lm(medv ~ lstat) as in the ISLR lab? The univariate regression is also easier to visualize. In round.(fp.coefs[1:3], sigdigits=3) I would prefer to show all coefficients, or at least make it clear there is 1 coefficient per variable. For the polynomial example (and for interaction example) maybe follow ISLR and use only LStat, LStat^2 (instead of adding LStat^2 to the full design matrix)? The polynomial fit here could also be visualized.

I also want to mention some other thoughts/ideas related to this lab, that however would require more work [and I am just mentioning them in case somebody --myself included-- gets interested in implementing them].

Is there an interface point in MLJ for providing e.g., prediction intervals as in R's predict(lm.fit, newdata=newdata, interval="prediction")?
I think standard errors of coefficients, p-values, etc. are important when introducing linear regression and both ESL/ISLR place a lot of emphasis on them (compared to other machine learning textbooks). One possibility would be to elaborate on the following line from the tutorial "here we will use MLJLinearModels but we could also have used GLM, ScikitLearn etc." by also demonstrating how to use GLM.jl towards the end. Then it could be explained that MLJ provides a unified interface, but e.g., the GLM.jl fit/summary can still be extracted and that gives access to p-values, etc.
Are there any thoughts on integrating StatsModels.jl with MLJ? The ISLR formula syntax such as lm(medv~lstat,data=Boston) or lm(medv~poly(lstat,5)) is a lot more convenient than having to set up the table yourself. Also it helps with categorical covariates. The formula system in StatsModels has been designed so as to be easy to extend and/or use in other packages.

tlienart commented 4 years ago

Thanks for the feedback! I add some opinionated comments below

In round.(fp.coefs[1:3], sigdigits=3) I would prefer to show all coefficients, or at least make it clear there is 1 coefficient per variable.

I'm not sure what you're suggesting here, would you want to show more explicitly what coefficient goes with what variable?

For the polynomial example (and for interaction example) maybe follow ISLR and use only LStat, LStat^2 (instead of adding LStat^2 to the full design matrix)? The polynomial fit here could also be visualized.

I disagree with this, a polynomial regression is usually done the way I did it, though this may be hidden to the user. There could be support internally to do a polynomial regression but it's not very interesting because the case where you only have one explanatory variable is very toyish and it's more difficult to come up with a simple API for a multivariate polynomial regression whereas just adding transformed columns corresponds to what we're encouraging people to do (broaden/transform your data then apply multiple models on it and compare/compose)

More interesting would be a way to support the addition of polynomially-transformed features that's a bit better than my home-made thing here.

* Is there an interface point in MLJ for providing e.g., prediction intervals as in R's `predict(lm.fit, newdata=newdata, interval="prediction")`?

Well you can make probabilistic prediction in which case the output at every point is a NormalDistribution and you could show this yes. I'll think about a visualisation for this.

* I think standard errors of coefficients, p-values, etc. are important when introducing linear regression and both ESL/ISLR place a lot of emphasis on them (compared to other machine learning textbooks).

I'm strongly against this as I think they really encourage bad practices. But that's my opinion of course. As an aside that's also why I don't like ISL. A bit more reasonably: MLJ (for the moment) focuses on the ML-style fit/predict mechanism not unlike Sklearn and so tries to not encode too many assumptions on your data/noise model etc (which is what you need if you want pvalues).

* Are there any thoughts on integrating [StatsModels.jl](https://github.com/JuliaStats/StatsModels.jl) with MLJ? The ISLR formula syntax such as `lm(medv~lstat,data=Boston)` or `lm(medv~poly(lstat,5))` is a lot more convenient than having to set up the table yourself.

Both Anthony and I really dislike this syntax (but that's somewhat irrelevant). I do think however that you should see MLJ in a broader context than just linear models. A user who would strongly want this would likely be better off not using MLJ I would think and focusing instead on GLM or, indeed, StatsModel which provide a well developped environment to do this. This is not meant as a criticism by the way! it's just that the purpose of MLJ is not to do it all but rather to focus on this fit/predict/transform and composition mechanisms.

Thanks again for the feedback!

tlienart commented 4 years ago

PS: it's funny I spent a fair bit of time adding a number of visualisations everywhere but seem to have forgotten ISL3 😢 thanks for pointing it out though!

nignatiadis commented 4 years ago

I'm not sure what you're suggesting here, would you want to show more explicitly what coefficient goes with what variable?

Yes! Instead of only showing the first three coefficients.

Regarding the rest of the discussion, I thought of the tutorial as "enabling someone reading through ISL to do it in Julia (mostly through MLJ) instead of R", rather than "MLJ through ISL". I am also biased; the Elements of Statistical learning is possibly my favorite textbook. Perhaps two more comments: I think the formula system could potentially be used as an alternative "syntax" to MLJModels.FeatureSelector; for those of us that like the syntax. Also I do not think of 1-dimensional regression problems as toy problems, e.g., only recently did we figure out how to adapt to local smoothness in a computationally efficient way (well, if we ignore wavelets), but I digress...

In any case your remarks make a lot of sense, and I appreciate how well-designed MLJ is and that this would not be possible if MLJ tried to do "everything"!

nignatiadis commented 4 years ago

Just went over Lab 5; it looks good to me!

I also tried and it seems leave-one-out cross-validation works out of the box.

tm_loo = TunedModel(model=lrm, ranges=r, resampling=CV(nfolds=392), measure=rms)

A question I have: For OLS/Ridge one can do leave-one-out efficiently without recomputing the fit. What would be the suggested MLJ interface point for that? Would I define my own LOOCVTunedRidgeRegression model and handle things internally?

Bootstrapping for e.g., the polynomial regression coefficients seems to work without a lot of trouble (not sure this is the most elegant way though, but reproduces ISLR results):

using Bootstrap

Xhp = DataFrame(hp1=hp, hp2=hp.^2, hp3=hp.^3);
lrm.fs.features = [:hp1, :hp2] # poly of degree 2
lr2 = machine(lrm, Xhp, y)

n_boot = 1000
bs_res = bootstrap(rows -> fitted_params(fit!(lr2, rows=rows, verbosity=0)).fitted_params[1].coefs,
                   collect(1:392),
                   BasicSampling(n_boot))

tlienart commented 4 years ago

Thanks a lot for this super useful continued feedback!!

A few comments

using the formulas for feature selector -- that's actually quite an interesting idea, I'm not sure how to do it but it could be nice 🤔
rest of the discussion: yeah I think the key point is to have a core that is "well done" and coherent and then support extra stuff if it makes sense globally, otherwise re-direct towards specific packages, at least that's what I'd think

Lab5

yes LOOCV specifying n-1 folds works, I was merely referring to having a command to do it directly like resampling=LOOCV() or something
ridgeCV: I'm currently working on this for MLJLinearModels
super interesting for the boostrap!! thanks a lot for that, I'll look into it

JuliaAI / DataScienceTutorials.jl

ISL labs #17

Additional todos

things missing in MLJ

Lab5