Right, the bulk of them are now translated, it would definitely benefit from a second pair of eyes (@vollmersj , @nignatiadis)
Some stuff to note:
Basically now that the base is there, we can add whatever we want to make these tutorials better and more interesting.
Regarding bootstrapping, can the tutorials not use Bootstrapping.jl?
Hi Thibaut,
I will start with Lab 2, although I guess there is not too much to be said about it. The main thing missing compared to the R labs is an intro to plotting in Julia (though this is already one of your bullets). Two minor remarks:
It seems that throughout the lab, a code chunk is shown first and then your text about it shows up below (see e.g. the screenshot below); I would prefer it the other way around.
"x = randn(1_000) # 500 points iid from a N(0, 1)": Typo, 1000 points
Thanks!! I think for the moment the tutorials assume at least basic knowledge of Julia and plotting, though I'll add a link to further resources and specify that backends other than PyPlot can be used. Maybe in the future there can be more hand-holding for users who are really new, etc., but for now it seems premature.
Though what would definitely help is to just have more plots; it'll make the tutorials sexier and may help people get examples of how to do stuff (effectively with matplotlib 😝).
@ablaom re Bootstrapping.jl , maybe, I'll try to see what can be done on that side
PS: thanks for the heads up with respect to the text and the code alignment, this is actually not meant to be that way, will fix
Update:
I stumbled upon a very time-consuming issue with ST, but now that I know what caused it, I'll get back to the tutorials, in particular using the new confusion matrix and ROC stuff.
Did another full pass on ISL tutorials today and added lots of plots + goodies from MLJBase 0.8 (confusion matrix). It should be a fair bit better.
The most obviously missing thing that I can see is a tool to get the decision boundary for models where it's easy to obtain (e.g. SVM, DTC) and plot it; see the sketch below. Apart from that, not much more than the comments already made, but of course it'd be great to have someone else's perspective 😄
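For concreteness, a rough sketch of what such a helper might look like (purely illustrative, not an existing MLJ function; it assumes a fitted probabilistic classifier machine `mach` trained on two numeric features whose column names are passed via `names`):

```julia
using MLJ, DataFrames, CategoricalArrays, PyPlot

# Evaluate a fitted two-feature classifier on a grid and shade the predicted class.
function plot_decision_boundary(mach, x1, x2; steps=200, names=(:x1, :x2))
    r1 = range(minimum(x1), maximum(x1), length=steps)
    r2 = range(minimum(x2), maximum(x2), length=steps)
    # grid point k corresponds to (r1[i], r2[j]) with i varying fastest (column-major)
    grid = DataFrame(names[1] => vec([a for a in r1, b in r2]),
                     names[2] => vec([b for a in r1, b in r2]))
    preds = predict_mode(mach, grid)              # hard class per grid point
    Z = reshape(levelcode.(preds), steps, steps)  # integer code of each class
    # matplotlib wants Z with shape (length(r2), length(r1)), hence the transpose
    contourf(collect(r1), collect(r2), permutedims(Z), alpha=0.3)
    scatter(x1, x2)
end
```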
I worked through Lab 3 and liked it! Here are some minor suggestions:

* Perhaps the lab could start with a univariate regression `lm(medv ~ lstat)` as in the ISLR lab? The univariate regression is also easier to visualize.
* In `round.(fp.coefs[1:3], sigdigits=3)` I would prefer to show all coefficients, or at least make it clear there is 1 coefficient per variable.
* For the polynomial example (and for the interaction example), maybe follow ISLR and use only LStat, LStat^2 (instead of adding LStat^2 to the full design matrix)? The polynomial fit here could also be visualized.
I also want to mention some other thoughts/ideas related to this lab that would, however, require more work [and I am just mentioning them in case somebody -- myself included -- gets interested in implementing them]:

* Is there an interface point in MLJ for providing e.g., prediction intervals as in R's `predict(lm.fit, newdata=newdata, interval="prediction")`?
* I think standard errors of coefficients, p-values, etc. are important when introducing linear regression and both ESL/ISLR place a lot of emphasis on them (compared to other machine learning textbooks).
* Are there any thoughts on integrating [StatsModels.jl](https://github.com/JuliaStats/StatsModels.jl) with MLJ? The ISLR formula syntax such as `lm(medv~lstat,data=Boston)` or `lm(medv~poly(lstat,5))` is a lot more convenient than having to set up the table yourself. It also helps with categorical covariates, and the formula system in StatsModels has been designed so as to be easy to extend and/or use in other packages.

Thanks for the feedback! I add some opinionated comments below.
> In `round.(fp.coefs[1:3], sigdigits=3)` I would prefer to show all coefficients, or at least make it clear there is 1 coefficient per variable.
I'm not sure what you're suggesting here, would you want to show more explicitly what coefficient goes with what variable?
> For the polynomial example (and for interaction example) maybe follow ISLR and use only LStat, LStat^2 (instead of adding LStat^2 to the full design matrix)? The polynomial fit here could also be visualized.
I disagree with this: a polynomial regression is usually done the way I did it, even if that may be hidden from the user. There could be internal support for polynomial regression, but it's not very interesting: the case where you only have one explanatory variable is very toyish, and it's more difficult to come up with a simple API for a multivariate polynomial regression, whereas just adding transformed columns corresponds to what we're encouraging people to do (broaden/transform your data, then apply multiple models on it and compare/compose).
More interesting would be a way to support the addition of polynomially-transformed features that's a bit better than my home-made thing here; something along the lines of the sketch below.
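For instance, a small helper along these lines (just a sketch; the function name is made up and it assumes the downstream model is happy with any DataFrame):

```julia
using DataFrames

# Augment a table with powers of a chosen column, so that a plain linear model
# fitted on the result is a polynomial regression in that variable.
function add_poly_features(df::DataFrame, col::Symbol, degree::Integer)
    out = copy(df)
    for d in 2:degree
        out[!, Symbol(col, d)] = df[!, col] .^ d
    end
    return out
end

# e.g. add_poly_features(X, :lstat, 3) adds columns :lstat2 and :lstat3
```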
> * Is there an interface point in MLJ for providing e.g., prediction intervals as in R's `predict(lm.fit, newdata=newdata, interval="prediction")`?
Well, you can make probabilistic predictions, in which case the output at every point is a Normal distribution, and you could show this, yes. I'll think about a visualisation for this.
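Something along these lines should already work (just a sketch, assuming `mach` is a fitted machine wrapping a probabilistic regressor such as GLM's `LinearRegressor`, and `Xnew` is a table of new points):

```julia
using MLJ, Distributions

preds = predict(mach, Xnew)      # one Normal distribution per new point
mid   = mean.(preds)             # point predictions
lo    = quantile.(preds, 0.025)  # lower end of a 95% prediction interval
hi    = quantile.(preds, 0.975)  # upper end of a 95% prediction interval
```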
> * I think standard errors of coefficients, p-values, etc. are important when introducing linear regression and both ESL/ISLR place a lot of emphasis on them (compared to other machine learning textbooks).
I'm strongly against this as I think they really encourage bad practices, but that's my opinion of course. As an aside, that's also why I don't like ISL. A bit more reasonably: MLJ (for the moment) focuses on the ML-style fit/predict mechanism, not unlike Sklearn, and so tries not to encode too many assumptions about your data/noise model etc. (which is what you need if you want p-values).
> * Are there any thoughts on integrating [StatsModels.jl](https://github.com/JuliaStats/StatsModels.jl) with MLJ? The ISLR formula syntax such as `lm(medv~lstat,data=Boston)` or `lm(medv~poly(lstat,5))` is a lot more convenient than having to set up the table yourself.
Both Anthony and I really dislike this syntax (but that's somewhat irrelevant). I do think, however, that you should see MLJ in a broader context than just linear models. A user who strongly wants this would likely be better off not using MLJ and instead focusing on GLM or, indeed, StatsModels, which provide a well-developed environment for this. This is not meant as a criticism, by the way! It's just that the purpose of MLJ is not to do it all but rather to focus on the fit/predict/transform and composition mechanisms.
Thanks again for the feedback!
PS: it's funny I spent a fair bit of time adding a number of visualisations everywhere but seem to have forgotten ISL3 😢 thanks for pointing it out though!
> I'm not sure what you're suggesting here, would you want to show more explicitly what coefficient goes with what variable?
Yes! Instead of only showing the first three coefficients.
Regarding the rest of the discussion, I thought of the tutorial as "enabling someone reading through ISL to do it in Julia (mostly through MLJ) instead of R", rather than "MLJ through ISL". I am also biased; the Elements of Statistical Learning is possibly my favorite textbook. Perhaps two more comments: I think the formula system could potentially be used as an alternative "syntax" to `MLJModels.FeatureSelector`, for those of us who like that syntax (roughly along the lines of the sketch below). Also, I do not think of 1-dimensional regression problems as toy problems; e.g., only recently did we figure out how to adapt to local smoothness in a computationally efficient way (well, if we ignore wavelets), but I digress...
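To make the `FeatureSelector` comparison concrete, here is roughly what the column-selection step looks like today (a sketch; `X` is assumed to be the Boston table), which a formula right-hand side like `lstat + age` would express more compactly:

```julia
using MLJ

fs   = FeatureSelector(features=[:lstat, :age])  # keep only these two columns
mach = fit!(machine(fs, X))                      # fitting just records which columns to keep
Xsel = transform(mach, X)                        # analogue of the RHS of medv ~ lstat + age
```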
In any case your remarks make a lot of sense, and I appreciate how well-designed MLJ is and that this would not be possible if MLJ tried to do "everything"!
Just went over Lab 5; it looks good to me!
I also tried, and it seems leave-one-out cross-validation works out of the box:

```julia
tm_loo = TunedModel(model=lrm, ranges=r, resampling=CV(nfolds=392), measure=rms)
```
A question I have: for OLS/Ridge one can do leave-one-out efficiently without recomputing the fit. What would be the suggested MLJ interface point for that? Would I define my own `LOOCVTunedRidgeRegression` model and handle things internally?
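For reference, the closed-form computation itself is short; here is a sketch outside of MLJ, using the standard identity that the leave-one-out residual of a ridge fit is the in-sample residual divided by one minus the leverage (the hat-matrix diagonal):

```julia
using LinearAlgebra, Statistics

# Leave-one-out MSE for ridge regression without refitting:
# with ŷ = H y and H = X (X'X + λI)⁻¹ X', the LOO residuals are e_i / (1 - H_ii).
function ridge_loocv_mse(X::AbstractMatrix, y::AbstractVector, λ::Real)
    H = X * ((X'X + λ * I) \ X')   # hat ("smoother") matrix
    e = y .- H * y                 # in-sample residuals
    h = diag(H)                    # leverages
    return mean((e ./ (1 .- h)) .^ 2)
end
```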
Bootstrapping for, e.g., the polynomial regression coefficients seems to work without a lot of trouble (not sure this is the most elegant way, but it reproduces the ISLR results):
```julia
using Bootstrap

Xhp = DataFrame(hp1=hp, hp2=hp.^2, hp3=hp.^3);
lrm.fs.features = [:hp1, :hp2]  # poly of degree 2
lr2 = machine(lrm, Xhp, y)
n_boot = 1000
bs_res = bootstrap(rows -> fitted_params(fit!(lr2, rows=rows, verbosity=0)).fitted_params[1].coefs,
                   collect(1:392),
                   BasicSampling(n_boot))
```
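If useful, Bootstrap.jl should then (as far as I understand its API, worth double-checking) give the coefficient standard errors and confidence intervals directly:

```julia
stderror(bs_res)                     # bootstrap standard error per coefficient
confint(bs_res, BasicConfInt(0.95))  # 95% basic bootstrap confidence intervals
```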
Thanks a lot for this super useful continued feedback!!
A few comments:

`n-1` folds works, I was merely referring to having a command to do it directly, like `resampling=LOOCV()` or something.
* lab 6a (requires forward/backward)
* lab 6c (requires PLS, PCR)
* lab 7 (requires GAMs)

Additional todos / things missing in MLJ (will add as I go through the labs):

* `mse` on top of `rms`

@nignatiadis am tagging you here as you offered to review but will only ping you again when I've gone through all of them. Thanks!