caseykneale / ChemometricsTools.jl

A collection of tools for chemometrics and machine learning written in Julia.

Regression: use \ #7

Closed tlienart closed 5 years ago

tlienart commented 5 years ago

Just having a look at some of the code (nice piece of work btw) and in ClassicLeastSquares (*) you have

Z = (Bias) ? hcat( repeat( [ 1 ], size( X )[ 1 ] ), X ) : X
return ClassicLeastSquares(Base.inv(Z' * Z) * Z' * Y, Bias)

I think that's quite inefficient on several levels:

  1. you probably want to use ones instead of repeat(...)
  2. you probably want to check that the type of the ones corresponds to the eltype of X
  3. you really don't want to use Base.inv here and should use Z\Y
  4. you could make it more efficient by skipping a step
Bias || return ClassicLeastSquares(X \ Y, Bias)
return ClassicLeastSquares(hcat(ones(eltype(X), size(X, 1)), X) \ Y, Bias)

As soon as you have a large number of rows and/or columns, the above comments matter a fair bit.

It's probably a good idea to also distinguish between the main coefficients and the intercept, so that your PredictFn doesn't need to reconstruct the design matrix with hcat (costly).
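To sketch what that separation might look like (hypothetical struct and field names, not the package's actual types):

```julia
# Hypothetical container: slopes and intercept kept separate, so that
# prediction never has to rebuild the [1 X] design matrix.
struct OLSFit
    coefficients::Vector{Float64}
    intercept::Float64
end

function fit_ols(X::AbstractMatrix, y::AbstractVector; bias::Bool = true)
    bias || return OLSFit(X \ y, 0.0)
    beta = hcat(ones(eltype(X), size(X, 1)), X) \ y
    return OLSFit(beta[2:end], beta[1])   # split intercept from slopes
end

# One mat-vec plus a scalar broadcast; no hcat at predict time.
predict_ols(m::OLSFit, X::AbstractMatrix) = X * m.coefficients .+ m.intercept
```

The concatenation then happens once at fit time rather than on every prediction call.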

These comments apply for the rest of the file too.

(*) minor remark, why not OrdinaryLeastSquares which I believe is the standard name with standard abbreviation OLS?


Edit: side note, why not integrate more with mature packages like Clustering.jl ?

caseykneale commented 5 years ago

Thanks for checking the code, tlienart! Also thank you for focusing on one of the core functions in the regression code base! I agree: it should be ones, not repeat, and yes, the backslash operator can be used instead. I'll fix that tonight.

As for OLS vs CLS: in chemometrics, OLS is called CLS to distinguish it from ILS (the directionality of the problem statement matters). I'll make a wrapper for OLS though because, you're right, that's what most people call it.

Please feel free to grok other files and make suggestions. Most of this was thrown together very quickly. I plan on really hitting performance and end-user features in later releases, but with things like this, why not start now? Simple enough. I think I wrote 95% of the regression methods in a single day, and it shows.

As for integrating with Clustering.jl, well, I'm not sure what that would involve. I see two or three packages I could symbiotically exist with, but I don't know the developers or how much free time I really have for work beyond what I want to do (day-job struggles). Some of the packages change a lot from release to release.

caseykneale commented 5 years ago

@tlienart - I'm actually on the fence about this. The rigorous equation for LS regression solves for B = (X'X)^-1 X'Y.

This is similar to how we do a pseudoinverse, right (Moore-Penrose)? It lets us regress data of lower rank (although maybe we shouldn't) without hitting a singular exception, compared to doing the naive: Y = BX ->>> Y X^-1 = B (or whatever direction you choose to set up the problem).

If we abide by the typical covariance expression for OLS, then the heavier computational load sits in the matrix multiplications rather than in a horizontal concatenation. I'm still totally flexible to changing things, but I'm leaning more towards the correct equation in clean code than towards speed.

If it matters to someone I know they can write OLS in 1 fn call the way they want it baked.
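For reference, the two routes target the same minimiser; the normal equations drop out of setting the gradient of the squared loss to zero, which is exactly the problem `\` solves (via QR) without ever forming X'X:

```latex
\min_{B}\;\lVert XB - Y\rVert_2^2
\;\Longrightarrow\;
\nabla_B\,\lVert XB - Y\rVert_2^2 = 2\,X^{\top}\left(XB - Y\right) = 0
\;\Longrightarrow\;
X^{\top}X\,B = X^{\top}Y .
```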

tlienart commented 5 years ago

A\b is the right way to go: it corresponds exactly to finding x such that the l2 loss |Ax-b|_2 is minimised. If you look at code from any other packages that do regression you'll see that they use \ (there are quite a few; in Julia you could take https://github.com/lindahua/Regression.jl for an old one, and this holds outside of Julia too).

It's not just a matter of speed. It's just the right thing to do.
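A quick standalone check of that claim (not code from either package): on a well-conditioned overdetermined system, `\` and the explicit normal equations agree to machine precision, but `\` never forms or inverts X'X:

```julia
using LinearAlgebra, Random

Random.seed!(1)
X = randn(100, 3)                      # 100 rows, full column rank
y = X * [1.0, -2.0, 0.5] .+ 0.01 .* randn(100)

b_backslash = X \ y                    # QR-based least-squares solve
b_normal    = inv(X' * X) * (X' * y)   # explicit normal equations

# Both minimise |Xb - y|_2; the difference is pure floating-point noise.
println(norm(b_backslash - b_normal))
```

Forming X'X also squares the condition number of the problem, which is why the QR route is preferred even when both succeed.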

Re your comment on Clustering, I think it's up to you in a way. If you just want to write a package for yourself, investigate all these ML tools, and use it on small datasets that are relevant to you, then you can safely ignore my comments and just go for it. However, if you want a package that will be used by a wider audience, then you should pay attention to state-of-the-art implementations. For instance, kmeans in Clustering.jl (which I recently helped refactor, btw) uses a specific algorithm (k-means++) for seeding kmeans; most big ML packages out there implement this in some fashion. Not doing so will "work" but is much slower on average.

While it's true that some packages in the Julia ML ecosystem are under a fair bit of active dev, and APIs sometimes change, that's not the case for things like Clustering.jl, OnlineStats, Flux, DecisionTree, etc., where the API is rather stable even though the algorithms in the background may still be improved. That doesn't mean those packages are perfect, just that a fair bit of work has gone into them. Starting from scratch is doable, and may even lead to better algos if you're determined, but doing that for many algos represents a huge amount of work; that's why it might make more sense to integrate with existing packages for some of the basic algos.

As a side comment, I'm remotely involved in MLJ, which is attempting to build a common Julia framework for ML and integrate with ML packages. A lot of thinking has gone into getting pipelines etc. designed "right", and while it's still very much WIP, it has already gained a fair bit of attention (there are also people being paid to develop it, which may help push development). Again, this by no means prevents you from going on with chemometrics, but you may find it interesting to keep an eye on what goes on there, and on which design decisions were taken and why (again under the working hypothesis that you care about your package being useful to a reasonably large group of people).

Just my 2 cents : )

caseykneale commented 5 years ago

My main goal here is to switch chemometricians over to Julia, so I want this package to have all the tools they could ask for in-house. But you're right: being more widely applicable to machine learning would be more useful for the user base at large.

I really don't want to be inflexible, but the correct equation is what I stated before, and the backslash operator is not the same. I've tested it; there is a numerical difference between the results when the rank of X is low. I could do (X'X) \ (X'Y). For now I think I'll keep my regression-coefficient expression the same, but I did implement the other changes you suggested; they were good ones. I will revisit changes like this once I have filled in all the functionality I want available. There are going to be some structural changes to include NNLS constraints inside many methods, once I work out the best design I can.
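The low-rank divergence mentioned above is easy to reproduce in isolation (a standalone illustration, not package code): with an exactly duplicated column, inv(X'X) throws outright, while \ and pinv both return valid least-squares solutions, which need not be the same coefficient vector:

```julia
using LinearAlgebra

X = [1.0 2.0 2.0;
     3.0 4.0 4.0;
     5.0 6.0 6.0;
     7.0 8.0 8.0]            # columns 2 and 3 identical => rank(X) == 2
y = [1.0, 2.0, 3.0, 4.0]

# The textbook formula fails outright here: X'X is exactly singular.
xtx_is_singular = try
    inv(X' * X)
    false
catch err
    err isa SingularException
end

b_qr   = X \ y               # pivoted-QR least-squares solution
b_pinv = pinv(X) * y         # Moore-Penrose: the minimum-norm solution

# Both achieve the same (here zero) residual despite the rank deficiency.
println((xtx_is_singular, norm(X * b_qr - y), norm(X * b_pinv - y)))
```

So the two routes can legitimately return different coefficient vectors at low rank, but every such vector minimises the same residual; the disagreement is about which minimiser is picked, not about correctness of the fit.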

Well, I want my package to be useful to other people. I've seen MLJ. If you guys want to borrow any code from my project, please feel free; it's MIT-licensed, and I have lots of metrics. Yeah, I know my kmeans is dinky and not the efficient way, trust me, but it is included for one reason: does someone who just wants to run a quick kmeans while doing things in this package really want to pkg add and then using another package? Probably not. If someone wants a scalable kmeans, they shouldn't be using this; I believe the docs specify it is MacQueen's algorithm.

Currently the time it takes to build/precompile this package is pretty long on an average computer (1-2 minutes), and while I've kept the dependencies pretty minimal, I want to shrink them even more. I think most of that is docs compilation; if I could slim that down, I'd be more willing to include other people's packages.

I didn't know DecisionTree.jl existed. I'll have to do some performance comparisons between it and my algorithm. My first few cracks at decision trees and tree ensembles were painfully slow because of how Julia handled the recursion patterns I wrote; I'd like to see how they overcame that. My solution was to write a nonrecursive variant.
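For anyone curious, the recursion-free idea for the prediction side can be sketched like this (a toy node type with made-up fields, not the actual ChemometricsTools implementation): descend with a plain while loop instead of recursive calls, so deep trees cost no call-stack frames:

```julia
# Toy binary decision-tree node; field names are illustrative only.
struct TreeNode
    feature::Int                       # feature index to split on; 0 marks a leaf
    threshold::Float64
    left::Union{TreeNode, Nothing}
    right::Union{TreeNode, Nothing}
    value::Float64                     # prediction stored at leaves
end

isleaf(n::TreeNode) = n.feature == 0

# No recursion: each prediction is a simple root-to-leaf loop.
function predict_tree(root::TreeNode, x::AbstractVector)
    node = root
    while !isleaf(node)
        node = x[node.feature] <= node.threshold ? node.left : node.right
    end
    return node.value
end
```

Training can be made nonrecursive the same way, with an explicit stack or queue of (node, row-subset) work items instead of recursive split calls.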

caseykneale commented 5 years ago

Also, writing algorithms from scratch isn't really work for me. I do this as a labor of love, and it keeps me fresh for understanding the tools I use on a daily basis (I use this package daily). Each model has assumptions, and I'm glad to know them: especially when things fail, when two models offer synonymous results, and when comparing them for ensemble building. A lot of practitioners don't think like this and can be outsourced to a for-loop of methods. That's also why it's important to me that the equations in the methods are as legible as possible while still performant, so people can learn from the code base while they self-study.

tlienart commented 5 years ago

Up to you.

As for '\', I would encourage you to ask the question on the Discourse forum, where more people would be able to chip in. Just as a side note, though: lindahua, who wrote Regression.jl, is not exactly a novice in numerical methods.

Anyway, I think I can close this issue. I wish you luck with the rest of the package; it will likely be a good project and a good way to learn about these things.