brownplt / pyret-lang

The Pyret language.

More stat #1745

Open blerner opened 1 month ago

blerner commented 1 month ago

Creating a PR to make it easier to review the changes

blerner commented 1 month ago

All of our "real life" data will be drawn from spreadsheets, and I don't know whether our default is to make numbers be rough or exact when we import them.

I don't think it's bad to say that estimators are always rough, but we should make a conscious design decision about that. @schanzer, thoughts?


schanzer commented 1 month ago

We don't coerce to roughnums unless we have to, so it's not safe to assume that all the numbers operated on by this function will be rough.

As for the output, we already assume the coefficients will be roughnums.
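Purely as an illustration of that behavior (not code from this PR), this is how exactness propagates in Pyret: exact numbers stay exact until a rough value enters the computation, and any mix comes out rough.

```
check:
  num-is-roughnum(1/3) is false            # exact rational
  num-is-roughnum(~0.33) is true           # roughnum literal
  num-is-roughnum((1/3) + ~0.33) is true   # mixing exact and rough yields rough
end
```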

ds26gte commented 1 month ago

It looks like for Bootstrap purposes, @schanzer will be finding out the coefficients of the generated predictor function. (This also appears to help hide the full effulgence of the function from the user, who would otherwise have had to deal with the list type.)

Calculating the coefficients (viz., y-intercept and slope) for the single-independent-var case was easy enough by calling the function at a couple of x's, namely 0 and 1 (see the sketch right after this list). However, doing something similar for the multiple-indep-var case is

  1. painful
  2. unnecessary -- since we already went through that whole matrix morass to create precisely this vector of coefficients, before converting that information into a predictor function
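For concreteness, the single-var trick works because the predictor is affine, so two evaluations recover both coefficients:

$$f(x) = mx + b \quad\Longrightarrow\quad b = f(0), \qquad m = f(1) - f(0).$$

With $k$ independent variables the predictor is $f(x_1, \dots, x_k) = b + m_1 x_1 + \dots + m_k x_k$, so recovering all the $m_i$ this way would take $k + 1$ carefully chosen evaluations, which is the painful part.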

So it stands to reason that it's probably a Good Thing that our implementation supply both this coefficient vector (as a tuple?) and the predictor function. Or we can have two functions: one that supplies both coeff vector and predictor function, and one that is more traditional and supplies just the predictor function. (The second function internally calls the first.)

I suggest the name multiple-regression-model() for the first function, retaining multiple-regression() for the second. Both would take the same parameters: a list of tuples of numbers, and a list of numbers.

What say?

blerner commented 1 month ago

Question: is this latest rewrite (of linear into multiple-linear) a good thing? It streamlines the code, yes, but also slows down the simple linear case by making it have a bunch of matrix and tuple overhead... I'm honestly torn on whether this is a net win.

shriram commented 1 month ago

@ds26gte benchmarked it and claimed it was actually faster to use the multiple-variable version?

blerner commented 1 month ago

That would be pretty depressing if true -- it saves (if I've counted right) 1 Pyret-level map, and replaces it with O(n) tuple allocations and a bunch of matrix loops (all of which are 1 iteration long, granted, but still). I don't see how that can/should be faster!

ds26gte commented 1 month ago

The OG Pyret version does the arithmetic in Pyret. The new version does the same arithmetic, but in JS. There is an extra creation of singleton tuples, true, but there are also intermediate Pyret lists created in the Pyret version that are not needed in the JS version. The Pyret version also traverses the x- and y-lists multiple times (to convert them to roughnums, to dot-product them, to sum each of them, and to sum the squares of the x-list). It also calls mean (yet another Pyret function) on them, which adds two more traversals that aren't visible in the function body.

The Pyret version could possibly be tightened by avoiding repeated summing of the x-lists at least (I counted three unnecessary recalculations, including the mean). But even so, it's still doing a lot that's more efficiently done in JS.
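For reference (this is the textbook closed form, not the library code verbatim), the quantities the single-var computation needs are:

$$\hat{m} = \frac{\sum_i x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_i x_i^{2} - n\,\bar{x}^{2}}, \qquad \hat{b} = \bar{y} - \hat{m}\,\bar{x},$$

so a direct translation that doesn't fuse its traversals ends up walking the lists once per sum and twice more for the means.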

ds26gte commented 1 month ago

Here are some numbers to go with the time comparison between the OG Pyret implementation of linear-regression() and the newer one that implements it using multiple-regression().

I used two different data sets, one with 7 readings and the other with 14, to see whether the time difference depends on data size. To avoid noise from Pyret's start-up time, I ran the "find predictor and use it" program fragment inside a loop, first for 10 iterations and then for 10000. The difference in favor of using multiple regression shows up already at 10 iterations. (It is not surprising that 10000 iterations widen the difference, since the start-up noise is amortized even further, but it confirms that the difference in the 10-iteration case is not a fluke. The fact that there is a positive difference at 10 iterations shows that you don't need huge data sets to see the improvement.)
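For context, the timed fragment has roughly the following shape (a sketch only, not the actual benchmark script; it assumes the statistics trove's linear-regression(), which takes two lists of numbers and returns a predictor function):

```
include statistics

xs = [list: 1, 2, 3, 4, 5, 6, 7]
ys = [list: 2, 4, 5, 4, 5, 7, 9]

# "Find predictor and use it", repeated so that Pyret's start-up time
# is amortized away from the measurement.
for each(i from range(0, 10)):
  predictor = linear-regression(xs, ys)
  predictor(8)
end
```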

The difference may not be earth-shattering, but it is clearly in favor of using multiple regression always, and using just the latter obviously translates to easier maintenance for our devs.

| Implementation | Data points | Iterations | real | user | sys |
| --- | --- | --- | --- | --- | --- |
| Pyret (original) | 7 | 10 | 0m1.852s | 0m1.926s | 0m0.448s |
| Multiple regression | 7 | 10 | 0m1.849s | 0m1.958s | 0m0.438s |
| Pyret (original) | 7 | 10000 | 0m2.401s | 0m2.721s | 0m0.449s |
| Multiple regression | 7 | 10000 | 0m2.017s | 0m2.290s | 0m0.447s |
| Pyret (original) | 14 | 10 | 0m1.849s | 0m1.956s | 0m0.433s |
| Multiple regression | 14 | 10 | 0m1.834s | 0m1.937s | 0m0.446s |
| Pyret (original) | 14 | 10000 | 0m2.754s | 0m3.041s | 0m0.490s |
| Multiple regression | 14 | 10000 | 0m2.076s | 0m2.396s | 0m0.450s |

schanzer commented 1 month ago

Thanks for the detailed analysis!

This more than does the trick for Bootstrap! I will leave it to wiser heads to determine whether or not it's ready to merge.

ds26gte commented 1 month ago

I'm going to kick myself for saying this, but we (i.e., I) could in theory implement linear-regression() too as its own JS function in the trove JS file, and it can come to its result without using any matrix manipulation whatsoever (however minimal). That will probably be palpably faster than doing a multiple-regression of 1 independent var.

If single-independent regression is used often enough to merit its own dedicated, faster implementation, then this may be the way to go.

schanzer commented 1 month ago

I hate to say it, but I think it's worthwhile.

Single-independent regression is overwhelmingly the most common case. It's literally 100% of the cases for Bootstrap right now, and when multi-independent comes online it's still going to be only for the most ambitious of teachers.

ds26gte commented 1 month ago

P.S. I wrote up the single-independent linear regression in JS (without using matrices), and for 10 iterations and a data size of 7, the multiple-regression version still trumps it (I can't explain why this should be). However, for everything else, i.e.,

7 data points, 10000 iterations
14 d, 10 i
14 d, 10000 i

the dedicated no-matrix JS wins.

ds26gte commented 1 month ago

PS:

  1. I don't know why all of arr/trove/statistics.arr can't be moved to js/trove/*.js except for wrappers
  2. And, furthermore, why all of arr/trove/*.arr can't be moved to js/trove/*.js

This should make many things faster, unless we want to be able to claim that "most of Pyret is written in Pyret itself".

schanzer commented 1 month ago

I defer to others on that one, but that feels like it would be a separate PR anyway.

blerner commented 1 month ago

This is largely a waste of effort right now. Writing libraries in JS that are stack-safe and timeout-safe is painful and error prone. Anything dealing with Pyret data values in raw JS is tedious. Ensuring plausible error checking, likewise. And the existing trove libraries exist already. Rewriting libraries in typescript is part of the goal of anchor. Don't go wasting time duplicating work.

ds26gte commented 1 month ago

Is the current state of the PR ok, or should I roll back the last change, the one that defines single-var regression as an instance of multiple-var regression?

blerner commented 1 month ago

I thought your antepenultimate comment said you'd reimplemented single-variable in JS without matrices, and it was the fastest yet? (Except in the smallest of cases, which I'm not too worried about.)

My main concern is testing -- if you have two distinct implementations, you can test the one against the other. If there's only one implementation, then I'd like to see more tests for it to confirm it works well. (E.g., permuting the independent variables should permute the derived coefficients but give the same overall result, or detecting singular cases should work properly.) I don't know enough stats to know what a thorough test suite should be. I'd feel better if the matrix math itself were testable, but we're deliberately obscuring that for now...
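A sketch of what one such property test might look like, assuming multiple-regression is exposed from the statistics trove with the parameters discussed above and that its predictor takes a tuple of independent values (both of these are assumptions, not the settled API):

```
include statistics

check "permuting the independent variables should not change predictions":
  xs = [list: {1; 10}, {2; 9}, {3; 14}, {4; 12}]
  ys = [list: 5, 4, 11, 10]
  # Swap the two independent variables in every reading.
  swapped = xs.map(lam(t): {t.{1}; t.{0}} end)

  f = multiple-regression(xs, ys)
  g = multiple-regression(swapped, ys)

  # The coefficients should be permuted, but predictions should agree
  # (up to roughnum tolerance).
  f({5; 11}) is-roughly g({11; 5})
end
```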

ds26gte commented 1 month ago

The currently publicly visible PR contains a definition of multiple-regression() (using matrices, there's no other workable option) and re-defines linear-regression() as a call to multiple-regression() of one variable. It is therefore a single implementation that always uses matrices, and therefore requires only one maintenance track.

I can probably write a math proof of why it translates exactly to the non-matrix implementation. The numerical agreement between the two is exact for the example we have.

I have privately implemented linear-regression() as its own implementation that uses JavaScript, but it is identical to the original in-Pyret definition. Since redefinition in JS wasn't seen as useful, I'm keeping it under wraps for now.

ds26gte commented 1 month ago

https://ds26gte.github.io/notes/mulreg.html

contains the proof that the single-var matrix calculations using the multiple-regression algorithm matches exactly the result from the original Pyret implementation.

Typeset version of the proof (using Typst):

https://ds26gte.github.io/notes/mulreg.pdf

blerner commented 3 weeks ago

@ds26gte , is it mathematically correct to say that the multiple-regression you implemented is https://en.wikipedia.org/wiki/Ordinary_least_squares#Matrix/vector_formulation, and you're trying to solve $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ by computing $\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{y}$, which minimizes the error $S(\boldsymbol{\beta}) = \lVert\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\rVert^{2}$?

If so, is it further mathematically correct to say that you're implementing semantically the same computation as https://en.wikipedia.org/wiki/QR_decomposition#Using_for_solution_to_linear_inverse_problems for the over-determined case? That is, find a solution $\hat{\mathbf{x}}$ to the problem $A\mathbf{x} = \mathbf{b}$ which minimizes the norm $\lVert A\hat{\mathbf{x}} - \mathbf{b}\rVert$? (There's a square in the first that isn't in the second, but minimizing the norm should minimize the square of the norm too...)
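For comparison, the recipe in that link (for a full-column-rank $A$ with the thin decomposition $A = QR$) solves a triangular system instead of forming the normal equations:

$$A = QR, \qquad R\,\hat{\mathbf{x}} = Q^{T}\mathbf{b} \quad\Longleftrightarrow\quad \hat{\mathbf{x}} = R^{-1}Q^{T}\mathbf{b}.$$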

I ask because the tentative matrix library I'm finishing includes QR-decomposition and a method that purports to solve least-squares, and while I don't entirely understand the math in it, if it's equivalent to the work you've got here (while also being more general), that'll be good to know...

ds26gte commented 3 weeks ago

I can confirm I'm doing exactly what you stated in your first paragraph. (The X matrix in our case is not just rows of the given x-inputs, but tacks on a 1 in front, because we want our B to provide coefficients not just for each x but also for the y-intercept. But that doesn't affect the correctness of the matrix equation you mentioned.)
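In symbols, with $n$ readings and $k$ independent variables, that augmented matrix and coefficient vector are

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} b \\ m_{1} \\ \vdots \\ m_{k} \end{pmatrix},$$

so the first entry of $\hat{\boldsymbol{\beta}}$ is the y-intercept and the remaining entries are the per-variable slopes.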

The calculations in the second paragraph are, TBH, too complicated for me to follow. I'll take it on faith that it minimizes the norm, but I am not sure that minimizing the norm is identical to minimizing the least squares. (It could well be. If you point me to a proof, I'm convinced. I fear, however, that this is like comparing variability based on mean of abs differences with one based on root mean squares, which are not equal.) In any case, it provides a solution. The predictor function is obviously intentionally a heuristic, and thus may allow some wiggle room as to what it could be.

ds26gte commented 3 weeks ago

P.S. Also, a minor point in case it is puzzling: In your formulation you have X B = Y, whereas I have Y = B X, but all your rows are my columns and all your columns are my rows, so they're semantically identical.
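That is, the two formulations are transposes of one another: transposing $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ gives $\boldsymbol{\beta}^{T}\mathbf{X}^{T} = \mathbf{y}^{T}$, which is the $Y = BX$ form with rows and columns exchanged.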

blerner commented 3 weeks ago

Joy: according to https://en.wikipedia.org/wiki/Overdetermined_system#Approximate_solutions, your approach solves $\min_{\mathbf{x}}\lVert A\mathbf{x} - \mathbf{b}\rVert$, without the squared norm, and my approach minimizes $\lVert A\mathbf{x} - \mathbf{b}\rVert^{2}$, with the square. And according to the other two links above, it's exactly the other way around!

https://math.stackexchange.com/a/4307382 claims:

> Note that any minimizer of the 2-norm (aka the Euclidean norm or p=2 norm) is also a minimizer of the square of the 2-norm, so finding a solution to $\lVert Ax-b\rVert_2^2=(Ax-b)^T(Ax-b)$ also solves the problem, as discussed above.

which indicates that both approaches work... (But I'm not sure I'm convinced by the implication direction here, unless I'm misreading it.)

Honestly I don't recall enough of linear algebra to know the difference, so if one of y'all knows a math teacher who does, or who has a textbook on it (more reliable than wiki or stackexchange articles!), that would be really helpful...
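As for the implication direction: norms are non-negative and $t \mapsto t^{2}$ is strictly increasing on $[0, \infty)$, so a vector minimizes $\lVert A\mathbf{x} - \mathbf{b}\rVert$ exactly when it minimizes $\lVert A\mathbf{x} - \mathbf{b}\rVert^{2}$:

$$\arg\min_{\mathbf{x}} \lVert A\mathbf{x} - \mathbf{b}\rVert \;=\; \arg\min_{\mathbf{x}} \lVert A\mathbf{x} - \mathbf{b}\rVert^{2}.$$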

schanzer commented 3 weeks ago

Speaking of Joy - I would email her. She's taught Algebra 2, AP Stats, and (I believe) calculus or precalculus.