Thank you very much, Staszek!
CI fails. In most cases, that's because your formula for `tanh`, while beautiful (no doubt, though I can't tell if it's better behaved numerically or if it's faster or slower and on what architectures), gives subtly different results than the old one. In some cases the whole computation is not numerically stable with respect to the `tanh` computation and then the difference becomes huge.
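To illustrate the kind of discrepancy (a sketch only; these are not necessarily the formulas from the PR), two algebraically equal definitions of `tanh` can already disagree in the last bits of a `Double`:

```haskell
-- Two mathematically equivalent formulations of tanh; in Double
-- arithmetic they may differ by a few ulps, which an exact test
-- comparison then reports as a failure.
tanhViaExp, tanhViaQuotient :: Double -> Double
tanhViaExp x = (e - recip e) / (e + recip e)  -- sinh x / cosh x, expanded
  where e = exp x
tanhViaQuotient x = (exp (2 * x) - 1) / (exp (2 * x) + 1)

main :: IO ()
main = mapM_ (\x -> print (tanhViaExp x - tanhViaQuotient x)) [0.1, 1, 5]
```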
However, not all failures are clear, so let's try to understand them before deciding on a fix. A couple of failures are probably due to my sparse matrix experiment not coping well with matrices of undetermined size (such as your `(recip 2)`). It's possible that changing the general multiplication by `(recip 2)` to scaling by a constant (`dScale`) would lessen the strain.
The last and embarrassing case is where a test says only "wrong result". It would be helpful to make it print some details.
> CI fails. In most cases, that's because your formula for `tanh`, while beautiful (no doubt, though I can't tell if it's better behaved numerically or if it's faster or slower and on what architectures), gives subtly different results than the old one. In some cases the whole computation is not numerically stable with respect to the `tanh` computation and then the difference becomes huge.
Right, and it also contains division (`/`). Is this the reason you're assuming your original formula is more accurate than mine?
> However, not all failures are clear, so let's try to understand them before deciding on a fix. A couple of failures are probably due to my sparse matrix experiment not coping well with matrices of undetermined size (such as your `(recip 2)`). It's possible that changing the general multiplication by `(recip 2)` to scaling by a constant (`dScale`) would lessen the strain.
Not that I understand, but I just eliminated `(recip 2)` from `atanh`, `sinh` and `cosh`, instead applying those functions to the primal component explicitly. Does this make sense? Why? :)
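Concretely, the new definitions look roughly like this (a sketch in the `D`/`dScale` notation that appears later in this thread; the exact code may differ):

```haskell
-- Instead of building these functions from exp plus a multiplication
-- by (recip 2), apply the textbook derivative to the primal component;
-- the dual component then needs only a scaling.
sinh  (D u u') = D (sinh u)  (dScale (cosh u) u')
cosh  (D u u') = D (cosh u)  (dScale (sinh u) u')
atanh (D u u') = D (atanh u) (dScale (recip (1 - u * u)) u')
```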
> The last and embarrassing case is where a test says only "wrong result". It would be helpful to make it print some details.
Fixed the test to print out more info; on my machine it now produces:
```
Running 1 test suites...
Test suite shortTestForCI: RUNNING...
Short tests for CI
MNIST RNN short tests
randomLL2 140 0 3 5 54282 5.0e-2: FAIL (2.25s)
test/common/TestMnistRNN.hs:173:
wrong result: 30.061871495725264 is expected to be a member of [30.061871495723956,30.06187089990927]
```
which, again, is a subtle numerical error. But in which function and which number is closer to the truth?...
BTW why do those tests here on GitHub take 40 minutes to complete? Can we improve that?
Not sure what to do with `tan`: all the forms I could find contain the division somewhere. For example, the first derivative: `1/cos^2`.
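So the dual-number rule for `tan` would presumably look like this (a sketch in the same `D`/`dScale` style as the other examples in this thread):

```haskell
-- d/dx tan x = 1 / cos^2 x, so the dual component scales u' by that
-- derivative; the division seems unavoidable here.
tan (D u u') = let c = cos u in D (tan u) (dScale (recip (c * c)) u')
```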
> Right, and it also contains division (`/`). Is this the reason you're assuming your original formula is more accurate than mine?
I'm not assuming that. However, if I knew one of them was more accurate (or faster) and had to guess which, I'd assume mine, just because it's repeated on ML blogs, while yours is derived from calculus. Ultimately, checking the quality of our pre-defined functions is a separate task, so I wouldn't worry and would settle on something uniform and cheap (e.g., not requiring a change of tests). Probably, avoiding division by zero is a good idea, too, but not at too high a complexity cost (I guess `tan` is fine; if users don't like it, let them propose a change). I'm learning this as I go, so I'm afraid I don't have good answers yet.
> Not that I understand, but I just eliminated `(recip 2)` from `atanh`, `sinh` and `cosh`, instead applying those functions to the primal component explicitly. Does this make sense? Why? :)
Actually, what I had in mind was to change

```haskell
cosh z = (recip 2) * ((exp z) + (exp (-z)))
```

to

```haskell
cosh z = scale (recip 2) ((exp z) + (exp (-z)))
```

and this looks better to me, because it underlines that we are not using a bilinear function `*` that has a complex derivative, but a scaling by a constant that has a simple derivative. But your new version

```haskell
cosh (D u u') = D (cosh u) (dScale (sinh u) u')
```

is even clearer to me, because it tells me what is the primal component and what is the dual component of the dual number (the derivative that, indeed, is a small expression tree). Please ask more if what I said requires further clarification.
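To spell out the difference, here is a self-contained sketch (the real horde-ad types are richer; all names below are illustrative, not the actual API):

```haskell
-- A minimal dual number: a primal value paired with a derivative
-- expression tree.
data Dual a = D a (Delta a)

data Delta a
  = DZero
  | DScale a (Delta a)        -- scaling a derivative by a constant
  | DAdd (Delta a) (Delta a)  -- sum of two derivatives

-- Bilinear (*) needs the derivatives of both factors (product rule),
-- producing two scalings plus an addition:
mulD :: Num a => Dual a -> Dual a -> Dual a
mulD (D u u') (D v v') = D (u * v) (DAdd (DScale v u') (DScale u v'))

-- Scaling by a constant touches only one derivative, yielding a
-- single, smaller expression tree:
scaleD :: Num a => a -> Dual a -> Dual a
scaleD c (D u u') = D (c * u) (DScale c u')
```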
> Fixed the test to print out more info; on my machine it now produces:
>
> ```
> Running 1 test suites...
> Test suite shortTestForCI: RUNNING...
> Short tests for CI
> MNIST RNN short tests
> randomLL2 140 0 3 5 54282 5.0e-2: FAIL (2.25s)
> test/common/TestMnistRNN.hs:173:
> wrong result: 30.061871495725264 is expected to be a member of [30.061871495723956,30.06187089990927]
> ```
>
> which, again, is a subtle numerical error. But in which function and which number is closer to the truth?...
Thank you for the fix. What is truth? I'd recommend changing the test to expect one of two results: the result you get at home and the result CI obtains with the same test (if it differs). Ignore the old results; they are close enough, so we probably haven't made an error.
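A sketch of what that could look like with tasty-hunit (`testCase` and `assertBool` are the standard tasty-hunit functions; the numbers are illustrative, copied from the output above):

```haskell
import Test.Tasty.HUnit (testCase, assertBool)

-- Accept either of two known-good results (e.g. the local one and the
-- one CI computes), instead of a single exact Double.
ciOrLocalTest = testCase "randomLL2 140 0 3 5 54282 5.0e-2" $ do
  let res = 30.061871495725264 :: Double  -- would come from the actual run
      expected = [30.061871495723956, 30.06187089990927]
  assertBool ("wrong result: " ++ show res
              ++ " is expected to be a member of " ++ show expected)
             (res `elem` expected)
```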
> BTW why do those tests here on GitHub take 40 minutes to complete? Can we improve that?
How long do they take on your machine? We can surely solve #14 and #20, but they are really hard (and #14 is almost finished).
BTW, I've just merged a big PR to master. Apologies for the conflicts and for, probably, some extra boilerplate TODOs to do in DualNumber.hs.
The code on master is now hlint-clean except for a dozen cases. You can use hlint to tell you which parens are unnecessary. But feel free to ignore or to silence via `.hlint.yaml` any hints that are not useful.
Thank you for the boilerplate. I guess that's all the relevant TODOs? Could you rebase your branch instead of merging, to keep the git history clean? We will then merge this PR with `--no-ff` to preserve the PR number in the git history.
> Thank you for the boilerplate. I guess that's all the relevant TODOs?
It looks so.
> Could you rebase your branch instead of merging
Oops, sorry, I just merged, then read this. Shall I make a new branch and replay my commits there?
> What is truth?
The real `tanh` result which should be there if our floating point arithmetic were error-free. BTW, did you consider symbolic calculations and/or using your own arithmetic with infinite precision?
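For instance, computable reals would be one cheap way to get "more truth" for spot checks. A sketch, assuming the `numbers` package (not currently a dependency):

```haskell
import Data.Number.CReal (CReal, showCReal)

-- CReal evaluates lazily to the requested number of decimal digits,
-- so this prints tanh 0.5 to 50 digits, for comparison with what the
-- Double-based formulas produce.
main :: IO ()
main = putStrLn (showCReal 50 (tanh (0.5 :: CReal)))
```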
> I'd recommend changing the test to expect one of two results: the result you get at home and the result CI obtains with the same test (if it differs).
No longer relevant since I rolled back my `tanh` implementation (the test is now green).
> > BTW why do those tests here on GitHub take 40 minutes to complete? Can we improve that?
>
> How long do they take on your machine?
I had to stop them, but maybe I will run them again on some cold winter day.
> Oops, sorry, I just merged, then read this. Shall I make a new branch and replay my commits there?
Oh, no, please don't bother, that's not a big deal. Also, if you rebase now, the merge commits are probably going to vanish (though you might have to solve a conflict or two manually, as is common with rebasing). Then just `git push -f` to this branch.
> BTW, did you consider symbolic calculations and/or using your own arithmetic with infinite precision?
Not really because, as I'm slowly learning, ML is fine with a lack of precision and sometimes it's even good (it introduces a fuzz that helps break away from local minima and avoid overfitting) and is definitely not worth a big performance price to eliminate. I think ML applications even prefer floats smaller than 64 bits in order to lower peak memory. [Edit: it's only serious numerical instability that becomes a problem and for which, indeed, exact arithmetic would be a solution.]
> > I'd recommend changing the test to expect one of two results: the result you get at home and the result CI obtains with the same test (if it differs).
>
> No longer relevant since I rolled back my `tanh` implementation (the test is now green).
Great. Thanks.
> I had to stop them, but maybe I will run them again on some cold winter day.

I suggested in that ticket that running just the short test, the one that CI runs, may be a better idea than running both of them concurrently. Did this suggestion make sense?
> > Oops, sorry, I just merged, then read this. Shall I make a new branch and replay my commits there?
>
> Oh, no, please don't bother, that's not a big deal. Also, if you rebase now, the merge commits are probably going to vanish (though you might have to solve a conflict or two manually, as is common with rebasing). Then just `git push -f` to this branch.
Done.
> > BTW, did you consider symbolic calculations and/or using your own arithmetic with infinite precision?
>
> Not really because, as I'm slowly learning, ML is fine with a lack of precision and sometimes it's even good (it introduces a fuzz that helps break away from local minima and avoid overfitting) and is definitely not worth a big performance price to eliminate. I think ML applications even prefer floats smaller than 64 bits in order to lower peak memory. [Edit: it's only serious numerical instability that becomes a problem and for which, indeed, exact arithmetic would be a solution.]
I was expecting that. Now the question is: do we ever experience those serious errors?
Anything else to do in this ticket?
> Anything else to do in this ticket?
Let me check the comments...
I think only this comment remains, if it makes sense to you (please ask if anything is unclear):
> The code on master is now hlint-clean except for a dozen cases. You can use hlint to tell you which parens are unnecessary. But feel free to ignore or to silence via `.hlint.yaml` any hints that are not useful.
Fixes https://github.com/Mikolaj/horde-ad/issues/19