ekmett / ad

Automatic Differentiation
http://hackage.haskell.org/package/ad
BSD 3-Clause "New" or "Revised" License

`ReverseDouble` with `+ffi` produces different result for square root of product #104

Closed julmb closed 1 year ago

julmb commented 1 year ago

I have the following example:

{-# LANGUAGE NoMonomorphismRestriction #-}

import qualified Numeric.AD.Mode.Forward as F
import qualified Numeric.AD.Mode.Forward.Double as FD
import qualified Numeric.AD.Mode.Reverse as R
import qualified Numeric.AD.Mode.Reverse.Double as RD

main :: IO ()
main = do
    let f = sqrt . product
    let x = [0, 1] :: [Double]
    print $ F.grad f x
    print $ FD.grad f x
    print $ R.grad f x
    print $ RD.grad f x

When run, this prints:

[Infinity,NaN]
[Infinity,NaN]
[Infinity,NaN]
[Infinity,NaN]

When run with the +ffi flag, it prints:

[Infinity,NaN]
[Infinity,NaN]
[Infinity,NaN]
[Infinity,0.0]

The partial derivatives here should be:

d sqrt (x * y) / dx = y / (2 * sqrt (x * y))
d sqrt (x * y) / dy = x / (2 * sqrt (x * y))

So grad f [0, 1] should be [1 / 0, 0 / 0]. In order to be consistent with how IEEE floating point arithmetic works, the result [Infinity,NaN] is the one I would consider correct, and the value [Infinity,0.0] that the +ffi version calculates seems strange.
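
For concreteness, here is a small standalone check (plain C, independent of the package) that just plugs x = 0 and y = 1 into the formulas above:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.0, y = 1.0;
    double s = sqrt(x * y);              /* sqrt 0 = 0 */
    printf("d/dx = %f\n", y / (2 * s));  /* 1 / 0 = inf */
    printf("d/dy = %f\n", x / (2 * s));  /* 0 / 0 = nan */
    return 0;
}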

julmb commented 1 year ago

I looked into this a little today. My findings so far:

The function Numeric.AD.Internal.Reverse.Double.partials does not behave the same for the pure and the +ffi version. In my example, the pure version returns [Infinity,NaN,0.0], while the +ffi version returns [Infinity,0.0]. It is very concerning to me that these are not even the same length! It seems that the number of items returned by the pure version is calculated as part of the backPropagate function, while the +ffi version always returns pTape->variables many items.

The way these results are subsequently used in Numeric.AD.Internal.Reverse.Double.partialArrayOf pads them out with zeroes, so that the result of this function is array (0,2) [(0,Infinity),(1,NaN),(2,0.0)] for the pure version and array (0,2) [(0,Infinity),(1,0.0),(2,0.0)] for the +ffi version. It seems that the actual grad function then uses the first two items of this array.

I also looked at the tapes. The pure version produces the tape Head 5 (Unary 4 Infinity (Binary 3 1 1.0 0.0 (Unary 0 1.0 Nil))). The +ffi version produces a tape data structure that looks like this:

tape->idx 3
tape->offset 2
tape->size 4096
tape->variables 2
tape->val 0xa5b1b8
tape->lnk 0xa6b1b8
tape->prev (nil)
tape->val[0, 1] 1.000000 0.000000
tape->lnk[0, 1] 0 0
tape->val[2, 3] 1.000000 0.000000
tape->lnk[2, 3] 2 1
tape->val[4, 5] inf 0.000000
tape->lnk[4, 5] 3 0

I believe that this would roughly translate to Unary 3 Infinity (Binary 2 1 1.0 0.0 (Unary 0 1.0 Nil)). So it seems like the indexing is different here.

julmb commented 1 year ago

After some more digging, the issue seems to be in this block of code: https://github.com/ekmett/ad/blob/8155b0e2876475996cc25245a894bdbe7f80efe0/cbits/tape.c#L90-L105

The checks if (x != 0.0) and if (y != 0.0) appear to be optimizations, guarding the increments buffer[i] += v*x; and buffer[j] += v*y; from being executed if the second factor is zero, as this would imply that the product is also zero and thus the increment operation would do nothing. However, this assumption does not hold in the presence of IEEE floating point special values, as for example Infinity * 0 = NaN, and thus the increment operation can indeed have an effect.
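
To make the structure explicit, here is a rough sketch of a single backpropagation step (only the three guards and the two increments mirror the linked code; the function wrapper and parameter names are purely illustrative):

/* Rough sketch of one backpropagation step: v is the adjoint of the current
   tape entry, x and y are its stored partials, i and j its operand indices. */
static void propagate_entry_sketch(double *buffer, double v, double x, int i, double y, int j) {
    if (v == 0.0) return;               /* "if (v == 0.0) continue;" in the real loop */
    if (x != 0.0) buffer[i] += v * x;   /* skipped when x == 0, even if v is Infinity, */
    if (y != 0.0) buffer[j] += v * y;   /* so the IEEE result Infinity * 0 = NaN is lost */
}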

I suspect that the if (v == 0.0) continue; check could have a similar effect if buffer contains special values, but I have not tried to come up with an example for this yet.

I am not sure how to fix this issue, as I do not know why these optimizations were introduced or how much time they actually save. Removing all three of the conditionals makes the discrepancy go away, but there might be a performance penalty. Checking for NaN values as part of the conditionals should also work, but that seems error prone and the code would not be "obviously correct". Furthermore, that could make the check itself as expensive as the operation it intends to optimize away, so I do not know if that is a good idea.

cartazio commented 1 year ago

Maybe the condition should be v*x != 0 and v*y != 0?

That would capture the intent, not save on the multiply, but still save on writes.
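
Sketched with illustrative names (not the actual tape.c identifiers), that would look roughly like:

/* Product-based guard: the multiplies always happen, so Infinity * 0 = NaN is
   never silently dropped; only writes of an ordinary zero are skipped, which
   preserves the sparsity win. */
static void propagate_entry_product_guard(double *buffer, double v, double x, int i, double y, int j) {
    double dx = v * x;
    double dy = v * y;
    if (dx != 0.0) buffer[i] += dx;   /* NaN != 0.0 holds, so NaN is still written */
    if (dy != 0.0) buffer[j] += dy;
}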

cartazio commented 1 year ago

I see what you mean. When would v have either infinity or NaN as a value?

RyanGlScott commented 1 year ago

@sofusmortensen, as the author of the FFI code, do you have any insight into why these if (x != 0.0) / if (y != 0.0) checks are present?

julmb commented 1 year ago

I see what you mean. When would v have either infinity or NaN as a value?

This happens in the example in the original post of this issue, where buffer will contain Infinity as the partial derivative d sqrt (x * 1) / d x at x = 0.

julmb commented 1 year ago

It turns out this is even more complicated than I thought. While the pure implementation has separate constructors for unary and binary operation entries on the tape, the +ffi implementation encodes unary operations as binary operations with the second operand set to variable index 0 and a partial derivative of 0.0.

This means that the buffer entry for variable 0 can be inadvertently set to NaN by this "fake" binary entry. This was previously caught by the if (y != 0.0) check, although I am not sure this was deliberate, as the completely symmetrical if (x != 0.0) check serves no such purpose.

Thus, in the absence of the conditionals, the example works for x = [1, 0], but it does not work for x = [0, 1], giving a different (but still wrong) result of [NaN,NaN].

My intuition is that this encoding is fundamentally ambiguous and cannot be handled properly in the presence of IEEE floating point special values. Given an entry Binary 1 0 1.0 0.0, I see no way of knowing whether to treat it as a binary operation between variables 1 and 0 (which would require multiplying by 0.0 to get the correct behavior for special values) or as a unary operation on variable 1 (which would require ignoring the second half of the entry entirely to get the correct behavior).
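
As a small standalone illustration (plain C; only the entry Binary 1 0 1.0 0.0 mentioned above is taken from the tape, the rest is hypothetical):

#include <math.h>
#include <stdio.h>

int main(void) {
    double v = INFINITY;                /* adjoint flowing into the entry */
    double as_binary[2] = {0.0, 0.0};
    double as_unary[2]  = {0.0, 0.0};

    /* Read as a genuine binary operation between variables 1 and 0,
       both increments must happen: */
    as_binary[1] += v * 1.0;
    as_binary[0] += v * 0.0;            /* Infinity * 0 = NaN leaks into variable 0 */

    /* Read as a unary operation padded out to binary form, the second
       half of the entry must be ignored entirely: */
    as_unary[1] += v * 1.0;

    printf("binary: [%f, %f]\n", as_binary[0], as_binary[1]);   /* [nan, inf] */
    printf("unary:  [%f, %f]\n", as_unary[0],  as_unary[1]);    /* [0.000000, inf] */
    return 0;
}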

RyanGlScott commented 1 year ago

My intuition is that this encoding is fundamentally ambiguous and cannot be handled properly in the presence of IEEE floating point special values.

Alright. Do you think that this pitfall is worth advertising in the Haddocks somewhere?

julmb commented 1 year ago

Well, ideally, I think we should fix the problem so that all implementations yield the same results.

The variable indices in the C struct are of type int, so we could use -1 instead of 0 to indicate that an operand is not present (as is the case with a unary tape entry). It's not pretty, but the function tape_backPropagate is the only place that needs to take this into account, so it should be a fairly minor change. The index -1 can never conflict with a real variable, so we can then reliably check whether to incorporate a value into the calculation or not, rather than relying on the == 0.0 check.

Once that change is in place, the code could be adjusted to handle IEEE floating point special values correctly.
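
Sketched with illustrative names (not the actual tape.c identifiers), the per-entry update would then become roughly:

/* Sketch of one backpropagation step under the proposed encoding: an operand
   index of -1 marks the unused half of a unary entry. */
static void propagate_entry_sentinel(double *buffer, double v, double x, int i, double y, int j) {
    if (i >= 0) buffer[i] += v * x;   /* operand present: always multiply, so */
    if (j >= 0) buffer[j] += v * y;   /* Infinity and NaN propagate per IEEE  */
}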

If that sounds like an acceptable way of dealing with this, I could probably create a pull request in the next few days (or separate pull requests for the refactoring and the bugfix, if that would be preferable).

RyanGlScott commented 1 year ago

OK. I trust your judgment on this, so I'd welcome a PR that changes the FFI implementation to use -1 as placeholder values for missing operands. (Also, feel free to submit a single PR that does the refactoring and bugfix in separate commits.)

julmb commented 1 year ago

I am still working on the pull request. In the meantime, I have found another example that causes an issue. The expression RD.grad (\ [x, y] -> sqrt x * sqrt y) [0, 0] evaluates to [0.0,0.0] when using +ffi. The correct result, which is yielded by the other modes, is [NaN,NaN].

This is due to the if (v == 0.0) continue; statement skipping the loop body when v = 0 and x = inf. That is not correct: v * x would have produced NaN, so the step cannot be skipped.

cartazio commented 1 year ago

So this is another example where the guard should be the product?

cartazio commented 1 year ago

How can I help you on this? Also, this is wonderful sleuthing.

julmb commented 1 year ago

So this is another example where the guard should be the product?

Pretty much, although there are two products involved here (v*x and v*y).

After performing the previously suggested refactoring involving -1 indices, I tried removing both the if (x != 0.0) and if (y != 0.0) optimizations (A) as well as the if (v == 0.0) optimization (B). Removing both A and B resolves all the inconsistencies in the +ffi version of Reverse.Double that I have observed so far. Unfortunately, there are some performance penalties involved.

Removing optimization A had no measurable performance impact in the bench/BlackScholes.hs benchmark or in my application. Removing optimization B had no measurable performance impact in the bench/BlackScholes.hs benchmark, but it did slow down my application by a factor of almost 2 (calculating a Hessian went from 2.1s to 3.9s). It turns out that the if (v == 0.0) guard was able to skip 80% of the backpropagation steps (despite the resulting Hessian being a dense 56x56 matrix with no zero entries) and this had a significant impact on performance.

This is very unfortunate, because it means that if we want correctness, then about half the performance is lost in checking for a condition that should probably never arise in normal operation anyway (assuming that not many people would want to calculate gradients with NaN values in them).

As an aside, I suspect that this zero check could also be used to speed up the non-ffi version of Reverse.Double in a similar way, albeit with the same correctness issues in the presence of IEEE floating point special values.

How can I help you on this? Also, this is wonderful sleuthing.

Thank you! I am not sure how to proceed from here, so any feedback would be appreciated. I will do some more thinking and probably additional tests tomorrow to see if anything can be done to achieve both correctness and good performance.

cartazio commented 1 year ago

Did you measure doing the if with the equality test against the product? That way you preserve the tape sparsity. It's not the multiply that's expensive; it's writing the added zero onto the tape, I think.

Throw up a PR that does that version and we can review it. It should be sans perf regressions, though if you do some measurements that'd be appreciated.

julmb commented 1 year ago

Did you measure doing the if with the equality test against the product? That way you preserve the tape sparsity. It's not the multiply that's expensive; it's writing the added zero onto the tape, I think.

You are right, checking for zeroness of v*x and v*y does make a big difference. I thought that, since even with these checks 4 out of the 6 memory accesses would still happen, there would not be much to gain. But it looks like the reads (which are mostly sequential, although backwards) are much cheaper than the writes. There is still a factor of 1.2 performance impact, though.

I will submit a pull request so we can discuss what should be done.

cartazio commented 1 year ago

Awesome. I think you're measuring worst-case overhead rather than average-case, but either way, I look forward to the PR.