Debugging: detecting `NaN`'s

bartvanerp commented 1 year ago

A situation can occur in which the inference function completes successfully, but its results contain only NaNs. It is then impossible to trace back the origin of the very first NaN without performing a lot of manual work. This limits the ability to debug the code and to prevent these NaNs in the first place.

It should become possible to throw an error once the very first NaN is encountered, such that the code can be improved to prevent this from happening. Addons might be a suitable solution here.
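As a rough illustration of the fail-fast behaviour requested here, a check like the following could be applied to intermediate results. `check_nan` is a hypothetical helper written for this sketch, not an RxInfer function or addon.

```julia
# Hypothetical helper (not part of RxInfer): stop at the first NaN instead of
# letting it propagate silently through later computations.
function check_nan(label::AbstractString, x)
    if any(isnan, x)
        error("NaN detected in $label: $x")
    end
    return x
end

check_nan("posterior mean", [0.5, 1.5])      # passes and returns its input
# check_nan("precision", [1.0, 0.0 / 0.0])   # would throw, since 0.0 / 0.0 is NaN
```

Returning the input lets the check be wrapped around an existing expression without restructuring the surrounding code.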

ashton314 commented 1 year ago

Do you have an example where this might happen? If I have an example, I might be able to help with this. I'm researching some tools that make this kind of thing easier.

bartvanerp commented 1 year ago

Hi @ashton314! Thanks for thinking along. Unfortunately the code that is experiencing the issue is proprietary, so I cannot share this. And as I don't know what is causing the issue, I cannot create a minimal working example. Once I find the issue, I will share a minimal working example with you.

We already have an idea on how to create a detection mechanism for NaNs in RxInfer, but I am curious about the tools that you are developing. Could you perhaps elaborate on this, or share a link to this tool?

ashton314 commented 1 year ago

Yeah! Over at the University of Utah we're developing FloatTracker as part of some tooling to make numerical computing better. This is alpha research software, so here there be dragons. ;) You are welcome to use it, on the condition that you tell us lots about your experience with using it! We've been working on uncovering bugs in various libraries—we really want to know how it performs.

Here is an example of how to use it. I'd be happy to help you with getting it set up.

The short of it is this: FloatTracker provides a TrackedFloat type wrapper for various precisions: TrackedFloat16, TrackedFloat32, and TrackedFloat64.

You either configure your library to use a TrackedFloat type (like you see with the ShallowWaters.jl example linked above) or you wrap your inputs with TrackedFloat, and that type should percolate to most/all areas of interest in the code.
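To make the wrapper idea concrete, here is a toy sketch of a float wrapper that records when an operation generates a NaN. This is not FloatTracker's implementation; `WatchedFloat`, `NAN_EVENTS`, and `watched` are names invented for this illustration, and only `-` and `/` are covered.

```julia
# Toy stand-in for a tracked float; NOT FloatTracker's actual implementation.
struct WatchedFloat <: AbstractFloat
    val::Float64
end

const NAN_EVENTS = String[]   # hypothetical in-memory event log

# Wrap an arithmetic result and record the case where a NaN is *generated*:
# the result is NaN although none of the inputs were.
function watched(op::Symbol, result::Float64, inputs::Float64...)
    if isnan(result) && !any(isnan, inputs)
        push!(NAN_EVENTS, "NaN generated by $op with inputs $inputs")
    end
    return WatchedFloat(result)
end

Base.:-(a::WatchedFloat, b::WatchedFloat) = watched(:-, a.val - b.val, a.val, b.val)
Base.:/(a::WatchedFloat, b::WatchedFloat) = watched(:/, a.val / b.val, a.val, b.val)

x = WatchedFloat(Inf)
y = x - x   # Inf - Inf generates a NaN, which is recorded
z = y / x   # the NaN only propagates here, so nothing new is recorded
```

Separating generation from propagation is what keeps the log focused on the operation that first went wrong.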

Set up your logging to tell you when NaN generation is encountered:

set_logger(filename="whatever", buffersize=1000, cstg=true, cstgArgs=true, cstgLineNum=true)
set_exclude_stacktrace([:prop])  # You can remove this if you want to see NaN propagation—it's expensive though

...

write_out_logs()           # flush

The logs should have some interesting data for you now. :) We're working on some visualization tools as well; see CSTG if you're feeling brave.

Please remember that this is pretty experimental software, and please tell us what your experience with it is like! We need the examples desperately.

ashton314 commented 1 year ago

If you have some benchmarks or something like a MWE (doesn't have to produce NaNs, just have the shape of the problem you're solving) we might be able to help point you in the right direction. Please do still try out FloatTracker on your specific problem and let us know how that goes.

Ping: @bennn

bvdmitri commented 1 year ago

@ashton314 nice work! We should definitely give it a try. We usually get some NaNs from matrix-inversion instabilities, and they are quite hard to track down in a long-running process.

bartvanerp commented 1 year ago

Hi @ashton314, that is actually very neat! I was just playing around with your tool and came across some things which I was struggling with. Some are purely based on my own stupidity, but might be worth taking into consideration for user-friendliness. I will add more comments along the way, but for now I had the following points:

- Would it also work for Int types? In our examples we are not always dealing with Floats. I understand that you might be a bit biased by the name of the package ;).
- Now every Float has to be converted manually, where you need to be careful about broadcasting (e.g. TrackedFloat64.([1.0, 1.0])). Would it be possible to simplify the tracking (e.g. track([1.0, 1.0])), where the code itself figures out what bit-size/type it should be? This is especially useful for nested vectors, something that we deal with once in a while. Now it is very difficult to convert those.
- Perhaps you can implement a macro that simplifies the use of your tool for users, e.g. @track foo(...), that automatically converts the arguments of the function to TrackedFloats.
- Following up on the above, perhaps you could extend your functionality to finding Infs.
- When running the code it seems to be failing rather quickly. In all our examples we have so-called datavars in our model specification language, of which we currently have to specify the datatype explicitly (although we intend to drop this in the future). Usually we just pick Float64 here, but TrackedFloat64 is not a subtype and therefore it fails. All functions that are typed with FloatN seem to fail, e.g.
```
foo(x::Float64) = 1
```
```
foo(TrackedFloat64(1.0)) #fails
```

bvdmitri commented 1 year ago

@bartvanerp I suppose the TrackedFloat64 works similarly to Dual numbers from ForwardDiff and ForwardDiff will also fail to differentiate the foo(x::Float64) because it is not possible to subtype from Float64 (only from Real). The last problem with datavar is really something that we probably should fix on our side. Perhaps you can try to create datavar(Real) instead as a workaround? Or perhaps datavar(Union{Float64, TrackedFloat64})?
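The dispatch problem can be reproduced without FloatTracker. In this sketch `WrappedFloat` is a hypothetical stand-in for a wrapper type such as TrackedFloat64, and the function names are made up for the example.

```julia
# Hypothetical stand-in for a wrapper float type such as TrackedFloat64.
struct WrappedFloat <: AbstractFloat
    val::Float64
end

strict(x::Float64) = 1          # accepts only the concrete type Float64
generic(x::AbstractFloat) = 1   # accepts any AbstractFloat subtype
loose(x::Real) = 1              # accepts any Real, including wrapper types

w = WrappedFloat(1.0)
generic(w)    # dispatches fine
loose(w)      # dispatches fine
# strict(w)   # would throw a MethodError: no method matching strict(::WrappedFloat)
```

Annotating with an abstract type (or leaving the argument untyped) keeps a method usable with wrapper types like these.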

bvdmitri commented 1 year ago

IMO if we can differentiate our code with ForwardDiff (which we usually can do in many situations) then I would expect TrackedFloats to work as well. But datavars could be an extra issue indeed.

bartvanerp commented 1 year ago

@bvdmitri you are right. The disclaimer about my own stupidity was therefore appropriate. I was able to change it to datavar(TrackedFloat64) and it now runs. I couldn't get it running with datavar(Vector{<:Union{Float64, TrackedFloat64}}), but this is likely an issue on our end and will be obsolete soon anyway.

@ashton314 @bennn Great package! I already found the location where a NaN was produced. I did have an issue with the logger, as it just keeps on accumulating and never stops. In our example we do not know when the NaN occurs, so we can't stop the computations directly afterwards. As a result everything after the first NaN got logged, making it very slow and requiring me to restart Julia. It would be a very nice option if we could log only the first K occurrences of NaNs. In our case we are also interested in the actual values (or distributions) causing the NaN, so that we get more insight into what is causing the issue. It would be great if these could also be logged (perhaps defaulted to false).
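A minimal sketch of what "log only the first K occurrences" could look like. `BoundedLog` and `record!` are hypothetical names for this illustration and are not part of FloatTracker's API; logging the input values is opt-in through the `inputs` keyword.

```julia
# Hypothetical bounded logger: keeps the first `limit` events and only counts the rest.
mutable struct BoundedLog
    limit::Int
    events::Vector{String}
    dropped::Int
end
BoundedLog(limit::Integer) = BoundedLog(limit, String[], 0)

function record!(buf::BoundedLog, msg::AbstractString; inputs=nothing)
    if length(buf.events) < buf.limit
        # Including the offending input values is optional.
        entry = inputs === nothing ? String(msg) : "$msg (inputs: $inputs)"
        push!(buf.events, entry)
    else
        buf.dropped += 1
    end
    return buf
end

nanlog = BoundedLog(2)
for i in 1:5
    record!(nanlog, "NaN #$i"; inputs=(Inf, Inf))
end
# nanlog.events now holds the first two entries; nanlog.dropped counts the other three.
```

Counting dropped events rather than discarding them silently still shows how often the problem recurred.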

bartvanerp commented 1 year ago

Also truncating the stack trace would be a very nice option. Our toolbox is based on Rocket.jl, a reactive programming paradigm, which can create huge stack traces. The first log of a NaN was by itself already 1700 lines ;)
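For illustration, truncation can be done on the vector returned by Julia's built-in `stacktrace()`. `short_trace` and `deep` are hypothetical names for this sketch; how FloatTracker captures its traces internally is not shown here.

```julia
# Hypothetical helper: keep only the first `maxframes` frames of the current
# stack trace (stacktrace() lists the innermost frames first).
function short_trace(maxframes::Integer)
    frames = stacktrace()
    return frames[1:min(maxframes, length(frames))]
end

# Build an artificially deep call stack to mimic a long reactive call chain.
deep(n) = n == 0 ? short_trace(5) : deep(n - 1)

trace = deep(50)
length(trace)   # at most 5, regardless of how deep the call chain was
```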

ashton314 commented 1 year ago

I just noticed your latest comments on this thread—I will reply to that soon. I'm so glad it's working alright for you!!

↓ Original comment below ↓

> that is actually very neat!

😄 I'm glad you think so! I can't take credit for the original idea—I'm taking over for a master's student who just graduated.

> take into consideration for user-friendliness

Suggestions welcome. :) There's a long way to go to get this "user-friendly". Moreover, there aren't a lot of us working on this (mostly just me, actually) so it will take some time. Thank you for trying it out!

Let's see if I can answer some questions here.

> Would it also work for Int types?

I think I see why you might need this. But, is there a reason you couldn't work with the float value? Are you using really large integers that aren't representable exactly with floating-point? Efficiency issues?

Again, I can see why that would be nice—but unless it's really pressing this would probably be a lower-priority issue for us until we fry our bigger fish.

> Now every Float has to be converted manually

You shouldn't have to—it's a "sticky" type, meaning if you do something like:

foo = 42.0                  # This is a Float64
bar = TrackedFloat64(12.0)

typeof(foo + bar)           # ⇒ returns "TrackedFloat64"

So if you wanted, you could start by just wrapping some of your inputs in TrackedFloat types, and then see if you get some results. If not, then you can wrap some more places and then move on.
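One way such "stickiness" can arise in Julia is through promotion rules. The sketch below uses a hypothetical `StickyFloat` type and is not taken from FloatTracker's source; it only shows the general mechanism.

```julia
# Hypothetical wrapper type, used only to illustrate "stickiness".
struct StickyFloat <: AbstractFloat
    val::Float64
end

# Mixing a StickyFloat with a Float64 promotes both operands to StickyFloat.
Base.promote_rule(::Type{StickyFloat}, ::Type{Float64}) = StickyFloat

# Arithmetic on two wrapped values re-wraps the result.
Base.:+(a::StickyFloat, b::StickyFloat) = StickyFloat(a.val + b.val)

foo = 42.0                # a plain Float64
bar = StickyFloat(12.0)

typeof(foo + bar)         # StickyFloat: the wrapper survives the mixed addition
```

Because Julia's generic `+` for mixed numeric types promotes its arguments first, a single promotion rule is enough for the wrapper to spread through mixed-type arithmetic.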

That said, making some convenience wrappers to handle common data types shouldn't be too hard. That's a good idea.

> extend functionality to finding Infs

Yes, that's somewhere in the roadmap. (Not that it's written down, but that's something we've thought about.) I don't know if we'll get around to this one soon… we've got some ideas on how to take bigger chunks out of the general problem here, and Infs and NaNs would all get handled in some uniform way. We'll see though. It's good to know that that's a pain point for you.

> datavars

You might want to make a little wrapper to accommodate that. E.g.:


foo(x::TrackedFloat64) = TrackedFloat64(foo(x.val))

Some helpers for that might be good for us to make…

The issue with the subtyping is that Julia only allows you to subtype abstract types, as I think @bvdmitri mentioned, so TrackedFloatN is a subtype of AbstractFloat rather than of Float64.

ashton314 commented 1 year ago

> @bvdmitri you are right. … I was able to change it to datavar(TrackedFloat64) and it now runs. … this is likely an issue on our end and will be obsolete soon anyway.

Glad you found a temporary fix for that!

> I already found the location where a NaN was produced.

Woohoo!

> I did have an issue with the logger, as it just keeps on accumulating and never stops.

Does it loop forever? Or just get a whole lot slower? We're aware of the performance overhead—we're working on that, but for right now it's pretty massive.

> It would be a very nice option if we could log only the first K occurrences of NaNs. In our case we are also interested in the actual values (or distributions) causing the NaN, so that we get more insight into what is causing the issue. It would be great if these could also be logged (perhaps defaulted to false).

Solid suggestion. Thank you!

bennn commented 1 year ago

@bartvanerp let us know if that first NaN log turns out to be useful, or if it could use more context info beyond the stack trace.

(We'll limit trace length by default soon.)

ashton314 commented 1 year ago

Just a heads up: I've made some substantial changes to the API for FloatTracker. We're now using semantic versioning. As of v0.1.0, there are new ways to configure FloatTracker, as outlined in our Changelog. I'm happy to answer any questions.

While we don't have any new features, this refactor will make adding things (e.g. limit trace length) a lot easier to manage. :) Thanks again for all your feedback!

bvdmitri commented 1 year ago

@ashton314 We are certainly looking into your package. Looks very cool. As for this issue, @bartvanerp implemented an addon in a PR. The issue was closed automatically as soon as I merged the PR.

ReactiveBayes / RxInfer.jl

Debugging: detecting `NaN`'s #116