Closed bartvanerp closed 1 year ago
Do you have an example where this might happen? If I have an example, I might be able to help with this. I'm researching some tools that make this kind of thing easier.
Hi @ashton314! Thanks for thinking along. Unfortunately the code that is experiencing the issue is proprietary, so I cannot share this. And as I don't know what is causing the issue, I cannot create a minimal working example. Once I find the issue, I will share a minimal working example with you.
We already have an idea on how to create a detection mechanism for `NaN`s in `RxInfer`, but I am curious about the tools that you are developing. Could you perhaps elaborate on this, or share a link to this tool?
Yeah! Over at the University of Utah we're developing FloatTracker as part of some tooling to make numerical computing better. This is alpha research software, so here there be dragons. ;) You are welcome to use it, on the condition that you tell us lots about your experience with using it! We've been working on uncovering bugs in various libraries—we really want to know how it performs.
Here is an example of how to use it. I'd be happy to help you with getting it set up.
The short of it is this: FloatTracker provides a `TrackedFloat` type wrapper for various precisions: `TrackedFloat16`, `TrackedFloat32`, and `TrackedFloat64`.
You either configure your library to use a `TrackedFloat` type (like you see with the ShallowWaters.jl example linked above) or you wrap your inputs with `TrackedFloat`, and that type should percolate to most/all areas of interest in the code.
Set up your logging to tell you when NaN generation is encountered:
set_logger(filename="whatever", buffersize=1000, cstg=true, cstgArgs=true, cstgLineNum=true)
set_exclude_stacktrace([:prop]) # You can remove this if you want to see NaN propagation—it's expensive though
...
write_out_logs() # flush
The logs should have some interesting data for you now. :) We're working on some visualization tools as well; see CSTG if you're feeling brave.
Please remember that this is pretty experimental software, and please tell us what your experience with it is like! We need the examples desperately.
If you have some benchmarks or something like an MWE (doesn't have to produce NaNs, just have the shape of the problem you're solving) we might be able to help point you in the right direction. Please do still try out FloatTracker on your specific problem and let us know how that goes.
Ping: @bennn
@ashton314 nice work! We should definitely give it a try. We usually get some NaNs from matrix-inversion instabilities, and it's quite hard to track them in a long-running process.
Hi @ashton314, that is actually very neat! I was just playing around with your tool and came across some things which I was struggling with. Some are purely based on my own stupidity, but might be worth taking into consideration for user-friendliness. I will add more comments along the way, but for now I had the following points:
Would it also work for `Int` types? In our examples we are not always dealing with `Float`s. I understand that you might be a bit biased by the name of the package ;).
Now every `Float` has to be converted manually, where you need to be careful about broadcasting (e.g. `TrackedFloat64.([1.0, 1.0])`). Would it be possible to simplify the tracking (e.g. `track([1.0, 1.0])`), where the code itself figures out what bit-size/type it should be? This is especially useful for nested vectors, something that we deal with once in a while. Now it is very difficult to convert those.
Perhaps you can implement a macro that simplifies the use of your tool for users, e.g. `@track foo(...)`, which automatically converts the arguments of the function to `TrackedFloat`s.
Following up on the above, perhaps you could extend your functionality to finding `Inf`s.
When running the code it seems to fail rather quickly. In all our examples we have so-called `datavars` in our model specification language, for which we currently have to specify the datatype explicitly (although we intend to drop this in the future). Usually we just pick `Float64` here, but `TrackedFloat64` is not a subtype and therefore it fails. All functions that are typed with `FloatN` seem to fail, e.g.
foo(x::Float64) = 1
foo(TrackedFloat64(1.0)) #fails
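The `track([1.0, 1.0])` helper suggested in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Tracked` is a hypothetical stand-in for FloatTracker's `TrackedFloat16`/`TrackedFloat32`/`TrackedFloat64` types, and `track` is not an existing FloatTracker function.

```julia
# Hypothetical stand-in for a tracked wrapper; the type parameter keeps the
# original precision, which is the "figure out the bit-size" part.
struct Tracked{T<:AbstractFloat}
    val::T
end

# Wrap a single float, preserving its precision.
track(x::AbstractFloat) = Tracked(x)
# Recurse through arrays so that nested vectors are handled as well.
track(xs::AbstractArray) = map(track, xs)
# Leave everything else (integers, strings, ...) untouched.
track(x) = x

nested = [[1.0, 2.0], [3.0]]
tracked = track(nested)       # nested vectors of Tracked{Float64}
single = track(1.0f0)         # Tracked{Float32}
```

Dispatching on `AbstractArray` and recursing avoids having to spell out the broadcast by hand for each nesting level.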
@bartvanerp I suppose `TrackedFloat64` works similarly to `Dual` numbers from `ForwardDiff`, and `ForwardDiff` will also fail to differentiate `foo(x::Float64)` because it is not possible to subtype from `Float64` (only from abstract types such as `Real`). The last problem with `datavar` is really something that we probably should fix on our side. Perhaps you can try to create `datavar(Real)` instead as a workaround? Or perhaps `datavar(Union{Float64, TrackedFloat64})`?
IMO if we can differentiate our code with `ForwardDiff` (which we usually can do in many situations) then I would expect `TrackedFloat`s to work as well. But `datavars` could be an extra issue indeed.
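To make the dispatch point above concrete, here is a minimal sketch. The names `Wrapped`, `strict`, and `loose` are hypothetical and only for illustration; `Wrapped` stands in for a type like `TrackedFloat64` or `Dual`.

```julia
# Hypothetical wrapper type. Concrete types such as Float64 cannot be
# subtyped, so the wrapper has to subtype an abstract type instead.
struct Wrapped <: AbstractFloat
    val::Float64
end

strict(x::Float64) = 1        # only accepts the concrete Float64 type
loose(x::AbstractFloat) = 1   # accepts any AbstractFloat subtype

loose_result = loose(Wrapped(1.0))   # dispatches fine

# Calling the strictly typed method with the wrapper raises a MethodError.
strict_fails = try
    strict(Wrapped(1.0))
    false
catch err
    err isa MethodError
end
```

Loosening a signature from `Float64` to `AbstractFloat` or `Real` is usually enough for wrapper types to flow through.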
@bvdmitri you are right. The disclaimer about my own stupidity was therefore appropriate. I was able to change it to `datavar(TrackedFloat64)` and it now runs. I couldn't get it running with `datavar(Vector{<:Union{Float64, TrackedFloat64}})`, but this is likely an issue on our end and will be obsolete soon anyway.
@ashton314 @bennn Great package! I already found the location where a `NaN` was produced. I did have an issue with the logger, as it just keeps on accumulating and never stops. In our example we do not know when the `NaN` takes place, so we can't stop the computations directly after. As a result everything after the first `NaN` got logged, making it very slow and requiring me to restart Julia. It would be a very nice option if we could only log the first K occurrences of `NaN`s.
In our case we are also interested in the actual values (or distributions) causing the `NaN`, such that we get some more insight into what is causing the issue. It would be great if these could also be logged (perhaps defaulted to false).
Also truncating the stack trace would be a very nice option. Our toolbox is based on `Rocket.jl`, a reactive programming paradigm, which can create huge stack traces. The first log of a NaN was by itself already 1700 lines ;)
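The three requests above (log only the first K occurrences, record the offending values, truncate the stack trace) could be combined along these lines. This is a hypothetical sketch in plain Julia, not FloatTracker's API; `NaNEvent`, `BoundedLog`, `record!`, and `checked_div` are made-up names.

```julia
# One logged occurrence: the input values and a truncated stack trace.
struct NaNEvent
    inputs::Vector{Float64}
    trace::Vector{Base.StackTraces.StackFrame}
end

# Keeps at most `limit` events, each with at most `depth` stack frames.
struct BoundedLog
    limit::Int
    depth::Int
    events::Vector{NaNEvent}
end
BoundedLog(limit, depth) = BoundedLog(limit, depth, NaNEvent[])

function record!(nanlog::BoundedLog, inputs...)
    # Once the cap is reached, skip the (expensive) stack trace capture.
    length(nanlog.events) >= nanlog.limit && return false
    trace = stacktrace()
    push!(nanlog.events,
          NaNEvent(collect(Float64, inputs), trace[1:min(end, nanlog.depth)]))
    return true
end

# Example operation that reports NaN results together with its inputs.
function checked_div(nanlog::BoundedLog, a, b)
    result = a / b
    isnan(result) && record!(nanlog, a, b)
    return result
end

nanlog = BoundedLog(2, 5)
foreach(_ -> checked_div(nanlog, 0.0, 0.0), 1:10)   # 10 NaNs, only 2 logged
```

Checking the cap before calling `stacktrace()` matters here, since capturing traces is the costly part.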
I just noticed your latest comments on this thread—I will reply to that soon. I'm so glad it's working alright for you!!
↓ Original comment below ↓
> that is actually very neat!
😄 I'm glad you think so! I can't take credit for the original idea—I'm taking over for a master's student who just graduated.
> take into consideration for user-friendliness
Suggestions welcome. :) There's a long way to go to get this "user-friendly". Moreover, there aren't a lot of us working on this (mostly just me, actually) so it will take some time. Thank you for trying it out!
Let's see if I can answer some questions here.
> Would it also work for `Int` types?
I think I see why you might need this. But, is there a reason you couldn't work with the float value? Are you using really large integers that aren't representable exactly with floating-point? Efficiency issues?
Again, I can see why that would be nice—but unless it's really pressing this would probably be a lower-priority issue for us until we fry our bigger fish.
> Now every Float has to be converted manually
You shouldn't have to—it's a "sticky" type, meaning if you do something like:
foo = 42.0 # This is a Float64
bar = TrackedFloat64(12.0)
typeof(foo + bar) # ⇒ returns "TrackedFloat64"
So if you wanted, you could start by just wrapping some of your inputs in `TrackedFloat` types, and then see if you get some results. If not, then you can wrap some more places and then move on.
That said, making some convenience wrappers to handle common data types shouldn't be too hard. That's a good idea.
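For readers wondering how a "sticky" type like this can work, the sketch below shows one way to get the behavior with Julia's promotion machinery. It assumes nothing about FloatTracker's internals; `Sticky` is a hypothetical type used only for illustration.

```julia
# Hypothetical wrapper type, standing in for something like TrackedFloat64.
struct Sticky <: AbstractFloat
    val::Float64
end

# Mixing Sticky with Float64 promotes both operands to Sticky.
Base.promote_rule(::Type{Sticky}, ::Type{Float64}) = Sticky

# Arithmetic on two Sticky values stays in the wrapper type.
Base.:+(a::Sticky, b::Sticky) = Sticky(a.val + b.val)

foo = 42.0              # a plain Float64
bar = Sticky(12.0)
result = foo + bar      # falls back to promotion, then the Sticky + Sticky method
```

Because the generic `+(::Number, ::Number)` fallback promotes its arguments first, the wrapper type spreads through a computation once any input is wrapped.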
> extend functionality to finding `Inf`s
Yes, that's somewhere in the roadmap. (Not that it's written down, but that's something we've thought about.) I don't know if we'll get around to this one soon… we've got some ideas on how to take bigger chunks out of the general problem here, and `Inf`s and `NaN`s would all get handled in some uniform way. We'll see though. It's good to know that that's a pain point for you.
> `datavars`
You might want to make a little wrapper to accommodate that. E.g.:
foo(x::TrackedFloat64) = TrackedFloat64(foo(x.val))
Some helpers for that might be good for us to make…
The issue with the subtyping is that Julia only allows you to subtype abstract types, as I think @bvdmitri mentioned, so `TrackedFloatN` is a subtype of `AbstractFloat`.
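The forwarding wrapper suggested above can be tried out without FloatTracker installed. In this sketch `Wrapped` is a hypothetical stand-in for `TrackedFloat64`, and the `val` field name is taken from the one-liner above.

```julia
# Hypothetical stand-in for TrackedFloat64.
struct Wrapped <: AbstractFloat
    val::Float64
end

# A function that is typed too strictly to accept the wrapper directly.
foo(x::Float64) = 2x

# Forwarding method: unwrap, call the Float64 method, and re-wrap the result.
foo(x::Wrapped) = Wrapped(foo(x.val))

y = foo(Wrapped(1.5))
```

Note that the body of the strictly typed `foo` still runs on plain `Float64` values, so a `NaN` created inside it would only become visible to the wrapper once the result is re-wrapped.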
> @bvdmitri you are right. … I was able to change it to `datavar(TrackedFloat64)` and it now runs. … this is likely an issue on our end and will be obsolete soon anyway.
Glad you found a temporary fix for that!
> I already found the location where a `NaN` was produced.
Woohoo!
> I did have an issue with the logger, as it just keeps on accumulating and never stops.
Does it loop forever? Or just get a whole lot slower? We're aware of the performance overhead—we're working on that, but for right now it's pretty massive.
> It would be a very nice option if we could only log the first K occurrences of `NaN`s. In our case we are also interested in the actual values (or distributions) causing the `NaN`, such that we get some more insights in what is causing the issue. It would be great if these could also be logged (perhaps defaulted to false).
Solid suggestion. Thank you!
@bartvanerp let us know if that first NaN log turns out to be useful, or if it could use more context info beyond the stack trace.
(We'll limit trace length by default soon.)
Just a heads up: I've made some substantial changes to the API for FloatTracker. We're now using semantic versioning, and as of v0.1.0 there are new ways to configure FloatTracker, as outlined in our Changelog. I'm happy to answer any questions.
While we don't have any new features, this refactor will make adding things (e.g. limit trace length) a lot easier to manage. :) Thanks again for all your feedback!
@ashton314 We are certainly looking into your package. Looks very cool. As for this issue, @bartvanerp implemented an addon in a PR. The issue was closed automatically as soon as I merged the PR.
The situation might occur when the inference function successfully completes, but its results only contain `NaN`s. Now it is impossible to trace back the origin of the very first `NaN` without performing a lot of manual work. This limits the ability to debug the code and to prevent these `NaN`s in the first place.

It should become possible to throw an error once the very first `NaN` is encountered, such that the code can be improved to prevent this from happening. Addons might be a suitable solution here.
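As a rough illustration of the "throw on the very first `NaN`" idea, here is a minimal sketch in plain Julia. It is not the RxInfer addon; `hasnan`, `assert_no_nan`, and `run_steps` are hypothetical names, and the three "steps" merely stand in for stages of a longer computation.

```julia
# True if a value (or any element of a possibly nested array) is NaN.
hasnan(x::Real) = isnan(x)
hasnan(xs::AbstractArray) = any(hasnan, xs)

# Pass the value through unchanged, or fail loudly at the first NaN.
function assert_no_nan(value, label::AbstractString)
    hasnan(value) && error("first NaN encountered in ", label)
    return value
end

function run_steps(x)
    x = assert_no_nan(x .- 1.0, "step 1")
    x = assert_no_nan(x ./ x, "step 2")      # 0.0 / 0.0 produces NaN
    return assert_no_nan(x .+ 1.0, "step 3")
end

ok = run_steps([3.0, 2.0])   # no NaN along the way
# run_steps([1.0, 2.0]) throws an error naming "step 2"
```

Failing at the point of creation gives a stack trace that points to the operation that produced the `NaN`, rather than to wherever the result is first inspected.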