astro-group-bristol / Gradus.jl

Extensible, spacetime-agnostic general relativistic ray-tracing (GRRT).
https://astro-group-bristol.github.io/Gradus.jl/dev/
GNU General Public License v3.0

ProgressMeter is not thread safe and overflows/race conditions with EnsembleEndpointThreads #87

Closed fjebaker closed 1 year ago

fjebaker commented 1 year ago

There is a relatively small chance that this happens:

    nested task error: Something went wrong. Integrator stepped past tstops but the algorithm was dtchangeable. Please report this error.
    Stacktrace:
     [1] error(s::String)
       @ Base ./error.jl:35
     [2] handle_tstop! ...

Given that it is non-deterministic, this is almost certainly a threading problem, with some race condition being hit.


It turns out this is all related to weird overflows in the progress meter: commenting out any and all progress-meter calls fixes everything. So we need some sort of thread safety here; it is likely that copying the meter state per thread is the culprit.
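
A minimal sketch of the kind of guard needed, putting ProgressMeter's next! behind a lock (illustrative only, not the Gradus internals):

using ProgressMeter

# serialise all mutation of the shared meter state behind a lock,
# so concurrent next! calls cannot race on the internal counter
function run_with_meter(work, n)
    progress = Progress(n)
    meter_lock = ReentrantLock()
    Threads.@threads for i in 1:n
        work(i)
        lock(meter_lock) do
            next!(progress)
        end
    end
end

# e.g. run_with_meter(i -> sum(rand(1000)), 400)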

fjebaker commented 1 year ago

It also seems that some of the rays just terminate immediately?

using Gradus

# Kerr metric with M = 1, a = 1
m = KerrMetric(1.0, 1.0)
# observer four-position (t, r, θ, ϕ): far out in the equatorial plane
u = SVector(0.0, 1000.0, deg2rad(90), 0.0)

scale = 4
img = @time rendergeodesics(
    m,
    u,
    2000.0, # maximum integration time
    image_width = 200 * scale,
    image_height = 200 * scale,
    fov_factor = 14.0 * scale,
    verbose = true,
    closest_approach = 1.001,
)

using Plots
heatmap(img, aspect_ratio = 1)

has a small chance of dead pixels.

fjebaker commented 1 year ago

Original issue can be resolved by deepcopying the solver here:

https://github.com/astro-group-bristol/Gradus.jl/blob/a0e82d4da721dc5cf7739cacc49ee2db401b750e/src/tracing/tracing.jl#L208-L215
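
The idea, sketched with illustrative names rather than the permalinked code: give every task its own copy of the solver, so no integrator caches or tstops heaps are shared between threads.

using OrdinaryDiffEq

# solve_all and u0s are illustrative stand-ins, not Gradus API
function solve_all(prob, u0s, solver)
    sols = Vector{Any}(undef, length(u0s))
    Threads.@threads for i in eachindex(u0s)
        # remake gives each task its own problem; deepcopy its own solver
        sols[i] = solve(remake(prob; u0 = u0s[i]), deepcopy(solver))
    end
    sols
end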

Edit: that did not entirely fix it; the crash still happened in about 1 out of 10 runs.

fjebaker commented 1 year ago

Can confirm none of this happens with ensemble = EnsembleThreads().
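
That is, as a workaround (reusing m, u, and scale from the snippet above, and assuming rendergeodesics forwards the ensemble keyword as this implies):

img = rendergeodesics(
    m,
    u,
    2000.0,
    image_width = 200 * scale,
    image_height = 200 * scale,
    fov_factor = 14.0 * scale,
    verbose = true,
    ensemble = EnsembleThreads(), # instead of EnsembleEndpointThreads
)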

fjebaker commented 1 year ago

Deepcopying the integrator seems to have fixed the crash, but the dead pixels still exist. Perhaps this is something in process_solution?

Edit: nevermind, not quite:

    nested task error: BoundsError: attempt to access 0-element Vector{Float64} at index [1]
    Stacktrace:
      [1] getindex
        @ ./array.jl:924 [inlined]
      [2] heappop!(xs::Vector{Float64}, o::DataStructures.FasterForward)
        @ DataStructures ~/.julia/packages/DataStructures/59MD0/src/heaps/arrays_as_heaps.jl:57
      [3] pop!
        @ ~/.julia/packages/DataStructures/59MD0/src/heaps/binary_heap.jl:107 [inlined]
      [4] pop_tstop!
        @ ~/.julia/packages/OrdinaryDiffEq/pIBDs/src/integrators/integrator_interface.jl:195 [inlined]
      [5] handle_tstop! ...

fjebaker commented 1 year ago

verbose = false fixes (almost) everything, so it must be that the call to the progress bar is somehow overflowing or triggering weird race conditions?
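
One possible shape for a fix, sketched under assumptions (this is not necessarily the eventual Gradus solution): confine all meter mutation to a single task, and have the worker threads only bump an atomic counter.

using ProgressMeter

function run_with_polled_meter(work, n)
    done = Threads.Atomic{Int}(0)
    progress = Progress(n)
    # a single task owns the meter; workers never touch its state
    poller = Threads.@spawn begin
        while done[] < n
            update!(progress, done[])
            sleep(0.1)
        end
        finish!(progress)
    end
    Threads.@threads for i in 1:n
        work(i)
        Threads.atomic_add!(done, 1)
    end
    wait(poller)
end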