JuliaParallel / Dagger.jl

A framework for out-of-core and parallel execution
Other
621 stars 67 forks source link

Error: `"jl_set_task_tid == 0"` when calling `length(::DTable)` #459

Closed StevenWhitaker closed 8 months ago

StevenWhitaker commented 8 months ago

I ran across an error when trying to run some code on a different machine. I don't know if this error has occurred on my laptop; if it has, it is not (yet) reproducible and only occurs rarely, whereas on the other machine the error happens every time. My laptop runs Windows 11 (WSL 2); the other machine runs some form of Linux.

Here is the (reformatted) error:

From worker 2:    [ 2023-11-23T15:24:36.744 ] pid: 524 proc: 2 Info:  Dagger.ThunkFailedException{Dagger.ThunkFailedException{RemoteException}}(
    Thunk[7](#14, Any[Thunk[6](_file_load, Any["/path/to/file.csv", MyPackage.var"#11#12"(), DataFrames.DataFrame])]),
    Thunk[6](_file_load, Any["/path/to/file.csv", MyPackage.var"#11#12"(), DataFrames.DataFrame]),
    Dagger.ThunkFailedException{RemoteException}(
        Thunk[6](_file_load, Any["/path/to/file.csv", MyPackage.var"#11#12"(), DataFrames.DataFrame]),
        Thunk[6](_file_load, Any["/path/to/file.csv", MyPackage.var"#11#12"(), DataFrames.DataFrame]),
        RemoteException(
            2,
            CapturedException(
                ErrorException("jl_set_task_tid == 0"),
                Any[
                    (error at error.jl:35, 1),
                    (#execute!#40 at processor.jl:172, 1),
                    (execute! at processor.jl:162, 1),
                    (#170 at Sch.jl:1573 [inlined], 1),
                    (#21 at options.jl:17 [inlined], 1),
                    (#1 at ScopedValues.jl:163, 1),
                    (with_logstate at logging.jl:514, 1),
                    (with_logger at logging.jl:626 [inlined], 1),
                    (enter_scope at payloadlogger.jl:17 [inlined], 1),
                    (with at ScopedValues.jl:162, 1),
                    (with_options at options.jl:16, 1),
                    (do_task at Sch.jl:1571, 1),
                    (macro expansion at Sch.jl:1264 [inlined], 1),
                    (#143 at task.jl:134, 1)
                ]
            )
        )
    )
)
From worker 2:    [ 2023-11-23T15:24:36.745 ] pid: 524 proc: 2 Info:  Caught error
From worker 2:    [ 2023-11-23T15:24:36.939 ] pid: 524 proc: 2 Info:  Base.StackTraces.StackFrame[
    fetch(t::Dagger.ThunkFuture; proc::Dagger.OSProc, raw::Bool) at eager_thunk.jl:16,
    fetch at eager_thunk.jl:11 [inlined],
    #fetch#73 at eager_thunk.jl:58 [inlined],
    fetch at eager_thunk.jl:54 [inlined],
    _broadcast_getindex_evalf at broadcast.jl:683 [inlined],
    _broadcast_getindex at broadcast.jl:656 [inlined],
    getindex at broadcast.jl:610 [inlined],
    copy at broadcast.jl:912 [inlined],
    materialize at broadcast.jl:873 [inlined],
    chunk_lengths(table::DTables.DTable) at dtable.jl:260,
    length(table::DTables.DTable) at dtable.jl:265,
    ⋮, # My code that ends up calling `length(::DTable)`
    invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}) at essentials.jl:816,
    invokelatest(::Any, ::Any, ::Vararg{Any}) at essentials.jl:813,
    (::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}})() at process_messages.jl:285,
    run_work_thunk(thunk::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}}, print_error::Bool) at process_messages.jl:70,
    macro expansion at process_messages.jl:285 [inlined],
    (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})() at task.jl:514
]

Caught error
      From worker 2:    │   ex =
      From worker 2:    │    ThunkFailedException:
      From worker 2:    │      Root Exception Type: RemoteException
      From worker 2:    │      Root Exception:
      From worker 2:    │    jl_set_task_tid == 0
      From worker 2:    │    Stacktrace:
      From worker 2:    │      [1] error
      From worker 2:    │        @ ./error.jl:35
      From worker 2:    │      [2] #execute!#40
      From worker 2:    │        @ ~/.julia/packages/Dagger/rEYiB/src/processor.jl:172
      From worker 2:    │      [3] execute!
      From worker 2:    │        @ ~/.julia/packages/Dagger/rEYiB/src/processor.jl:162
      From worker 2:    │      [4] #170
      From worker 2:    │        @ ~/.julia/packages/Dagger/rEYiB/src/sch/Sch.jl:1573 [inlined]
      From worker 2:    │      [5] #21
      From worker 2:    │        @ ~/.julia/packages/Dagger/rEYiB/src/options.jl:17 [inlined]
      From worker 2:    │      [6] #1
      From worker 2:    │        @ ~/.julia/packages/ScopedValues/Kvcrb/src/ScopedValues.jl:163
      From worker 2:    │      [7] with_logstate
      From worker 2:    │        @ ./logging.jl:514
      From worker 2:    │      [8] with_logger
      From worker 2:    │        @ ./logging.jl:626 [inlined]
      From worker 2:    │      [9] enter_scope
      From worker 2:    │        @ ~/.julia/packages/ScopedValues/Kvcrb/src/payloadlogger.jl:17 [inlined]
      From worker 2:    │     [10] with
      From worker 2:    │        @ ~/.julia/packages/ScopedValues/Kvcrb/src/ScopedValues.jl:162
      From worker 2:    │     [11] with_options
      From worker 2:    │        @ ~/.julia/packages/Dagger/rEYiB/src/options.jl:16
      From worker 2:    │     [12] do_task
      From worker 2:    │        @ ~/.julia/packages/Dagger/rEYiB/src/sch/Sch.jl:1571
      From worker 2:    │     [13] macro expansion
      From worker 2:    │        @ ~/.julia/packages/Dagger/rEYiB/src/sch/Sch.jl:1264 [inlined]
      From worker 2:    │     [14] #143
      From worker 2:    │        @ ./task.jl:134
      From worker 2:    │      Root Thunk:  Thunk(id=6, _file_load(/path/to/file.csv, #11, DataFrames.DataFrame))
      From worker 2:    │      Inner Thunk: Thunk(id=7, #14(Thunk[6](_file_load, Any["/path/to/file.csv", MyPackage.var"#11#12"(), DataFrames.DataFrame])))
      From worker 2:    └      This Thunk:  Thunk(id=7, #14(Thunk[6](_file_load, Any["/path/to/file.csv", MyPackage.var"#11#12"(), DataFrames.DataFrame])))

I am in the process of trying to more fully mimic the setup of the other machine (different Julia patch version, more Julia processes, different --heap-size-hint) to see if I can reproduce the error.

Do you have any idea what would cause the jl_set_task_tid == 0 error?

StevenWhitaker commented 8 months ago

In case this info helps, sometimes the jl_set_task_tid == 0 error occurs on a different worker (e.g., worker 4 instead of worker 2).

StevenWhitaker commented 8 months ago

Here is more info about the other machine:

NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"

My laptop runs Ubuntu 22.04 on WSL 2.

jpsamaroo commented 8 months ago

Seems like it might be that I'm not understanding how that function should be used - I'm gathering details on Slack and will try to push a fix today.

StevenWhitaker commented 8 months ago

@jpsamaroo Any updates?

jpsamaroo commented 8 months ago

https://github.com/JuliaParallel/Dagger.jl/pull/462 should resolve this, potentially with a small performance impact if we keep failing to set the task TID (since it will give up and just the task run wherever Julia's scheduler decides to place it).

StevenWhitaker commented 8 months ago

This is great news! I'll try it out and let you know if the issue is resolved.

StevenWhitaker commented 8 months ago

@jpsamaroo it appears your PR resolved the issue I was seeing on the other Linux machine. Thanks!