StevenWhitaker opened this issue 1 year ago.
@StevenWhitaker can you try reproducing these again on Dagger master?
Thanks for getting a patch released!
The issues are different now, so that's something ;)
Now I observe the following behavior (EDIT: when running Julia with multiple threads): the following error occurs while computing `sums` (not sure if it was in `map`, `reduce`, or `fetch`):
Unhandled Task ERROR: ArgumentError: destination has fewer elements than required
Stacktrace:
[1] copyto!(dest::Vector{Dagger.Sch.ProcessorState}, src::Base.ValueIterator{Dict{Dagger.Processor, Dagger.Sch.ProcessorState}})
@ Base ./abstractarray.jl:949
[2] _collect
@ ./array.jl:713 [inlined]
[3] collect
@ ./array.jl:707 [inlined]
[4] macro expansion
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1189 [inlined]
[5] (::Dagger.Sch.var"#128#135"{Dagger.Sch.ProcessorInternalState, UInt64, RemoteChannel{Channel{Any}}, Dagger.ThreadProc})()
@ Dagger.Sch ./task.jl:134
After this error, most of the time it hangs, sometimes it runs to completion.
I realized that I start Julia with multiple threads by default, so I also ran the code with a single thread (`julia --project -t1`). In this case, I saw the `Unhandled Task ERROR` once (incidentally, the first time), and every time I ran the code (including the first time) it ran to completion.
So, besides the one sporadic error, this issue seems to be addressed, assuming the issues I observed with multiple threads are due to the interplay between `Distributed` and `Threads`.
Edit to my previous comment:
I'm running my actual code with a single thread now, and it also hangs, so there might be something else still at play.
I can reproduce the hangs - I'll keep investigating! Thanks for your patience :slightly_smiling_face:
Running through your example with Dagger's logging enabled, I find that we spend a good bit of time (about 0.3-0.5 s for me) in the `reduce` calls at the end, which are running in serial over 233K keys - at this pace, I can see why it looks like it's hanging :laughing:
A large portion of the time is spent in the GC (about 40% of the time over ~80K allocations totaling ~500MB), so I suspect allocations are what's killing performance. If I can figure out how to reduce those allocations, it would also be reasonable to parallelize the `reduce` calls (by doing two `map`s, one to launch a task per key, and one to fetch the results), and that should give us much better runtimes.
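As a rough illustration of that "two `map`s" idea, here is a minimal sketch using plain Dagger tasks; `per_key_reduce` and `all_keys` are hypothetical stand-ins for the real per-key reduction and key set, not code from this issue:

```julia
using Dagger

# Hypothetical per-key work; stands in for whatever one serial `reduce`
# iteration currently does for a single key.
per_key_reduce(key) = sum(rand(100))

all_keys = 1:1000  # placeholder key set

# First map: launch one Dagger task per key (returns immediately).
tasks = map(k -> Dagger.@spawn(per_key_reduce(k)), all_keys)

# Second map: fetch every result; the tasks run in parallel in the meantime.
results = map(fetch, tasks)
```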
Additionally, the other calls that took a while are `select` and `groupby`, so we could probably look into improving those a bit.
EDIT: Those timings and allocations are so high because of logging - they drop significantly when logging is disabled, although then I see a ton of long-lived allocations that threaten to crash Julia. I still need to see if some of those allocations can be reduced.
EDIT 2: Silly me, these reductions are already asynchronous :smile: I guess the task completes before we return from `reduce` anyway, since we're only running with 1 thread.
Ok, something that I would recommend is, instead of the `map` -> `reduce` pattern, just use a single `reduce` call: `reduce(+, gdt; cols=Symbol.(names(df)[[93,94]]))`. This appears to be much more memory- and time-efficient, which makes sense because it can internally do more optimizations (it already knows that you intend to reduce over each key in the group).
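For concreteness, here is a minimal sketch of that single-`reduce` pattern; `df`, the chunk size, and `:key` are placeholders (not from this issue), and the `cols` selection mirrors the snippet above:

```julia
using DTables, DataFrames

# Hypothetical setup: partition the source DataFrame into a DTable and group it.
dt  = DTable(df, 10_000)
gdt = groupby(dt, :key)

# Single reduce over the grouped table, restricted to the columns of interest
# (columns 93 and 94 here, as in the snippet above). The reduction runs
# asynchronously, so fetch the result when it is needed.
sums = fetch(reduce(+, gdt; cols = Symbol.(names(df)[[93, 94]])))
```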
Can you test that and confirm whether it speeds your script up sufficiently for it to complete in a reasonable amount of time?
Thanks for the tip. I tried it out on my actual project (not the exact example in the OP), and it does seem to help, but I still see the code hang occasionally. I'm pretty sure it's not just taking forever, because when the code does complete, it doesn't take that long, and when it hangs the cpu utilization drops to 0.
It actually seems to be the case that my code hangs only when calling my main function again after a successful run. Or at least the chances of hanging are higher in that case. I'm not really sure why that would be the case, though.
I also saw a new error (when calling `fetch` on a `DTable`, with Dagger v0.18.4 and DTables v0.4.2):
Dagger.ThunkFailedException{Dagger.ThunkFailedException{CapturedException}}(Thunk[3](isnonempty, Any[Thunk[2](_file_load, Any["path/to/file.csv", NRBS.var"#1#2"(), DataFrames.DataFrame])]), Thunk[2](_file_load, Any["path/to/file.csv", NRBS.var"#1#2"(), DataFrames.DataFrame]), Dagger.ThunkFailedException{CapturedException}(Thunk[2](_file_load, Any["path/to/file.csv", NRBS.var"#1#2"(), DataFrames.DataFrame]), Thunk[2](_file_load, Any["path/to/file.csv", NRBS.var"#1#2"(), DataFrames.DataFrame]), CapturedException(UndefRefError(), Any[(getindex at essentials.jl:13 [inlined], 1), (get! at dict.jl:465, 1), (OSProc at processor.jl:109 [inlined], 2), (do_task at Sch.jl:1368, 1), (macro expansion at Sch.jl:1243 [inlined], 1), (#132 at task.jl:134, 1)])))
It looks like it has to do with file loading, so this is the code I use to load .csv files:
`DTable(x -> CSV.File(x), [filepath]; tabletype = DataFrame)`
I only saw the error once, though.
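For reference, that loading pattern in a self-contained form (the path is a placeholder; the trailing `fetch` only illustrates materializing the table and is not part of the loading itself):

```julia
using DTables, DataFrames, CSV

filepath = "path/to/file.csv"  # placeholder path

# One DTable partition per file; each file is parsed with CSV.File and
# materialized as a DataFrame when its partition is computed.
dt = DTable(x -> CSV.File(x), [filepath]; tabletype = DataFrame)

# Collect the whole DTable back into a single DataFrame.
df = fetch(dt)
```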
And another one-time error (in the function with the `reduce` call):
From worker 4: ┌ 2023-10-24T13:00:07.238 ] pid: 20516 proc: 4 Error: Error on 4 while connecting to peer 3, exiting
From worker 4: │ exception =
From worker 4: │ ConcurrencyViolationError("lock must be held")
From worker 4: │ Stacktrace:
From worker 4: │ [1] concurrency_violation()
From worker 4: │ @ Base ./condition.jl:8
From worker 4: │ [2] assert_havelock
From worker 4: │ @ ./condition.jl:25 [inlined]
From worker 4: │ [3] assert_havelock
From worker 4: │ @ ./condition.jl:48 [inlined]
From worker 4: │ [4] assert_havelock
From worker 4: │ @ ./condition.jl:72 [inlined]
From worker 4: │ [5] notify(c::Condition, arg::Any, all::Bool, error::Bool)
From worker 4: │ @ Base ./condition.jl:150
From worker 4: │ [6] #notify#622
From worker 4: │ @ ./condition.jl:148 [inlined]
From worker 4: │ [7] notify (repeats 2 times)
From worker 4: │ @ ./condition.jl:148 [inlined]
From worker 4: │ [8] set_worker_state
From worker 4: │ @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:148 [inlined]
From worker 4: │ [9] Distributed.Worker(id::Int, r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, manager::Distributed.DefaultClusterManager; version::Nothing, config::WorkerConfig)
From worker 4: │ @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:126
From worker 4: │ [10] Worker
From worker 4: │ @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:116 [inlined]
From worker 4: │ [11] connect_to_peer(manager::Distributed.DefaultClusterManager, rpid::Int, wconfig::WorkerConfig)
From worker 4: │ @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:363
From worker 4: │ [12] (::Distributed.var"#121#123"{Int, WorkerConfig})()
From worker 4: │ @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:349
From worker 4: │ [13] exec_conn_func(w::Distributed.Worker)
From worker 4: │ @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:181
From worker 4: │ [14] (::Distributed.var"#21#24"{Distributed.Worker})()
From worker 4: └ @ Distributed ./task.jl:514
The above errors occurred when calling my main function the first time.
I tried to create a MWE that was closer to the actual workflow I'm working with. I'm guessing the errors occurring here are related to #437 (one of the four reported errors below is the same as the linked issue). I hope this is helpful and not just extra noise!
Contents of `mwe.jl`:

I `include`d `mwe.jl` in a fresh Julia session multiple times (meaning each `include` occurred in its own fresh Julia session) and recorded the following errors. Note that nothing changed in `mwe.jl` from run to run.

Error 1:
Error 2:
Error 3:
Error 3b: Occasionally the segfault was preceded by one or more occurrences of:
Error 4:
Comments:
- `MethodError` with `convert` (error 1): I most commonly run into the error mentioned in https://github.com/JuliaParallel/Dagger.jl/issues/437#issuecomment-1739631443, which I did not see with `mwe.jl`.
- `"file.csv"` is a 157 MB table with 233930 rows and 102 columns of `String` and `Float64` values.
- `remotecall` probably isn't necessary for reproducing the bugs, but I included it because that is how my actual work is.
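Since the contents of `mwe.jl` aren't reproduced above, the following is only a hypothetical skeleton of the workflow described in this thread (a `remotecall` onto a worker that loads a CSV into a `DTable`, groups it, and reduces); every name in it (`load_table`, `:some_key`, the column symbols, the path) is a placeholder rather than the actual MWE:

```julia
using Distributed
addprocs(3)
@everywhere using DTables, DataFrames, CSV

# Placeholder loader mirroring the snippet quoted earlier in the thread.
@everywhere load_table(filepath) =
    DTable(x -> CSV.File(x), [filepath]; tabletype = DataFrame)

# Launch the work from a worker via remotecall, as described above.
fut = remotecall(first(workers())) do
    dt  = load_table("path/to/file.csv")            # placeholder path
    gdt = groupby(dt, :some_key)                    # placeholder grouping column
    fetch(reduce(+, gdt; cols = [:col93, :col94]))  # placeholder numeric columns
end
result = fetch(fut)
```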