JuliaParallel / Dagger.jl

A framework for out-of-core and parallel execution
Other
610 stars 66 forks source link

Streaming branch fixes #496

Closed JamesWrigley closed 3 months ago

JamesWrigley commented 3 months ago

This fixes some minor-ish things I came across when looking into the tests. I don't know if it'll fix the timeouts in CI, for me all the tests pass locally.

JamesWrigley commented 3 months ago

Hmm, ok the tests aren't hanging, they're just hideously slow.

JamesWrigley commented 3 months ago

So right now I think the tests are failing in three(-ish) ways:

JamesWrigley commented 3 months ago

Even going back to the passing tests it looks like the pibots have always taken ~3-4x longer than other nodes: https://buildkite.com/julialang/dagger-dot-jl/builds/944 So either the tests slowed down a lot recently, or the nodes are slower/under heavier load. This is very anecdotal but I'd guess that it's our tests that've slowed down from comparing the times on the armageddon nodes from these jobs: https://buildkite.com/julialang/dagger-dot-jl/builds/972 (February) and https://buildkite.com/julialang/dagger-dot-jl/builds/1039 (this PR)

The tests went from ~11 minutes to ~20 minutes (I am assuming the armageddon nodes weren't swapped/had their loads changed).

jpsamaroo commented 3 months ago

All of this looks good aside from the revert of https://github.com/JuliaParallel/Dagger.jl/pull/467 (which should be a strictly beneficial change - I have no idea why this would change precompile behavior). The purpose of that assertion is to ensure that all Dagger tasks finish during precompile, as Julia itself will hang or die when trying to finish precompile with tasks still running in the background. So something is either still keeping tasks alive, or this is just spurious and we need to wait a bit longer for Dagger to clean things up.

JamesWrigley commented 3 months ago

Yeah makes sense, I've removed the reverting commit so the PR can be merged. I think something is still keeping the tasks alive because that test is consistently failing since #467 (albeit I haven't been able to reproduce it locally yet). I did wonder if we were hitting https://github.com/JuliaLang/julia/issues/40626 since it's the only multithreaded test, but I tried it with a single thread and it still failed :shrug:

JamesWrigley commented 3 months ago

Hmm, it seems to be something thread-related but I can't grok how. Notes:

JamesWrigley commented 3 months ago

Well, that was a rabbit hole... @jpsamaroo I went a bit further than we discussed and ended up moving all the dynamic worker support exclusively to the eager API in 9aa8eee because that seemed cleaner than maintaining both. I don't know if that's too breaking though? I also fixed a couple of other bugs along the way. The tests pass locally for me, let's see what CI thinks :crossed_fingers:

JamesWrigley commented 3 months ago

TL;DR:

I'll come back to these failures another time, I think the PR can be merged now.

jpsamaroo commented 3 months ago

Thanks again!