Hmm, ok the tests aren't hanging, they're just hideously slow.
So right now I think the tests are failing in three(-ish) ways:
- Some tests are slow and are timing out, and at this point I'm kinda suspecting something's wrong with the CI machines:
  - This is the last job from #463: https://buildkite.com/julialang/dagger-dot-jl/builds/1035. Julia 1.9 and nightly time out on `pibot` nodes.
  - This is a failing job from my first push to this branch: https://buildkite.com/julialang/dagger-dot-jl/builds/1036. Again, 1.9 and nightly time out on `pibot` nodes.
  - And this is the current job, which is still running as I write this: https://buildkite.com/julialang/dagger-dot-jl/builds/1039. This time 1.9 and nightly are on `armageddon` nodes, but 1.10 is on a `pibot` and it looks like it's going to time out.
- The "Modify workers in running job" test is properly borked, still looking into that.

Even going back to the passing tests it looks like the pibots have always taken ~3-4x longer than other nodes: https://buildkite.com/julialang/dagger-dot-jl/builds/944. So either the tests slowed down a lot recently, or the nodes are slower/under heavier load. This is very anecdotal, but comparing the times on the armageddon nodes between https://buildkite.com/julialang/dagger-dot-jl/builds/972 (February) and https://buildkite.com/julialang/dagger-dot-jl/builds/1039 (this PR), I'd guess it's our tests that've slowed down: they went from ~11 minutes to ~20 minutes (assuming the armageddon nodes weren't swapped or put under different load).
All of this looks good aside from the revert of https://github.com/JuliaParallel/Dagger.jl/pull/467 (which should be a strictly beneficial change - I have no idea why this would change precompile behavior). The purpose of that assertion is to ensure that all Dagger tasks finish during precompile, as Julia itself will hang or die when trying to finish precompile with tasks still running in the background. So something is either still keeping tasks alive, or this is just spurious and we need to wait a bit longer for Dagger to clean things up.
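To make that failure mode concrete, here's a minimal, Dagger-free sketch of the general rule being described: anything spawned during a package's precompile workload has to be finished (or shut down) before precompilation ends. The module name and workload below are made up for illustration.

```julia
# Hypothetical package module: if `background` were still running when
# precompilation of this module finishes, Julia could hang or die while
# writing the precompile image. That is the situation the assertion in
# Dagger's precompile.jl is meant to catch early.
module LingeringTaskDemo

function __workload__()
    background = Threads.@spawn sum(rand(10_000))
    # Waiting here is what keeps precompile safe; dropping this `wait` is
    # exactly the "task still alive in the background" case described above.
    wait(background)
end

__workload__()

end # module
```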
Yeah, makes sense, I've removed the reverting commit so the PR can be merged. I think something is still keeping tasks alive, because that test has been failing consistently since #467 (although I haven't been able to reproduce it locally yet). I did wonder if we were hitting https://github.com/JuliaLang/julia/issues/40626 since it's the only multithreaded test, but I tried it with a single thread and it still failed :shrug:
- The DTables.jl test fails because Dagger fails to precompile: this assertion is reached: https://github.com/JuliaParallel/Dagger.jl/blob/master/src/precompile.jl#L20. I've fixed that by reverting "Sch: Disable CHUNK_CACHE evictions for now" (#467), which according to the CI history is the change that broke the test, but in light of "Sch: Temporarily disable chunk cache" (#465) I'm not sure it's the right thing to do 🤔
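For reference, one way to check for this kind of precompile failure locally (the checkout path below is illustrative, and this assumes DTables installs cleanly from the registry):

```julia
using Pkg

# Throwaway environment that uses the local Dagger checkout plus DTables,
# then force precompilation; if the assertion in precompile.jl fires, it
# surfaces here as a precompile error for Dagger.
Pkg.activate(; temp=true)
Pkg.develop(path=expanduser("~/Dagger.jl"))  # illustrative path to this branch
Pkg.add("DTables")
Pkg.precompile()
```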
Hmm, it seems to be something thread-related but I can't grok how. Notes:

- The failure is tied to `evict_all_chunks!()`: I can replace the inner remotecall to `evict_chunks!()` with one to `identity()` and the tests will still pass (see the sketch below).
- Waiting longer in `precompile.jl` does nothing, and the issue is deterministic, so I don't think it's as simple as a race condition.
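To illustrate the swap, here's a simplified, hypothetical shape of the experiment; this is not Dagger's actual `evict_all_chunks!`/`evict_chunks!` code (the real versions live under `src/sch/`):

```julia
using Distributed

# Keep the remote round-trip to every worker, but replace the eviction work
# with `identity` so nothing is actually evicted. If the tests pass with this
# in place, the failure is tied to what `evict_chunks!` does, not to the
# remote calls themselves.
function evict_all_chunks_debug!(chunks)
    @sync for w in workers()
        # Original shape (assumed): remotecall_wait(evict_chunks!, w, chunks)
        @async remotecall_wait(identity, w, chunks)  # debug swap: remote no-op
    end
    return nothing
end
```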
Well, that was a rabbit hole... @jpsamaroo I went a bit further than we discussed and ended up moving all the dynamic worker support exclusively to the eager API in 9aa8eee, because that seemed cleaner than maintaining both. I don't know if that's too breaking though? I also fixed a couple of other bugs along the way. The tests pass locally for me, let's see what CI thinks :crossed_fingers:
TL;DR: I tried `addprocs(; lazy=false)` in an attempt to fix the peer-connection errors (see the sketch below).
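For context, with the default `:all_to_all` topology, `lazy=true` makes worker-to-worker connections only on first use; forcing them up front looks roughly like this (the worker count and flags are illustrative, not necessarily what the test suite uses):

```julia
using Distributed

# Start test workers and establish all worker-to-worker connections eagerly,
# instead of lazily on first use; the lazy setup is where the peer-connection
# errors seemed to come from.
addprocs(3; exeflags="--project", lazy=false)
```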
That did fix them, but the OSX tests seemed to hang at the scheduler tests.

I'll come back to these failures another time, I think the PR can be merged now.
Thanks again!
This fixes some minor-ish things I came across when looking into the tests. I don't know if it'll fix the timeouts in CI, for me all the tests pass locally.