Closed JoeyBF closed 9 months ago
I found the bug. I'm not sure why the issue never popped up before, but I guess it might be a recent update to `rayon`. Here's how the deadlock happens.
1. `ResolutionHomomorphism::extend_through_stem` uses `iter_s_t` to call `ResolutionHomomorphism::extend_step_raw` concurrently for every bidegree.
2. `extend_step_raw` calls `FullModuleHomomorphism::compute_auxiliary_data_through_degree`, which will call `OnceBiVec::extend` on the `kernels` attribute of the module.
3. Several threads (spawned by `iter_s_t`) will try to access that module, and in particular call `extend` on that `OnceBiVec`. Since `OnceBiVec`s are concurrent, the first thread to call `extend` will lock it while it does the writing. Call this thread "A".
4. The closure that `extend` executes calls `ModuleHomomorphism::auxiliary_data`, which then calls `ModuleHomomorphism::get_matrix`.
5. `get_matrix` does its computation using `Matrix::par_iter_mut`, which hands thread A back to rayon's scheduler.
6. Through work stealing, thread A picks up another task spawned by `iter_s_t`, one that computes a different bidegree but in the same homological degree. That task tries to lock the same `kernels` attribute, which thread A is itself holding higher up in the call stack. Thread A hangs forever.

I think the way out is using reentrant mutexes for `OnceVec`. I'll experiment with that.
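The failure mode can be sketched in plain std Rust. This is a toy, not the actual code: a `Mutex<Vec<u32>>` stands in for the `kernels` field, and rayon's work stealing is simulated by running the "stolen" sibling closure directly on the lock-holding thread.

```rust
use std::sync::Mutex;

/// Returns whether a task "stolen" onto the lock-holding thread could
/// re-acquire the lock. It cannot: std's Mutex is not reentrant.
fn stolen_task_can_lock() -> bool {
    let kernels = Mutex::new(Vec::<u32>::new()); // stand-in for the `kernels` field
    let _guard = kernels.lock().unwrap();        // thread A is mid-`extend`
    // Work stealing simulated by running the sibling closure right here,
    // on the same thread, while the guard is still held:
    let stolen_sibling = || kernels.try_lock().is_ok();
    stolen_sibling()
}

fn main() {
    // A blocking lock() in the sibling would hang forever; try_lock shows why.
    assert!(!stolen_task_can_lock());
    println!("the stolen task cannot take the lock thread A already holds");
}
```

The key point is that the deadlock needs no second thread at all: one worker holding the lock and then running scheduler-provided work is enough.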
Turns out that reentrant mutexes won't work. By design they can't give us a mutable reference, because then a thread locking the same lock twice would have two mutable references.
The only option that I see, short of implementing some sort of prioritization of tasks that would be internal to rayon (which would also take care of #105), would be revising the implementation of `OnceVec` so that it becomes lock-free.
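A lock-free, append-only container is possible in safe std Rust. As a rough illustration only (fixed capacity, `OnceLock` slots, an atomic length; this is a toy, not the actual `OnceVec` design):

```rust
use std::array;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Toy lock-free, append-only vector: readers never block, and a push that
/// re-enters on the same thread cannot deadlock because no lock is held.
/// Panics if the fixed capacity N is exceeded.
struct LockFreeVec<T, const N: usize> {
    len: AtomicUsize,
    slots: [OnceLock<T>; N],
}

impl<T, const N: usize> LockFreeVec<T, N> {
    fn new() -> Self {
        Self {
            len: AtomicUsize::new(0),
            slots: array::from_fn(|_| OnceLock::new()),
        }
    }

    /// Reserve the next index atomically, then publish the value into it.
    fn push(&self, value: T) -> usize {
        let i = self.len.fetch_add(1, Ordering::SeqCst);
        self.slots[i].set(value).ok().expect("each slot is written once");
        i
    }

    /// Lock-free read: `None` until index `i` has been published.
    fn get(&self, i: usize) -> Option<&T> {
        self.slots.get(i)?.get()
    }
}

fn main() {
    let v: LockFreeVec<u32, 8> = LockFreeVec::new();
    assert_eq!(v.push(10), 0);
    assert_eq!(v.push(20), 1);
    assert_eq!(v.get(0), Some(&10));
    assert_eq!(v.get(5), None);
    println!("push and get work without any mutex");
}
```

A real revision would also need growth, which is where the design effort lies; the point here is only that the write path holds no lock across user code, so a stolen task can write again without deadlocking.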
See for example the CI pass for 3299940, which timed out. It seems to be a deadlock related to `iter_s_t`, but we haven't changed it in a while. Maybe some dependency introduced it recently.

Edit: It looks like that one was on my fork, but the CI for #141 is currently hanging.