Closed JoeyBF closed 9 months ago
I found the bug. I'm not sure why the issue never popped up before, but I guess it might be a recent update to `rayon`. Here's how the deadlock happens.
1. `ResolutionHomomorphism::extend_through_stem` uses `iter_s_t` to call `ResolutionHomomorphism::extend_step_raw` concurrently for every bidegree.
2. `extend_step_raw` calls `FullModuleHomomorphism::compute_auxiliary_data_through_degree`, which will call `OnceBiVec::extend` on the `kernels` attribute of the module.
3. Several threads (spawned by `iter_s_t`) will try to access that module, and in particular call `extend` on that `OnceBiVec`. Since `OnceBiVec`s are concurrent, the first thread to call `extend` will lock it while it does the writing. Call this thread "A".
4. The closure that `extend` executes calls `ModuleHomomorphism::auxiliary_data`, which then calls `ModuleHomomorphism::get_matrix`.
5. `get_matrix` does its computation using `Matrix::par_iter_mut`, which hands thread A back to rayon's scheduler.
6. Through work stealing, thread A picks up another task spawned by `iter_s_t`, one that computes a different bidegree but in the same homological degree. That task tries to lock the same `kernels` attribute, which thread A is itself holding higher up in the call stack. Thread A hangs forever.

I think the way out is using reentrant mutexes for `OnceVec`. I'll experiment with that.
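The failure mode can be sketched in plain std Rust. This is a toy, not the actual code: a `Mutex<Vec<u32>>` stands in for the `kernels` field, and rayon's work stealing is simulated by running the "stolen" sibling closure directly on the lock-holding thread.

```rust
use std::sync::Mutex;

/// Returns whether a task "stolen" onto the lock-holding thread could
/// re-acquire the lock. It cannot: std's Mutex is not reentrant.
fn stolen_task_can_lock() -> bool {
    let kernels = Mutex::new(Vec::<u32>::new()); // stand-in for the `kernels` field
    let _guard = kernels.lock().unwrap();        // thread A is mid-`extend`
    // Work stealing simulated by running the sibling closure right here,
    // on the same thread, while the guard is still held:
    let stolen_sibling = || kernels.try_lock().is_ok();
    stolen_sibling()
}

fn main() {
    // A blocking lock() in the sibling would hang forever; try_lock shows why.
    assert!(!stolen_task_can_lock());
    println!("the stolen task cannot take the lock thread A already holds");
}
```

The key point is that the deadlock needs no second thread at all: one worker holding the lock and then running scheduler-provided work is enough.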
Turns out that reentrant mutexes won't work. By design they can't give us a mutable reference, because then a thread locking the same lock twice would have two mutable references.
The only option that I see, short of implementing some sort of prioritization of tasks that would be internal to rayon (which would also take care of #105), would be revising the implementation of `OnceVec` so that it becomes lock-free.
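A lock-free, append-only container is possible in safe std Rust. As a rough illustration only (fixed capacity, `OnceLock` slots, an atomic length; this is a toy, not the actual `OnceVec` design):

```rust
use std::array;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Toy lock-free, append-only vector: readers never block, and a push that
/// re-enters on the same thread cannot deadlock because no lock is held.
/// Panics if the fixed capacity N is exceeded.
struct LockFreeVec<T, const N: usize> {
    len: AtomicUsize,
    slots: [OnceLock<T>; N],
}

impl<T, const N: usize> LockFreeVec<T, N> {
    fn new() -> Self {
        Self {
            len: AtomicUsize::new(0),
            slots: array::from_fn(|_| OnceLock::new()),
        }
    }

    /// Reserve the next index atomically, then publish the value into it.
    fn push(&self, value: T) -> usize {
        let i = self.len.fetch_add(1, Ordering::SeqCst);
        self.slots[i].set(value).ok().expect("each slot is written once");
        i
    }

    /// Lock-free read: `None` until index `i` has been published.
    fn get(&self, i: usize) -> Option<&T> {
        self.slots.get(i)?.get()
    }
}

fn main() {
    let v: LockFreeVec<u32, 8> = LockFreeVec::new();
    assert_eq!(v.push(10), 0);
    assert_eq!(v.push(20), 1);
    assert_eq!(v.get(0), Some(&10));
    assert_eq!(v.get(5), None);
    println!("push and get work without any mutex");
}
```

A real revision would also need growth, which is where the design effort lies; the point here is only that the write path holds no lock across user code, so a stolen task can write again without deadlocking.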
See for example the CI pass for 3299940, which timed out. It seems to be a deadlock related to `iter_s_t`, but we haven't changed it in a while. Maybe some dependency introduced it recently.

Edit: It looks like that one was on my fork, but the CI for #141 is currently hanging.