JuliaMolSim / DFTK.jl

Density-functional toolkit
https://docs.dftk.org
MIT License

Threading + MPI #974

Open antoine-levitt opened 1 month ago

antoine-levitt commented 1 month ago

I've had this happen when running DFTK from within threads. I'm not too clear on what we should do here.

ERROR: LoadError: TaskFailedException

    nested task error: UndefRefError: access to undefined reference
    Stacktrace:
      [1] getindex
        @ ./essentials.jl:892 [inlined]
      [2] popfirst!
        @ ./array.jl:1706 [inlined]
      [3] run_init_hooks()
        @ MPI ~/.julia/packages/MPI/rwDDn/src/environment.jl:65
      [4] Init(; threadlevel::Symbol, finalize_atexit::Bool, errors_return::Bool)
        @ MPI ~/.julia/packages/MPI/rwDDn/src/environment.jl:155
      [5] Init
        @ ~/.julia/packages/MPI/rwDDn/src/environment.jl:114 [inlined]
      [6] PlaneWaveBasis(model::Model{…}, Ecut::Float64, fft_size::Tuple{…}, variational::Bool, kgrid::MonkhorstPack, symmetries_respect_rgrid::Bool, use_symmetries_for_kpoint_reduction::Bool, comm_kpts::MPI.Comm, architecture::DFTK.CPU)
        @ DFTK ~/.julia/dev/DFTK/src/PlaneWaveBasis.jl:247
      [7] #PlaneWaveBasis#141
        @ ~/.julia/dev/DFTK/src/PlaneWaveBasis.jl:399 [inlined]
      [8] setup_calculation(s::Int64, n_electrons::Int64, b::Int64, α::Int64; scaling::Symbol, α_q::Int64, α_r::Int64)
        @ Main ~/Dropbox/recherche/2020-11-anyons/new/functions.jl:239
      [9] setup_calculation
        @ ~/Dropbox/recherche/2020-11-anyons/new/functions.jl:207 [inlined]
     [10] 
        @ Main ~/Dropbox/recherche/2020-11-anyons/new/functions.jl:244
     [11] macro expansion
        @ ~/Dropbox/recherche/2020-11-anyons/new/compute.jl:25 [inlined]
     [12] (::var"#33#threadsfor_fun#23"{Int64, Int64, String, Channel{Int64}})(tid::Int64)
        @ Main ./threadingconstructs.jl:209
     [13] (::Base.Threads.var"#1#2"{var"#33#threadsfor_fun#23"{Int64, Int64, String, Channel{Int64}}, Int64})()
        @ Base.Threads ./threadingconstructs.jl:154
    Some type information was truncated. Use `show(err)` to see complete types.

...and 5 more exceptions.

Stacktrace:
 [1] threading_run(fun::var"#33#threadsfor_fun#23"{Int64, Int64, String, Channel{Int64}}, static::Bool)
   @ Base.Threads ./threadingconstructs.jl:172
 [2] macro expansion
   @ ./threadingconstructs.jl:189 [inlined]
 [3] top-level scope
   @ ~/Dropbox/recherche/2020-11-anyons/new/compute.jl:21

epolack commented 1 month ago

I remember being able to launch it in a quick and dirty way, but I am not so sure anymore…

On a local branch, I added a way to switch off the three places where Threads is used.

antoine-levitt commented 1 month ago

It works most of the time, but I just had this happen once. By switching off, do you mean this? https://github.com/JuliaMolSim/DFTK.jl/pull/972

epolack commented 1 month ago

Right now, it works none of the time for me on some other stuff I am doing…

Yes, I was indeed looking at #972, and it looks a lot like what I am using for parallel phonons.

(I think I gave up looking at how to do thread in thread because of the @timing stuff.)

antoine-levitt commented 1 month ago

> (I think I gave up looking at how to do thread in thread because of the @timing stuff.)

Yeah, should we just disable this by default?

epolack commented 1 month ago

I have never used the fact that it's enabled by default. I've always found this surprising.

mfherbst commented 1 month ago

> I've had this happen when running DFTK from within threads.

I think this is because MPI is initialised twice. We should either guard the initialisation call with a semaphore, or signal to MPI when we initialise it that it may be called from multiple threads (I think it has a flag for that).
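
For illustration, here is a minimal sketch of the guarded-initialisation idea, combined with the thread-level flag. It assumes MPI.jl's `MPI.Initialized()` and the `threadlevel` keyword of `MPI.Init` (the keyword is visible in the stack trace above). The names `MPI_INIT_LOCK` and `ensure_mpi_initialized` are hypothetical and do not exist in DFTK or MPI.jl. Whether `:multiple` is actually granted depends on the underlying MPI library.

```julia
using MPI

# Hypothetical helper (not part of DFTK or MPI.jl): a single global lock that
# serialises every attempt to initialise MPI from Julia threads.
const MPI_INIT_LOCK = ReentrantLock()

function ensure_mpi_initialized(; threadlevel::Symbol=:multiple)
    lock(MPI_INIT_LOCK) do
        # Only the first caller runs MPI.Init; later callers find MPI already
        # initialised and skip the call.
        if !MPI.Initialized()
            MPI.Init(; threadlevel=threadlevel)
        end
    end
    return nothing
end

# Usage: concurrent calls are now safe, e.g. from a threaded loop.
Threads.@threads for i in 1:8
    ensure_mpi_initialized()
end
```

On the user side, a simpler workaround is probably to call `MPI.Init()` once on the main thread before entering the threaded loop, so that no thread ever reaches the initialisation path concurrently.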