ITensor / ITensorParallel.jl

Parallel tools for ITensors.jl.

[BUG] Threaded_blocksparse incompatible with MPI when using DMRG with QNs #11

Closed: nbaldelli closed this issue 1 year ago

nbaldelli commented 1 year ago

If one activates threaded_blocksparse and MPI parallelization together to perform a DMRG calculation with number-conserving sites (the SiteType does not matter), the following error is returned during the local exact diagonalization step (in this particular case the code I ran was this example):

ERROR: LoadError: BoundsError: attempt to access 4-element Vector{Pair{QN, Int64}} at index [5]
Stacktrace:
  [1] getindex
    @ ./array.jl:861 [inlined]
  [2] getindex
    @ ./abstractarray.jl:1221 [inlined]
  [3] blockdim
    @ ~/.julia/packages/ITensors/5dcHw/src/qn/qnindex.jl:15 [inlined]
  [4] blockdim
    @ ~/.julia/packages/ITensors/5dcHw/src/qn/qnindex.jl:256 [inlined]
  [5] blockdim
    @ ~/.julia/packages/ITensors/5dcHw/src/qn/qnindex.jl:273 [inlined]
  [6] blockdim
    @ ~/.julia/packages/NDTensors/lbVmG/src/blocksparse/blockdims.jl:131 [inlined]
  [7] #124
    @ ~/.julia/packages/NDTensors/lbVmG/src/blocksparse/blockdims.jl:140 [inlined]
  [8] macro expansion
    @ ./ntuple.jl:74 [inlined]
  [9] ntuple
    @ ./ntuple.jl:69 [inlined]
 [10] blockdims
    @ ~/.julia/packages/NDTensors/lbVmG/src/blocksparse/blockdims.jl:140 [inlined]
 [11] blockdim
    @ ~/.julia/packages/NDTensors/lbVmG/src/blocksparse/blockdims.jl:149 [inlined]
 [12] blockoffsets(blocks::Vector{Block{4}}, inds::NTuple{4, Index{Vector{Pair{QN, Int64}}}})
    @ NDTensors ~/.julia/packages/NDTensors/lbVmG/src/blocksparse/blockoffsets.jl:71
 [13] (NDTensors.BlockSparseTensor)(#unused#::Type{Float64}, blocks::Vector{Block{4}}, inds::NTuple{4, Index{Vector{Pair{QN, Int64}}}})
    @ NDTensors ~/.julia/packages/NDTensors/lbVmG/src/blocksparse/blocksparsetensor.jl:97
 [14] _Allreduce(#unused#::Type{NDTensors.BlockSparse{Float64, Vector{Float64}, 4}}, sendbuf::ITensor, op::Function, comm::MPI.Comm)
    @ ITensorParallel ~/.julia/packages/ITensorParallel/63YWQ/src/mpi_projmposum.jl:44
 [15] _Allreduce(sendbuf::ITensor, op::Function, comm::MPI.Comm)
    @ ITensorParallel ~/.julia/packages/ITensorParallel/63YWQ/src/mpi_projmposum.jl:22
 [16] eigsolve(A::MPISum{ProjMPO}, x₀::ITensor, howmany::Int64, which::Symbol, alg::KrylovKit.Lanczos{KrylovKit.ModifiedGramSchmidt2, Float64})
    @ KrylovKit ~/.julia/packages/KrylovKit/kWdb6/src/eigsolve/lanczos.jl:11

The error changes slightly with the size of the MPS. The previous example used an 8x2 lattice; with a 4x2 lattice I get instead:

ERROR: LoadError: MPIError(15): MPI_ERR_TRUNCATE: message truncated
Stacktrace:
 [1] MPI_Allreduce
   @ ~/.julia/packages/MPI/tJjHF/src/api/generated_api.jl:288 [inlined]
 [2] Allreduce!(rbuf::MPI.RBuffer{Vector{Float64}, Vector{Float64}}, op::MPI.Op, comm::MPI.Comm)
   @ MPI ~/.julia/packages/MPI/tJjHF/src/collective.jl:653
 [3] _Allreduce(sendbuf::ITensor, op::Function, comm::MPI.Comm)
   @ ITensorParallel ~/.julia/packages/ITensorParallel/63YWQ/src/mpi_projmposum.jl:22
 [4] eigsolve(A::MPISum{ProjMPO}, x₀::ITensor, howmany::Int64, which::Symbol, alg::KrylovKit.Lanczos{KrylovKit.ModifiedGramSchmidt2, Float64})
   @ KrylovKit ~/.julia/packages/KrylovKit/kWdb6/src/eigsolve/lanczos.jl:11
in expression starting at /mnt/home/nbaldelli/parheis.jl:61

threaded_blocksparse and MPI each work smoothly on their own (with the other one deactivated), and they work together if QN conservation is turned off.
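For reference, a rough sketch of the kind of setup that hits this (a 1D chain instead of the actual 8x2 lattice, and without the ITensorParallel term distribution from the linked example; this is just to show which switches are on):

using MPI, ITensors, LinearAlgebra

MPI.Init()
BLAS.set_num_threads(1)                   # keep BLAS serial; Julia threads do the block-level work
ITensors.enable_threaded_blocksparse()    # block-sparse multithreading (the problematic setting)

N = 16
sites = siteinds("S=1/2", N; conserve_qns=true)   # QN conservation is what exposes the bug

os = OpSum()
for j in 1:(N - 1)
  os += 0.5, "S+", j, "S-", j + 1
  os += 0.5, "S-", j, "S+", j + 1
  os += "Sz", j, "Sz", j + 1
end
# In the real run the Hamiltonian terms are split over the MPI ranks via ITensorParallel
H = MPO(os, sites)

psi0 = randomMPS(sites, n -> isodd(n) ? "Up" : "Dn"; linkdims=10)
energy, psi = dmrg(H, psi0; nsweeps=5, maxdim=100, cutoff=1e-10)

MPI.Finalize()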

I ran the code with mpirun using the following batch script:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 8

module purge
module load slurm
module load openmpi/4
module load julia

# Build MPI.jl against the system OpenMPI loaded above
julia --project -e 'ENV["JULIA_MPI_BINARY"]="system"; using Pkg; Pkg.build("MPI"; verbose=true)'

# One MPI rank per node, 8 Julia threads per rank
mpirun julia -t 8 <name_file_here>
mtfishman commented 1 year ago

Thanks for the report. @b-kloss, this looks like an issue with the custom _Allreduce function we wrote.

b-kloss commented 1 year ago

_Allreduce is written so that the addition of contributions from different terms, living in different nonzero blocks, is handled by MPI (and +), and I doubt one can employ threading over blocks there. Outside of what _Allreduce does (i.e. applying the environment tensors to the site tensor), threaded_blocksparse should not pose a problem, though. So it could be a matter of handling which thread calls MPI and making sure that all threads are done before calling MPI. For the former, see https://juliaparallel.org/MPI.jl/v0.13/environment.html#MPI.ThreadLevel; for the latter, I am not sure how to acquire and release a lock at the threaded_blocksparse level.
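For the thread-level part, with a recent MPI.jl (0.20 or newer) one can request it directly at initialization; whether that alone helps here is untested:

using MPI

# Ask MPI for a threading level up front. :funneled means only the thread that
# called MPI.Init will make MPI calls, which matches how _Allreduce is invoked.
MPI.Init(; threadlevel=:funneled)

provided = MPI.Query_thread()   # check what the MPI library actually granted
@show provided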

mtfishman commented 1 year ago

That doesn't make much sense to me, since _Allreduce should be totally independent of the block-sparse multithreading, which is only used in tensor contraction. Basically it is organized as follows:

  1. Per MPI process, it performs tensor contractions within the eigensolver (either multithreaded over blocks or not).
  2. Reduce over the result of each process.

Somehow the code gives different results depending on whether step 1 uses multithreading or not. It makes me think that the multithreaded contract code is doing something strange and outputting different tensors compared to the non-multithreaded contract code, which then causes issues for _Allreduce.
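Schematically, with plain dense arrays standing in for the block-sparse ITensors (so this glosses over exactly the part where the bug lives):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Step 1: each rank applies its share of the Hamiltonian terms to the local
# copy of the site tensor (a dummy vector stands in for H_rank * v here).
v = ones(4)
Hv_local = MPI.Comm_rank(comm) .* v

# Step 2: sum the per-rank contributions. For block-sparse ITensors this is
# the custom _Allreduce, which also has to reconcile the block structure.
Hv = MPI.Allreduce(Hv_local, +, comm)

MPI.Finalize()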

b-kloss commented 1 year ago

Yes, I agree. I also tried disabling threaded_blocksparse within the call to product(P::MPIProjMPOSum, v) and re-enabling it afterwards, and it still gives the BoundsError.

1.) Do we assume that the blocks are of the maximally possible size (allowed by quantum numbers) in the _Allreduce?

2.) Does threaded_blocksparse fragment blocks that are compatible quantum-number-wise into subblocks depending on the number of threads?

mtfishman commented 1 year ago

1.) Do we assume that the blocks are of the maximally possible size (allowed by quantum numbers) in the _Allreduce?

No, most of the logic is analyzing the blocks that actually exist in the tensors and sharing that information across the processes.

2.) Does threaded_blocksparse fragment blocks that are compatible quantum-number-wise into subblocks depending on the number of threads?

No, it only threads over the list of blocks, not within blocks (ideally threading within blocks would get taken care of by BLAS, but currently BLAS and Julia threads don't compose with each other).
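Conceptually it is something like this (dummy names and dense matrices, not the actual NDTensors code):

using LinearAlgebra

BLAS.set_num_threads(1)   # BLAS and Julia threads don't compose, so keep BLAS serial

# Dummy stand-ins: each "block" is a pair of dense matrices to be multiplied.
blocks = [(rand(10, 10), rand(10, 10)) for _ in 1:8]
results = Vector{Matrix{Float64}}(undef, length(blocks))

# Thread over the list of blocks; each whole block is handled by one Julia task,
# and no individual block is split across threads.
Threads.@threads for i in eachindex(blocks)
  A, B = blocks[i]
  results[i] = A * B
end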

b-kloss commented 1 year ago

Here's a hack that fixes the issue for me: add this function redefinition to the top of your script:

# Always use the non-threaded logic for computing the output block offsets,
# even when threaded_blocksparse is enabled:
function ITensors.NDTensors.contract_blockoffsets(args...)
  #if using_threaded_blocksparse() && nthreads() > 1
  #  return _threaded_contract_blockoffsets(args...)
  #end
  return ITensors.NDTensors._contract_blockoffsets(args...)
end

I don't expect the hack to incur a big performance penalty from using the non-threaded version, since it seems to me that this function only does the preparatory logic for the contraction and not the contraction itself, but I may be wrong about that.

Update: Adding an extra sort statement in _threaded_contract_blockoffsets(args...) like

contraction_plan = reduce(vcat, contraction_plans)
# Sort the merged per-thread plans so the block ordering is deterministic
sort!(contraction_plan; by=last)

fixes the issue (without the above redefinition). I'll put in a PR for this issue in ITensors.jl after the weekend.

mtfishman commented 1 year ago

@nbaldelli could you test out the latest version of the package which includes #13? That should fix the issue you first reported.