Closed mtfishman closed 1 year ago
@b-kloss I believe I got everything working with Distributed.jl. Not sure what was going on before, maybe something silly with not keeping the indices or tensors updated properly across processes (i.e. I may not have been broadcasting the results of updating the environments properly).
In the end, all that's needed is the new `distributedsum.jl` file, which overloads some basic operations on remote objects (`Future` objects from Distributed.jl) and makes sure those operations are performed remotely on the worker/process where the objects currently live. It uses macro calls like `@spawnat` and `@fetchfrom`, where you can specify that the worker/process performing the operation should be the one where the term of the sum currently resides, via `term.where`.
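The pattern described above can be sketched roughly as follows. This is an illustrative example only, not the actual `distributedsum.jl` code: the names `DistributedSumTerm`, `remote_map`, and `remote_value` are hypothetical, standing in for the real overloads.

```julia
using Distributed

# Hypothetical sketch: each term of the sum lives on a remote worker as a
# Future, and `where` records which worker holds it. These names are
# illustrative, not the actual ITensorParallel.jl API.
struct DistributedSumTerm{T}
    term::Future  # remote reference to the term's data
    where::Int    # id of the worker on which the term resides
end

# Apply `f` to the term on the worker where it lives, returning a new
# remote reference (no data is moved back to the calling process).
function remote_map(f, t::DistributedSumTerm{T}) where {T}
    DistributedSumTerm{T}(@spawnat(t.where, f(fetch(t.term))), t.where)
end

# Fetch the term's current value from the worker holding it.
remote_value(t::DistributedSumTerm) = @fetchfrom t.where fetch(t.term)
```

The point of `@spawnat t.where ...` is that the operation runs on the process already holding the data, so only the small remote reference travels between processes.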
I'm curious how the performance compares to the MPI.jl implementation (I did some simple tests at smaller bond dimensions, but nothing systematic). I'm hoping we can just use Distributed.jl going forward, since that would simplify things a lot: it's easier to test and develop, and it lets us write simpler code that is generic across parallel and sequential execution, where the parallelism is hidden farther down and handled through dispatch.
@emstoudenmire this should give you some idea of how the `AbstractSum` interface in https://github.com/ITensor/ITensors.jl/pull/1046 will be used.
`MPISum` (EDIT: renamed `MPISumTerm`) and fix numerical issues of MPS tensors getting out of sync across processes by overloading `ITensors.position!` and `ITensors.orthogonalize!` to broadcast the new MPS tensors from one process to the rest. This relies on ITensors v0.3.27.