mrava87 commented 3 weeks ago

Problem

There may be scenarios where the input array that is broadcasted across ranks has a size that exceeds 2GB. This line https://github.com/PyLops/pylops-mpi/blob/57b793e8ce4c150d90866d1e41c0bd9e88cae985/pylops_mpi/DistributedArray.py#L137

will generate an error of the kind in https://github.com/mpi4py/mpi4py/issues/119

Solution

One option is to look into what suggested in the linked issue. However, I am wondering if (and when) we really need to do broadcasting... for example, when we define a MPILinearOperator that wraps a pylops operator, we know that the input at each rank will be the same, and each rank will perform the same operation, so could we avoid all together to do this: https://github.com/PyLops/pylops-mpi/blob/57b793e8ce4c150d90866d1e41c0bd9e88cae985/pylops_mpi/LinearOperator.py#L86

This will also reduce a lot of (perhaps useless) communication time?

hongyx11 commented 3 weeks ago

There are two other solutions, one is plain MPI send and Recv as stated by Lisandro https://github.com/mpi4py/mpi4py/issues/119#issuecomment-945390731 Second, is mpi one sided API, https://mpi4py.readthedocs.io/en/stable/tutorial.html#one-sided-communication-rma I recently benchmark both collective API and one sided API, but i use all_to_allv and passive window in onesided API, It turns out after a certain threshold, collective API performance drops dramatically, and one sided API keeps similar. one sided API will be slower than collective API in general, but if perforance is not critical in some functions, it will be a good choice.

roofline_benchmark

mrava87 commented 3 weeks ago

Thanks @hongyx11, interesting 😀

Two questions:

I dug a bit deeper and realized that the 2GB problem is solved with MPI4.0 (and MPI4Py>=4.0.0 as they started supporting MPI4.0). In fact, in one cluster I have mpi4py==4.0.0 and I don’t experience this problem, in the other cluster a older version of mpi4py is installed and that’s why I started to far this issue. Now the question is, do you know if the 2GB limit has been pushed up to XGB or if now there is no limit as MPI4.0 handles communication differently internally?
to the one-sided MPI. I read a bit about but I’m not super familiar; would it be straightforward to use it in pylops-MPI or we would need to reconsider completely the design of the library?

mrava87 commented 3 weeks ago

Btw, @rohanbabbar04, my main question remain valid - there is nothing better than NOT communicate if this is not needed and I wonder if we are actually over-communicating when we have a partition.BROADCAST array? The only reason I see for doing this bcast operation is that we want to ensure that if the array became different across ranks the one of rank0 overwrites the others - perhaps we could create something new like partition.UNSAFEBROADCAST that does not do the bcast operation at the risk of having arrays different from each other. Users would switch to it after they have ensured their code works for large runs where you want to be as efficient as possible?

hongyx11 commented 3 weeks ago

Thanks @hongyx11, interesting 😀

Two questions:

I dug a bit deeper and realized that the 2GB problem is solved with MPI4.0 (and MPI4Py>=4.0.0 as they started supporting MPI4.0). In fact, in one cluster I have mpi4py==4.0.0 and I don’t experience this problem, in the other cluster a older version of mpi4py is installed and that’s why I started to far this issue. Now the question is, do you know if the 2GB limit has been pushed up to XGB or if now there is no limit as MPI4.0 handles communication differently internally?

to the one-sided MPI. I read a bit about but I’m not super familiar; would it be straightforward to use it in pylops-MPI or we would need to reconsider completely the design of the library?

It seems to me that MPI 4.0 has not solved the int problem. But my impression might be wrong. could you please provide more details about the experiment? Indeed, moving to one sided mpi will increase the complexity. So I think send and recv is more reasonable. but the best is of course no communication if it not needed. Yuxi

mrava87 commented 3 weeks ago

I can try to reproduce the example, but basically I was running the same code in Ibex and Shahee where, because of this broadcasting operation I am talking about, a large vector was sent across nodes. In one case things worked fine, in the other I started to get the error OverflowError: integer 2176051861 does not fit in 'int'... after mpi4py was updated to >=4.0.1 the problem was gone.

Reading the comment of Lisandro here https://github.com/mpi4py/mpi4py/issues/119

This happens when the total memory of the pickled object is larger than 2GB, the message count overflows a 32bit int. This is a well know limitation of MPI-1/2/3, hopefully fixed on MPI-4, but not yet available on a released Open MPI or MPICH (in mpi4py, there is already a branch with MPI-4 support).

I had assumed this problem is fixed in MPI-4? But maybe the allowed size is just now bigger but there is still a limit?

mrava87 commented 3 weeks ago

For sure, one thing we may want to look at is mpi4py.util.pkl5 as our library is bound to work with large vector 😸

I'll give it a try and report my findings

rohanbabbar04 commented 2 weeks ago

Hmm... Sorry for jumping in a little late.

I agree with @mrava87 that eliminating communication altogether should work. I'm encountering the 2GB error with mpi4py<4.0.0. Since we can get the correct result without using bcast (just sending and receiving), the best approach might be to assign the local arrays directly at each rank, as @mrava87 suggested. This would avoid communication overhead and could work effectively. Do you think we should pursue both the UNSAFE and SAFE approaches? I think we can test this by assigning the same local array at each rank for the broadcast. If we are able to bypass the 2GB error, then we should be good.

I have one question that just came to mind:

We use global sum reductions in some cases, which will require the incorporation of all processes. I think this might not be efficient in the case of One-Sided MPI.

One option would be to integrate it as an additional requirement without recommending it as the default. What do you all think?

mrava87 commented 2 weeks ago

Hi @rohanbabbar04, Thanks for the feedback.

I have not yet tried getting rid of the bcast line but I’m happy you also think it should work as long as we assign the same array in all ranks at the start and we are consistent through the run.

I wasn’t sure why in the first place we did it this way and I seemed to remember the idea was there we want to prevent users to accidentally overwrite an array that should be broadcasted in one rank and not do the same in the others… so the idea of having an unsafe and a safe option, but we could change and just have the current broadcast that does not enforce anything so we reduce communication (but give users more power to mess us 🙈)

I also don’t think we should switch to one-sides MPI… perhaps just add the possibility to use it if deemed a better fit in some scenarios- but I’m not familiar with it so I would not be sure when and where to use it

mrava87 commented 2 weeks ago

For sure, one thing we may want to look at is mpi4py.util.pkl5 as our library is bound to work with large vector 😸

I'll give it a try and report my findings

I tried this, but I seem to still have issues with large arrays, so probably the 2GB thing is not fully gone in MPI4.0

rohanbabbar04 commented 2 weeks ago

Hi @rohanbabbar04, Thanks for the feedback.

I have not yet tried getting rid of the bcast line but I’m happy you also think it should work as long as we assign the same array in all ranks at the start and we are consistent through the run.

I wasn’t sure why in the first place we did it this way and I seemed to remember the idea was there we want to prevent users to accidentally overwrite an array that should be broadcasted in one rank and not do the same in the others… so the idea of having an unsafe and a safe option, but we could change and just have the current broadcast that does not enforce anything so we reduce communication (but give users more power to mess us 🙈)

I also don’t think we should switch to one-sides MPI… perhaps just add the possibility to use it if deemed a better fit in some scenarios- but I’m not familiar with it so I would not be sure when and where to use it

Let me add a PR for this change for assigning it directly to the local array, I will test it

PyLops / pylops-mpi

Overflow in scatter/broadcast for arrays larger than 2GB #115

Problem

Solution