RadioAstronomySoftwareGroup / pyuvsim

An ultra-high precision package for simulating radio interferometers in Python on compute clusters.
https://pyuvsim.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Pyuvsim profiling tests on Bridges-2 HPC #346

Open kbharatgehlot opened 3 years ago

kbharatgehlot commented 3 years ago

We ran several pyuvsim profiling tests on Bridges-2 to study the scalability of pyuvsim on HPC clusters. The profiling tests aimed to answer the following questions:

  1. Given a simulation of fixed volume (N_bltf = N_bl x N_t x N_f), how does the runtime scale with the number of sources (N_src) used in the simulation?
  2. For a simulation of a given volume and number of sources, how does varying the resource allocation affect the runtime?
  3. Can we predict the runtimes of simulations on the basis of these profiling tests?

In the first test, we used a simulation volume N_bltf = 211200 (N_bl = 55, N_t = 60, N_f = 64) and ran the simulation for a range of Nsrc values with different resource allocations. In the figure below, we plot the simulation runtimes as a function of Nsrc for different resources:
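For reference, the quoted simulation volume follows directly from the per-axis counts (a quick arithmetic check, not pyuvsim code):

```python
# Simulation "volume": total number of baseline-time-frequency samples.
N_bl, N_t, N_f = 55, 60, 64
N_bltf = N_bl * N_t * N_f
print(N_bltf)  # 211200
```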

*(figure: simulation runtime vs. Nsrc for different resource allocations)*

We see that using multiple cores per MPI processing unit (PU) is disfavored compared to having more PUs for a given amount of resources. When Nsrc is large, adding more cores to a PU provides no benefit. Additionally, we observe that doubling the number of PUs does not decrease the runtime linearly for smaller Nsrc; however, it approaches a linear regime as Nsrc becomes larger. For example, when the resources are doubled, the runtime decreases by only 25% for Nsrc=1.e4, whereas it decreases by ~43% for Nsrc=3.e5.
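One back-of-the-envelope way to read these numbers is as a parallel efficiency: a perfectly linear speedup from doubling the PUs would cut the runtime by 50%. The sketch below (the function name is ours; the 25% and 43% reductions come from the text above) converts the observed reductions into a fraction of that ideal:

```python
# Parallel efficiency of doubling the PUs, from an observed runtime reduction.
# Ideal strong scaling would halve the runtime (reduction = 0.5, efficiency = 1).
def doubling_efficiency(runtime_reduction):
    """Speedup achieved, as a fraction of the ideal 2x speedup."""
    speedup = 1.0 / (1.0 - runtime_reduction)
    return speedup / 2.0

print(round(doubling_efficiency(0.25), 2))  # Nsrc=1e4 -> 0.67 (far from linear)
print(round(doubling_efficiency(0.43), 2))  # Nsrc=3e5 -> 0.88 (near-linear)
```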

The second test aimed at understanding the runtime scalability with resources for a given simulation size (N_bltf=211200, Nsrc=3.e5). The following figure shows simulation runtimes for different numbers of PUs:

*(figure: simulation runtime vs. number of PUs for N_bltf = 211200, Nsrc = 3.e5)*

We fit a power law to the runtime vs. PU data and find a power-law index slightly greater than -1. This suggests that the runtime does not scale down exactly linearly with the number of PUs, but it is not far from a linear trend.
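A power-law fit of this kind amounts to a straight-line fit in log-log space. The sketch below uses made-up (runtime, PU) pairs standing in for the measured data, purely to illustrate the procedure:

```python
import numpy as np

# Hypothetical measurements: runtime (s) at each PU count, near-linear scaling.
n_pu = np.array([32, 64, 128, 256])
runtime = np.array([1000.0, 530.0, 285.0, 155.0])

# Fit runtime = A * n_pu**alpha as a line in log-log space;
# alpha is the power-law index (ideal linear scaling gives alpha = -1).
alpha, logA = np.polyfit(np.log(n_pu), np.log(runtime), 1)
print(round(alpha, 2))  # -0.9: shallower than -1, i.e. slightly sub-linear
```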

In the final test, we ran an additional simulation with N_bltf = 2534400 (N_bl = 55, N_t = 360, N_f = 128), which is 12 times larger than the previous simulation size. We recorded the runtime of this simulation for different Nsrc values. The figures below show the runtimes of the two simulation sizes for different Nsrc, and the ratio of the runtimes compared to the expected scaling based on N_bltf.

*(figures: runtimes of the two simulation sizes vs. Nsrc; ratio of runtimes with the expected N_bltf scaling)*

We find that the bigger simulation completes slightly faster than expected from its size (assuming the runtime scales linearly with N_bltf). Also, doubling the number of cores per PU does not affect the runtime significantly, as shown by the green marker in the top panel.
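The expected scaling factor used above is just the ratio of the two simulation volumes:

```python
# Expected runtime ratio under linear scaling with N_bltf.
small = 55 * 60 * 64    # N_bltf = 211200
big = 55 * 360 * 128    # N_bltf = 2534400
print(big // small)  # 12
```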

jpober commented 3 years ago

Let's get this into the documentation (e.g. as a memo).

jpober commented 8 months ago

@dannyjacobs Do you know what (if any) work from Bharat is documented as a memo or otherwise? Keeping this in our issue log isn't doing it much good, but it's really nice work and we'd love to get as complete a record of it as possible into our docs folder.

dannyjacobs commented 7 months ago

Yeah, this was done as a crash effort to help with our first XSEDE computing proposal (which was successful) and maybe to answer JOSS reviewer questions?
I don't think it was ever formally memofied, but it looks to me like a complete report as is. How about I paste it into a doc format for the long term? I don't think we yet have a consistent place to put such things for RAGS or in this repo specifically. Any thoughts on how to organize that? I could post it up on the loco memo log for linkability, and we could add a README to uvsim/docs/ with links to things like this.