carpentries-incubator / hpc-intro

An Introduction to High Performance Computing
https://carpentries-incubator.github.io/hpc-intro/

Change parallel python example #349

Open bkmgit opened 3 years ago

bkmgit commented 3 years ago

Would like to change the parallel Python example to the code below:

import numpy as np
import sys
import datetime
from mpi4py import MPI

def inside_circle(total_count):
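    # sample total_count points uniformly in the unit square; a point lies
    # inside the quarter circle when x**2 + y**2 <= 1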
    x = np.random.uniform(size=total_count)
    y = np.random.uniform(size=total_count)
    radii = np.sqrt(x * x + y * y)
    count = len(radii[np.where(radii <= 1.0)])
    return count

def main():
    comm = MPI.COMM_WORLD
    n_cpus = comm.Get_size()
    rank = comm.Get_rank()
    n_samples = int(sys.argv[1])
    if rank == 0:
        # rank 0 takes any remainder so the per-rank samples add up to n_samples
        my_samples = n_samples - (n_cpus - 1) * (n_samples // n_cpus)
    else:
        my_samples = n_samples // n_cpus

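    # synchronise here so that all ranks start timing at the same moment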
    comm.Barrier()
    start_time = datetime.datetime.now()
    my_counts = inside_circle(my_samples)
    counts = comm.allreduce(my_counts, op=MPI.SUM)
    comm.Barrier()
    end_time = datetime.datetime.now()
    elapsed_time = (end_time - start_time).total_seconds()
    my_pi = 4.0 * counts / n_samples
    if rank == 0:
        print("Pi: {}, time: {} s".format(my_pi, elapsed_time))

if __name__ == "__main__":
    main()

Would also like to avoid using a large array for np.random.uniform(size=total_count), since this is not required. A loop in Python is slow, and the novice lesson does better optimization, but I do not know whether a discussion of vectorization is needed in the intro lesson.
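One way to keep memory bounded while still using vectorized NumPy calls rather than a per-sample Python loop would be to draw the samples in fixed-size chunks. This is only a rough sketch, not part of the proposed change (the name inside_circle_chunked and the chunk size are illustrative):

def inside_circle_chunked(total_count, chunk=100_000):
    # process the samples in fixed-size chunks so memory use is bounded by
    # the chunk size, while each chunk still uses vectorized NumPy calls
    count = 0
    remaining = total_count
    while remaining > 0:
        n = min(chunk, remaining)
        x = np.random.uniform(size=n)
        y = np.random.uniform(size=n)
        count += int(np.count_nonzero(x * x + y * y <= 1.0))
        remaining -= n
    return count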

Comments appreciated.

Reformatted by @tkphd using black, primarily for spacing.

tkphd commented 3 years ago

I think @reid-a made some excellent decisions in the lesson code, from the perspective of teaching how to use clusters.

Changes could be made, and perhaps in HPC Python we should revisit this example to teach some better, higher-performance practices. That's a pretty good teaching pattern: introduce a bad way to do something, then incrementally improve it to show what's possible.

tkphd commented 3 years ago

MPI Allreduce is a blocking operation, meaning that it includes barriers internally. The two calls to Barrier() in the proposed code can be removed.

bkmgit commented 3 years ago

Compiled codes will generally vectorize the loop, so significant memory is not needed. Interpreted codes will need to use some sort of library. Is this a discussion worth having early on? I think the memory discussion can be postponed, or some reason can be given for using large amounts of memory in interpreted codes.

tkphd commented 3 years ago

I think we should focus on the Serial vs. Parallel aspect, and stay away from discussing Interpreted vs Compiled languages, Vectorization, and memory footprint at this stage. The dedicated HPC Python lesson would be a much more appropriate place.

Compilers will unroll loops, but I'm not sure that's the same thing as vectorizing. Perhaps in the best case, with simple loop kernels and clear guards, the compiler will use a vector instruction, but again, there has to be a vector, and compilers tend to be conservative -- I think. I could be entirely mistaken, in which case enlightenment is welcome.

bkmgit commented 3 years ago

Ok, the memory material can be moved to the HPC Python lesson. Most C, Fortran, C++ compilers will vectorize such loops with optimizations turned on.

bkmgit commented 3 years ago

> MPI Allreduce is a blocking operation, meaning that it includes barriers internally. The two calls to Barrier() in the proposed code can be removed.

The first barrier is needed so that all processes start timing together. The second one can be removed because of the allreduce. Is it clearer to use an allreduce instead of a reduce?

tkphd commented 3 years ago

For clarity, unless a Barrier is absolutely necessary, I feel that they should both be removed. Since the time-consuming parts are (1) local computation and (2) allreduce, and because MPI Reduce and Allreduce are both blocking (i.e., call Barrier internally), the deviation in timing between the Barrier and non-Barrier versions ought to be negligible. We can certainly test to make sure.

Either Reduce or Allreduce is applicable here; perhaps, since not every rank needs the final answer, calling Reduce instead would be better. The function arguments are the same: mpi4py assumes rank 0 is the root if nothing is specified.
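As a sketch of how that swap could look in the proposed code, reusing the variable names from the example above, only the combining call and the final arithmetic change:

counts = comm.reduce(my_counts, op=MPI.SUM)  # root=0 by default in mpi4py
# only rank 0 receives the combined count; the other ranks get None
if rank == 0:
    my_pi = 4.0 * counts / n_samples
    print("Pi: {}, time: {} s".format(my_pi, elapsed_time))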

bkmgit commented 3 years ago

The first barrier is needed. We want to write portable code. MPI does not need to run on homogeneous processors, so one can come up with a situation where some processors finish much faster than others.

Can replace allreduce with reduce.

tkphd commented 3 years ago

Portable code is a worthwhile goal, but introducing too many corner-case details is going to overwhelm learners. The goal of this lesson is to introduce basic concepts in small, bite-sized increments. Just conceptualizing parallel Reduce is enough.

This timer is only reported on rank 0. If we really want to know how long the whole run takes, we should call Reduce on the timer data as well -- which would actually serve the purpose of the lesson, i.e., reinforcing that the variables defined in the function are local to each process, not shared. We could use either MPI_MAX to find the longest runtime, or MPI_SUM and divide by n_cpus to get the average.
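A sketch of what reducing the timer data could look like, reusing names from the proposed code (max_time is illustrative):

elapsed_time = (end_time - start_time).total_seconds()
# every rank measured its own elapsed_time; take the slowest as the runtime
max_time = comm.reduce(elapsed_time, op=MPI.MAX)  # root=0 by default
if rank == 0:
    print("Pi: {}, max time: {} s".format(my_pi, max_time))
# for the average instead, reduce with op=MPI.SUM and divide by n_cpus on rank 0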

bkmgit commented 3 years ago

MPI_MAX is good and portable. Should one be using a heterogeneous cluster, this would enable a good discussion on load balancing.

tkphd commented 3 years ago

Such a conversation would be out of scope for this introductory lesson.