UCL-RITS / rse-classwork-2020

Measuring performance and using numpy #185 #201

Open nuttamas opened 3 years ago

nuttamas commented 3 years ago

Running the given function without numpy

python -m timeit -n 100 -r 5 -s "from calc_pi import calculate_pi_timeit" "calculate_pi_timeit(10_000)()"

This gives:

100 loops, best of 5: 7.66 msec per loop

After switching to numpy functions, run:

python -m timeit -n 100 -r 5 -s "from calc_pi_np import calculate_pi_timeit" "calculate_pi_timeit(10_000)()"

This gives:

100 loops, best of 5: 2.23 msec per loop

numpy makes the code about 3.4 times faster (2.23 msec versus 7.66 msec per loop).
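The thread does not show calc_pi.py or calc_pi_np.py, so the following is only a minimal sketch of the kind of change being timed, assuming a Monte Carlo estimate of π. The function names `estimate_pi_python` and `estimate_pi_numpy` are hypothetical and are not the course's `calculate_pi_timeit`.

```python
import random

import numpy as np


def estimate_pi_python(n_points, seed=0):
    """Pure-Python Monte Carlo estimate: one interpreter-level loop per point."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the unit quarter circle
            inside += 1
    return 4.0 * inside / n_points


def estimate_pi_numpy(n_points, seed=0):
    """Same estimate, but the loop over points runs inside numpy."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, 2))  # n_points rows of (x, y) in [0, 1)
    inside = np.count_nonzero((points ** 2).sum(axis=1) <= 1.0)
    return 4.0 * inside / n_points


pi_python = estimate_pi_python(10_000)
pi_numpy = estimate_pi_numpy(10_000)
print(f"pure Python: {pi_python:.4f}, numpy: {pi_numpy:.4f}")
```

The numpy version replaces 10,000 Python-level iterations with a handful of array operations, which is the usual reason for a speed-up of this kind.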

nuttamas commented 3 years ago

Profiling code #186

Try the given code in the Jupyter notebook:

def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        # j >> i == j // 2 ** i (shift j bits i places to the right)
        # j ^ i -> bitwise exclusive or; j's bit doesn't change if i's = 0, changes to complement if i's = 1
        total += sum(L)
    return total

Then run %prun sum_of_lists(10_000_000)

This gives the result below:

14 function calls in 13.066 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     5   10.708    2.142   10.708    2.142  :4()
     5    1.422    0.284    1.422    0.284  {built-in method builtins.sum}
     1    0.761    0.761   12.891   12.891  :1(sum_of_lists)
     1    0.175    0.175   13.066   13.066  :1()
     1    0.000    0.000   13.066   13.066  {built-in method builtins.exec}
     1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

The lines that take the most time are the list comprehension L = [j ^ (j >> i) for j in range(N)] (10.708 s in total) and total += sum(L) (1.422 s).
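Outside a notebook, a similar report can be produced with the standard library's cProfile and pstats modules. This is a sketch rather than what was run in the thread, and it uses a much smaller N (an arbitrary choice) so that it finishes quickly.

```python
import cProfile
import io
import pstats


def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


# Collect profiling data only around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = sum_of_lists(100_000)
profiler.disable()

# Sort by internal time ("tottime"), as %prun does by default,
# and keep only the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```

On Python 3.12 and later, list comprehensions are inlined, so the comprehension may not appear as its own row and its time is then counted under sum_of_lists.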

nuttamas commented 3 years ago

Then try line_profiler:

%load_ext line_profiler
%lprun -f sum_of_lists sum_of_lists(10_000_000)

This gives:

Timer unit: 1e-06 s

Total time: 19.2983 s
File:
Function: sum_of_lists at line 1

Line # Hits Time Per Hit % Time Line Contents

1                                           def sum_of_lists(N):
2         1          2.0      2.0      0.0      total = 0
3         6         22.0      3.7      0.0      for i in range(5):
4         5   18459236.0 3691847.2     95.7          L = [j ^ (j >> i) for j in range(N)]
5                                                   # j >> i == j // 2 ** i (shift j bits i places to the right)
6                                                   # j ^ i -> bitwise exclusive or; j's bit doesn't change if i's = 0, changes to complement if i's = 1
7         5     839024.0 167804.8      4.3          total += sum(L)
8         1          1.0      1.0      0.0      return total

The line that takes the most time is the list comprehension, L = [j ^ (j >> i) for j in range(N)], at 95.7% of the total.
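Since the list comprehension dominates, one possible follow-up is to move that loop into numpy. The sketch below is not part of the exercise; `sum_of_lists_numpy` is a hypothetical name, and the dtype is fixed to 64-bit integers on the assumption that the totals must not overflow on platforms where the default integer is 32 bits.

```python
import numpy as np


def sum_of_lists(N):
    """Original pure-Python version, kept for comparison."""
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


def sum_of_lists_numpy(N):
    """Same computation with the inner loop over j done by numpy."""
    j = np.arange(N, dtype=np.int64)
    total = 0
    for i in range(5):
        # >> and ^ apply element-wise to the whole array at once.
        total += int((j ^ (j >> i)).sum())
    return total


small_n = 1_000
print(sum_of_lists(small_n) == sum_of_lists_numpy(small_n))
```

Comparing both versions on a small N first is a cheap way to confirm that the rewrite preserves the result before timing it on 10_000_000.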

nuttamas commented 3 years ago

Approximating π using Numba/Cython #195

Running calc_pi_numba.py gives:

Elapsed (with compilation) = 900 msec
pi = 3.1272 (with 10000)
Elapsed (after compilation) = 180 μsec
pi = 3.142 (with 10000)

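After compilation the code takes 180 μsec, roughly 40 times less than the 7.66 msec of the original. The first call is much slower (900 msec) because it includes the compilation.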
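The difference between the two elapsed times comes from just-in-time compilation on the first call. calc_pi_numba.py is not shown in the thread, so this is a minimal sketch assuming a Monte Carlo estimate; `estimate_pi` and `timed` are hypothetical names, and the import fallback is only there so the sketch still runs (without any speed-up) when Numba is not installed.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without Numba.
    def njit(func):
        return func


@njit
def estimate_pi(n_points):
    inside = 0
    for _ in range(n_points):
        x = np.random.random()
        y = np.random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_points


def timed(n_points):
    """Return the estimate and the wall-clock time of a single call."""
    start = time.perf_counter()
    value = estimate_pi(n_points)
    return value, time.perf_counter() - start


first_value, first_elapsed = timed(10_000)    # with Numba: includes compilation
second_value, second_elapsed = timed(10_000)  # with Numba: compiled code only
print(f"first call:  pi = {first_value:.4f} in {first_elapsed:.6f} s")
print(f"second call: pi = {second_value:.4f} in {second_elapsed:.6f} s")
```

Timing the second call separately is what makes the comparison with the timeit figures fair, since timeit reports steady-state loops.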

Using Cython gives:

5.48 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This takes longer than the Numba version (5.48 ms versus 180 μsec after compilation).

nuttamas commented 3 years ago

Improving performance using MPI #194

The code uses the Message Passing Interface (MPI) to accomplish this approximation of π.

Running python calc_pi.py -np 10_000_000 -n 1 -r 1 gives:

pi = 3.1411436 (with 10000000)
1 loops, best of 1: 7.97 sec per loop

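In the 10,000-point timings, numpy (2.23 msec) beats the pure Python and Cython versions, and only Numba after compilation (180 μsec) is faster. This MPI run used 10,000,000 points in a single process, so it is not directly comparable; MPI could work faster when the work is split across parallel tasks.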
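How the work might be split across MPI processes can be sketched with mpi4py. This is not the course's calc_pi.py: `points_for_rank` and `count_inside` are hypothetical helpers, the per-rank seeds are arbitrary, and the import fallback is an assumption added so the sketch also runs as a single process when mpi4py is unavailable.

```python
import numpy as np

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Assumption: behave like a single process when mpi4py is missing.
    MPI, comm, rank, size = None, None, 0, 1


def points_for_rank(n_points, rank, size):
    """Share n_points as evenly as possible; early ranks take the remainder."""
    base, extra = divmod(n_points, size)
    return base + (1 if rank < extra else 0)


def count_inside(n_local, seed):
    """Count random points that land inside the unit quarter circle."""
    rng = np.random.default_rng(seed)
    xy = rng.random((n_local, 2))
    return int(np.count_nonzero((xy ** 2).sum(axis=1) <= 1.0))


n_points = 100_000
# Each rank uses its own seed so the processes draw different points.
local_hits = count_inside(points_for_rank(n_points, rank, size), seed=1234 + rank)

if comm is not None:
    # Sum the per-rank counts; only rank 0 receives the total.
    total_hits = comm.reduce(local_hits, op=MPI.SUM, root=0)
else:
    total_hits = local_hits

if rank == 0:
    pi_estimate = 4.0 * total_hits / n_points
    print(f"pi = {pi_estimate:.5f} (with {n_points}) on {size} process(es)")
```

With mpi4py installed, a script like this would typically be launched through the MPI launcher, for example mpiexec -n 4 python followed by the script name, so that each process handles about a quarter of the points.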