LucioLau opened this issue 1 year ago
Hi @LucioLau ,
If you can, please give a test case for sample_generator(k, N, T, _alpha), with values of k, N, T, _alpha.
Add border cases: a small data set that finishes quickly, a medium one that gives reasonable execution times, and a large one that takes very long or doesn't finish on your machine at all.
The allowable data types for k, N, T, _alpha would also be very helpful, plus a sample of the Excel file you mentioned above.
cheers zafar
Hi @zafar-hussain ,
The sample generator itself has no issue; it is only slow because S = 10_000.
My problem comes from the simulation function in the first link: one iteration is fast, but I have to loop it 51 x 6 x 701 x 71 x 10,000 = 1.52 x 10^11 times. So I am thinking of submitting it to the GPU, but I have no clue whether the GPU can handle this. Comments on optimising the code are also welcome.
Cheers,
Lucio
P.S. The simulation idea comes from the paper Payment prioritisation and liquidity risk in collateralised interbank payment systems by De Caux et al.
Hi @LucioLau,
in Thread_sim,
p_delay += result[0]
np_delay += result[1]
total += result[2]
ip_delay += result[3]
inp_delay += result[4]
itotal += result[5]
you are updating these variables before creating them; initialize them to zero first.
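A minimal sketch of the fix, assuming the accumulators are plain floats summed over per-thread results (the `results` list here is a made-up stand-in for the real thread output):

```python
# Initialize the accumulators before the loop, so `+=` has a value to add to.
p_delay = np_delay = total = 0.0
ip_delay = inp_delay = itotal = 0.0

# Hypothetical per-thread result tuples standing in for the real ones.
results = [(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
           (0.5, 0.5, 0.5, 0.5, 0.5, 0.5)]
for result in results:
    p_delay += result[0]
    np_delay += result[1]
    total += result[2]
    ip_delay += result[3]
    inp_delay += result[4]
    itotal += result[5]
```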
in simulation, n_sample, time, bank, payee, pri, and amount are used as global variables, which will slow numba down.
Pass them as arguments, so that numba can use them locally within simulation's namespace; otherwise, on every use it has to go out to the global scope to access them.
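A hedged sketch of the difference (the array name and the reduction are made up for illustration; the fallback decorator just lets the snippet run where numba isn't installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # run as plain Python if numba is not installed
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        def deco(func):
            return func
        return deco

amount = np.ones(1000)

# Relies on a global: numba freezes the global at compile time, and the
# function cannot be reused with other data.
@njit(nogil=True)
def total_global():
    return amount.sum()

# Same data passed as an argument: numba types it locally and the
# function works for any array you pass in.
@njit(nogil=True)
def total_local(a):
    return a.sum()
```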
use @njit(error_model='numpy', nogil=True, fastmath=True).
Also declare the argument and return signatures; it will speed up computations quite a bit, as numba can generate fast vector code, especially for a contiguous array.
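For example, an eager signature for a function taking a C-contiguous float64 array (the function itself is a made-up stand-in; the fallback decorator only covers the case where numba is absent):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # run as plain Python if numba is not installed
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        def deco(func):
            return func
        return deco

# 'float64[::1]' declares a C-contiguous 1-D float64 array, which lets
# numba emit vector code; the flags match the suggestion above.
@njit('float64(float64[::1])', error_model='numpy', nogil=True, fastmath=True)
def mean_amount(amount):
    return amount.mean()
```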
The code is memory bound, not compute bound, so a GPU won't give you much of a performance boost.
The conditional branches (if-then-else) within the code will hinder vectorizing the computations; try using masking instead.
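A small sketch of the masking idea with made-up names: replace a per-element branch with a single `np.where` over a boolean mask.

```python
import numpy as np

# Branchy version: one if-test per element, hard to vectorize
def settle_loop(amount, liquidity):
    out = np.empty_like(amount)
    for i in range(amount.size):
        if amount[i] <= liquidity[i]:
            out[i] = amount[i]   # enough liquidity: pay in full
        else:
            out[i] = 0.0         # otherwise: delay the payment
    return out

# Masked version: one vectorized expression over the whole array
def settle_masked(amount, liquidity):
    return np.where(amount <= liquidity, amount, 0.0)
```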
Cheers zafar
Hi @zafar-hussain,
Thank you for your advice; I have applied your suggestions and uploaded the result as https://github.com/LucioLau/simulation/blob/main/Sim_CUDA2.ipynb.
I don't completely understand your last two points (sorry about that, I have a maths background rather than a computer science one).
> The code is memory bound, not computational bound, hence GPU won't give you much performance boost.

I am using an Intel 13700, which has 24 cores, and an RTX 4080, which has 9728 CUDA cores. I have 16GB of RAM on the CPU side, the same as the GPU memory listed on the Nvidia website. As a maths guy, I would guess it will be faster to run on the GPU (I don't know whether the memory has to be divided among the 9728 cores when I run them in parallel).
> The conditional branches (if-then-else) within the code, will hinder vectorizing the computations, try using masking

I have no clue what you mean here, sorry about that.
Other than the two points above, I believe I have applied everything you suggested, and I hope the code is now optimised. Many thanks.
Cheers,
Lucio
Nicely done @LucioLau,
The more we can reuse data, the less the GPU has to go out to slow memory (it is bound by memory bandwidth).
I am free from Tuesday, and will profile your code and help you take care of the above two points.
If you can, please add some tests for the code, so that I don't break anything.
cheers
zafar
Hi @zafar-hussain,
Thank you for the further explanation, I think I understand what you mean now.
For tests, what I usually do is run the simulation on a sample generated by the sample generator with any liquidity and buffer arrays of size N, e.g. np.ones(N) for both, and compare the result with the previous version.
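That comparison could be written down as a small regression check along these lines (the call signature of the simulation functions is an assumption; the real notebook's signatures may differ):

```python
import numpy as np

def check_against_previous(simulation, reference_simulation, sample, N):
    """Run the new and the previous version of the simulation on the
    same sample with unit liquidity and buffer arrays of size N, and
    check that the results agree."""
    liquidity = np.ones(N)
    buffer = np.ones(N)
    new = np.asarray(simulation(sample, liquidity, buffer))
    old = np.asarray(reference_simulation(sample, liquidity, buffer))
    assert np.allclose(new, old), "new version diverges from previous"
    return True
```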
Many thanks,
Lucio
Hi everyone,
I have recently been working on simulating payments between banks. My code, written in Python with Numba, is here: https://github.com/LucioLau/simulation/blob/LucioLau-patch-1/Sim_CUDA.ipynb. It reads an Excel file containing simulated payments between banks and carries out the simulation. The simulated data are generated in this other file: https://github.com/LucioLau/simulation/blob/65d3d456076e25537b6933a43034c2a7e4c2e739/Sample_generator_homogeneous_banks.ipynb. Simulating one sample is fast, but the number of iterations is very large. I would like to ask whether I can submit it to CUDA. As each iteration looks heavy in space complexity, I don't know whether a graphics card can handle it.
Any comments on improving the code are welcome!
Many thanks,
Lucio
P.S. I am new to GitHub; I wanted to attach my code directly in this space, but I have no clue how to do so.