disordered-photonics / celes

CELES: CUDA-accelerated electromagnetic scattering by large ensembles of spheres

For-Loops #28

Closed by arunoruto 2 years ago

arunoruto commented 3 years ago

While trying to understand the framework for my master's thesis, I noticed a frequent use of for-loops throughout the code. Such loops are usually quite inefficient and should be swapped out for matrix/vector operations. Is there an intent behind the loops, or were they simply the most direct way to solve the problem? I am asking because I would like to try improving the performance by replacing those for-loops that can be vectorized, but I want to make sure I am not doing it in vain.

lpattelli commented 3 years ago

Hi Mirza, thank you for your interest in CELES! In general, I think there are several places where the code could be improved, performance-wise. Vectorization of for loops can certainly be helpful in certain cases. MATLAB has a built-in profiler, which you could use to see how much time each part of the code takes for the typical configurations that you intend to study. That way you can focus your efforts on the most relevant bottlenecks and check whether they are actually associated with unoptimized for loops.
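As a rough way to read the profiler output, Amdahl's law bounds the overall gain you can expect from speeding up a single part of the code. The sketch below is written in Python rather than MATLAB purely for illustration; the function name `overall_speedup` and the example fractions are hypothetical, not measurements from CELES.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: overall speedup when a part that takes `fraction` of the
    total runtime is made `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical numbers: a loop that takes 5% of the runtime, made 10x faster,
# versus a routine that takes 80% of the runtime, made only 2x faster.
small_hotspot = overall_speedup(0.05, 10.0)
large_hotspot = overall_speedup(0.80, 2.0)
print(f"5% of runtime, 10x faster: {small_hotspot:.3f}x overall")
print(f"80% of runtime, 2x faster: {large_hotspot:.3f}x overall")
```

With these made-up numbers, a tenfold gain on a loop that accounts for 5% of the runtime improves the total by less than 5%, while a modest twofold gain on an 80% hotspot improves it by about 1.67x.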

For instance, when simulating configurations containing thousands of small particles (which is one of the main intended applications for CELES), I think that most of the runtime is spent in the iterative solver routine, performing matrix-vector multiplications, which are handled by the following (parallelized) CUDA kernel: src/scattering/coupling_matrix_multiply_CUDA.cu
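To illustrate why this step tends to dominate, here is a minimal sketch of a brute-force, matrix-free coupling product of the kind an iterative solver calls once per iteration. It is written in Python for readability and is not the CELES kernel: the scalar `exp(ikr)/r` coupling is an assumption standing in for the actual sphere-to-sphere translation operator, and the names `coupling` and `coupling_matvec` are hypothetical.

```python
import cmath
import math


def coupling(p, q, k):
    """Placeholder pair coupling exp(i*k*r)/r between positions p and q.
    Assumption: this stands in for the real translation operator."""
    r = math.dist(p, q)
    return cmath.exp(1j * k * r) / r


def coupling_matvec(positions, x, k):
    """Return y with y[i] = sum over j != i of coupling(i, j) * x[j].

    The coupling matrix is never stored: each entry is recomputed on the
    fly, so memory stays O(N) while the work is O(N^2) per call.
    """
    n = len(positions)
    y = [0j] * n
    for i in range(n):  # rows are independent, so they can run in parallel
        acc = 0j
        for j in range(n):
            if j != i:  # no self-coupling term
                acc += coupling(positions[i], positions[j], k) * x[j]
        y[i] = acc
    return y


# Three particles on a line with unit spacing and wavenumber k = pi.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
x = [1 + 0j, 1 + 0j, 1 + 0j]
y = coupling_matvec(positions, x, math.pi)
print(y)
```

Doubling the number of particles quadruples the work per call, and the solver repeats the call at every iteration, so this double loop is where the time goes for large ensembles.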

Different schemes to address this time-consuming step can be envisioned, but implementing them requires a different kind of effort. I believe Amos had already tested some alternative solutions, such as the fast multipole method or a rotation-translation-rotation scheme, but for this "superposition" T-matrix implementation they did not turn out to be more efficient than the current brute-force multiplication.