SENPAI-Molecular-Dynamics / SENPAI

Molecular dynamics simulation software
https://senpaimd.org
GNU General Public License v3.0

Parallel computing #9

Closed. Chelsea486MHz closed this issue 2 years ago.

Chelsea486MHz commented 5 years ago

I'm tired of running simulations overnight gimme some multithreading

raresraf commented 5 years ago

Hello.

I might be able to help with this issue from a programming perspective. Would you consider multithreading using OpenMP or pthreads?

Chelsea486MHz commented 5 years ago

Wow. Someone addressing an issue!

I have never, ever, used any kind of multithreading. My uneducated guess would be to use pthreads, but it would break compatibility with other platforms (or am I mistaken?)

To be totally honest, I forgot I even had this issue open. My focus switched to OpenCL because, while multithreading sounds cool, I doubt any consumer device can offer 1000+ threads...


The issue lies in universe_iterate:

universe_t *universe_iterate(universe_t *universe, const args_t *args)
{
  size_t i; /* Iterator */

  /* We update the position vector first, as part of the Velocity-Verlet integration */
  for (i=0; i<(universe->atom_nb); ++i)
    if (atom_update_pos(universe, args, i) == NULL)
      return (retstr(NULL, TEXT_UNIVERSE_ITERATE_FAILURE, __FILE__, __LINE__));

  /* We enforce the periodic boundary conditions */
  for (i=0; i<(universe->atom_nb); ++i)
    if (atom_enforce_pbc(universe, i) == NULL)
      return (retstr(NULL, TEXT_UNIVERSE_ITERATE_FAILURE, __FILE__, __LINE__));

  /* Update the force vectors */
  /* By numerically differentiating the potential energy... */
  if (args->numerical == MODE_NUMERICAL)
  {
    for (i=0; i<(universe->atom_nb); ++i)
      if (atom_update_frc_numerical(universe, i) == NULL)
        return (retstr(NULL, TEXT_UNIVERSE_ITERATE_FAILURE, __FILE__, __LINE__));
  }

  /* Or analytically solving for force */
  else
  {
    for (i=0; i<(universe->atom_nb); ++i)
      if (atom_update_frc_analytical(universe, i) == NULL)
        return (retstr(NULL, TEXT_UNIVERSE_ITERATE_FAILURE, __FILE__, __LINE__));
  }

  /* Update the acceleration vectors */
  for (i=0; i<(universe->atom_nb); ++i)
    if (atom_update_acc(universe, i) == NULL)
      return (retstr(NULL, TEXT_UNIVERSE_ITERATE_FAILURE, __FILE__, __LINE__));

  /* Update the speed vectors */
  for (i=0; i<(universe->atom_nb); ++i)
    if (atom_update_vel(universe, args, i) == NULL)
      return (retstr(NULL, TEXT_UNIVERSE_ITERATE_FAILURE, __FILE__, __LINE__));

  return (universe);
}

And more specifically this line (which comes up several times):

for (i=0; i<(universe->atom_nb); ++i)

SENPAI is iterating through the universe to perform the same operation on each atom. This is stupidly inefficient and this process should be made parallel.

I just kept postponing this issue since I have no experience with either multithreading or parallel computing.

raresraf commented 5 years ago

pthreads should work fine on almost all Linux-based platforms, as it specifies a standard set of interfaces. We would only need some minor modifications to the makefile to link the pthread library.
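
For instance, a standalone sketch of how a per-atom loop could be split across a fixed number of threads (nothing here is tied to SENPAI's actual code; per_atom_update is just a stand-in for the real work, and the makefile would need the -pthread flag):

#include <pthread.h>
#include <stddef.h>

#define N_THREADS 4

/* Stand-in for the real per-atom work (atom_update_pos, force updates, ...) */
static void per_atom_update(size_t i)
{
  (void)i;
}

typedef struct
{
  size_t start; /* first atom index handled by this thread */
  size_t end;   /* one past the last atom index */
} slice_t;

static void *worker(void *arg)
{
  const slice_t *s = (const slice_t *)arg;
  size_t i;

  for (i = s->start; i < s->end; ++i)
    per_atom_update(i);
  return (NULL);
}

/* Splits [0, atom_nb) into one contiguous slice per thread */
static void update_all_atoms(const size_t atom_nb)
{
  pthread_t thread[N_THREADS];
  slice_t slice[N_THREADS];
  size_t t;

  for (t = 0; t < N_THREADS; ++t)
  {
    slice[t].start = (t * atom_nb) / N_THREADS;
    slice[t].end = ((t + 1) * atom_nb) / N_THREADS;
    pthread_create(&thread[t], NULL, worker, &slice[t]);
  }
  for (t = 0; t < N_THREADS; ++t)
    pthread_join(thread[t], NULL);
}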

However, I would also like to see how an OpenMP approach compares. It will involve an extra step in the compilation procedure, but we will figure it out. Before that, though, I would like to do some profiling on the code.
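
Just to give an idea, the OpenMP route for one of the per-atom loops in universe_iterate would look roughly like this (an untested sketch; note that the early return cannot stay inside the parallel region, so failures have to be collected in a flag and checked afterwards):

/* Hypothetical sketch: the position update parallelized over atoms.
 * Requires building with -fopenmp; the loop index is private to each thread. */
int failed = 0;

#pragma omp parallel for
for (i=0; i<(universe->atom_nb); ++i)
{
  if (atom_update_pos(universe, args, i) == NULL)
  {
    #pragma omp atomic write
    failed = 1;
  }
}

if (failed)
  return (retstr(NULL, TEXT_UNIVERSE_ITERATE_FAILURE, __FILE__, __LINE__));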

But in order to have conclusive results, I need a way to run the algorithm deterministically (i.e. for a given input, always get the same output). How can I run your software multiple times and get similar running times and the same output?

When I tried running time ./senpai.bin --in examples/DES.mol --out render.xyz --time 0.01, it took around 8 hours.

Is there a random number generator somewhere in the code? Can we temporarily suspend the non-deterministic behaviour in order to gather some valuable profiling data?
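
If the randomness comes from the C standard library, a fixed seed might already be enough for profiling purposes. A hypothetical sketch (I have not yet checked where, or whether, SENPAI seeds its generator):

#include <stdlib.h>
#include <time.h>

/* Hypothetical: a constant seed makes every run draw the same pseudo-random
 * sequence, while a time-based seed gives a different run every time. */
static void seed_rng(const int deterministic)
{
  if (deterministic)
    srand(1);
  else
    srand((unsigned int)time(NULL));
}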

Chelsea486MHz commented 5 years ago

Ouch, 8 hours is mean.

The mechanics are fully deterministic, but the simulations cannot be reproduced identically without some (light) changes. The way SENPAI works is that it loads a system from a .mol file and places copies of that system at random locations within the universe. The reference system is loaded in universe_load, and the copies are inserted by universe_populate.

A quick fix that would allow for a simple simulation to be fully deterministic would be to have universe_populate load the first copy of the reference system at the offset (0,0,0) instead of some random offset.

With such a fix, a simulation such as senpai --in examples/DES.mol --out render.xyz --time 0.001 --reduce_potential 100000 --density 0.1 should always produce the same render.xyz file.

Note the --reduce_potential 1E6 --density 1E-3. The first flag sets the minimum potential before simulation to some obscene value. This is because SENPAI will apply random transformations to the system before simulating it if its potential energy is higher than the provided threshold (which defaults to 10 pJ, IIRC). The density flag is there to artificially inflate the universe size so as to give the loaded system some room to evolve correctly.

Another issue worth pointing out is that DES.mol is an attempt of mine to simulate a deep eutectic solvent. It was work I conducted while interning at a lab this summer, and while it reflects what SENPAI is intended to do, such complex simulations are far out of reach for now, for the following reasons:


To sum up, if full determinism for development purposes is required:


Let me know if you have more questions. Having zero formal programming experience due to my curriculum, I do understand that my methods, Git usage (cough pushing to master cough), and coding style might not make it easy to approach issues :)

raresraf commented 5 years ago

If Oscars were given for a job well done, I’d nominate you!

I did the steps you suggested and now we have a deterministic system, so I can investigate further which multithreading solution fits best. Moreover, I can check that my changes do not alter the output of the current application in any way, ensuring the program stays correct regardless of the parallel execution.

All this setup looks extremely good! I will keep you updated. From what I see, the workload is embarrassingly parallel, so it is just a matter of weeks until SENPAI supports multithreaded execution.

Luthaf commented 5 years ago

FYI, there are three main strategies to parallelize this kind of software: shared memory parallelism (i.e. threads, OpenMP); distributed memory parallelism (i.e. MPI, running the code on multiple computers connected by high speed network); and accelerators (GPU with CUDA or OpenCL, Xeon Phi, etc.)

All of these strategies can be used together, and the biggest codes in the community use all of them. In order of complexity, I would rank them as shared memory < distributed memory == accelerators, since using distributed memory or accelerators requires changes to the core algorithms used (domain decomposition, etc.)


Another thing to consider is what to parallelize: in most MD codes, the most expensive steps are the computation of the forces acting on all atoms and the application of the periodic boundary conditions. The actual integration step only takes a fraction of this time. I don't know how much this applies to this project =)

raresraf commented 5 years ago

Indeed, you can choose to parallelize at different levels depending on the architecture you aim to execute the code on. While using accelerators (like GPUs) can result in an outstanding speedup, new hardware would be explicitly needed to support execution. Thus, the application is likely to become reliant on the architecture of the compute system.

Distributed memory parallelism (using the Message Passing Interface, for instance, to handle communication) is likely to be more efficient when running on multiple systems.

If the application is meant to run on ordinary personal computers, where the number of cores is small, shared memory might be a better approach.


We may want to try both the shared memory and the distributed memory approaches. What to parallelize is again an important question to settle. I think the best plan is to start with a few profiling runs. I expect that next week I will be able to come back with some preliminary results.

Chelsea486MHz commented 5 years ago

MPI/acceleration is far beyond my skills, I'm afraid, so I won't be commenting on that.

Another thing to consider is what to parallelize: in most MD code, the most expensive step is the computation of the forces acting on all atoms, and the use of periodic boundary condition.

SENPAI's bottleneck indeed lies in the horrible way it handles periodic boundary conditions. It's inefficient beyond any reasoning. I implemented those at the end of an internship this summer, when the heatwave and failed UV-vis stacked up to a massive burnout. I never got around to trying to convince myself to waste more time on them.

If the application is meant to run on ordinary personal computers

I actually have an old ProLiant DL380 waiting to run the thing. If we do get around to trying out MPI, I might invest in a few Raspberry Pis.

I'm really busy right now: I'm prototyping a 3D-printable FTIR system and learning good Git practice to avoid polluting SENPAI's repository now that I have contributions. As such, I'll wait for developments regarding parallelism before pushing anything; if anything happens, it's going to be cleanup.

What to parallelize is again an important question to settle.

My inexperienced bets are on force_total and potential_total.

By the way, I lack experience with parallelism, so don't expect me to shine a light on any path, but would it be more efficient to parallelise within parallel threads? If that makes any sense. What I mean is something along the lines of updating each atom's force in parallel, with each of those threads again parallelising its own computations.
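
Something like this, I imagine (a very rough fragment, not working code):

/* Outer loop over the atoms in parallel, with each thread again splitting
   its own pairwise work. I gather OpenMP calls this "nested parallelism"
   and keeps it disabled unless you explicitly enable it. */
size_t i;

#pragma omp parallel for
for (i=0; i<(universe->atom_nb); ++i)
{
  size_t j;

  #pragma omp parallel for /* nested region */
  for (j=0; j<(universe->atom_nb); ++j)
  {
    if (j != i)
    {
      /* accumulate the force of atom j on atom i */
    }
  }
}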

I have no choice but to put blind trust in contributions when it comes to this issue/feature.

raresraf commented 4 years ago

Some preliminary results: 1st run:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 76.33      8.55     8.55   380038     0.02     0.02  atom_enforce_pbc
  6.25      9.25     0.70 28985796     0.00     0.00  vec3d_mag
  5.72      9.89     0.64 16265488     0.00     0.00  vec3d_unit
  2.50     10.17     0.28   380000     0.00     0.01  force_total
  2.32     10.43     0.26 43048608     0.00     0.00  vec3d_sub
[...]

Call graph

[1]     99.9    0.01   11.18                 universe_simulate [1]
                0.00   11.17   10000/10000       universe_iterate [2]
                0.01    0.00   10000/10000       universe_printstate [20]
                0.00    0.00       1/1           universe_clean [44]
-----------------------------------------------
                0.00   11.17   10000/10000       universe_simulate [1]
[2]     99.7    0.00   11.17   10000         universe_iterate [2]
                8.55    0.00  380000/380038      atom_enforce_pbc [3]
                0.00    2.60  380000/380000      atom_update_frc_analytical [4]
                0.00    0.02  380000/380000      atom_update_acc [18]
                0.00    0.00  380000/380000      atom_update_pos [22]
                0.00    0.00  380000/380000      atom_update_vel [23]
-----------------------------------------------
                0.00    0.00      38/380038      universe_init [24]
                8.55    0.00  380000/380038      universe_iterate [2]
[3]     76.3    8.55    0.00  380038         atom_enforce_pbc [3]
-----------------------------------------------
                0.00    2.60  380000/380000      universe_iterate [2]
[4]     23.2    0.00    2.60  380000         atom_update_frc_analytical [4]
                0.28    2.32  380000/380000      force_total [5]
-----------------------------------------------
raresraf commented 4 years ago

2nd run

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 79.89     86.74    86.74  3800076     0.02     0.02  atom_enforce_pbc
  5.48     92.68     5.95 289808694     0.00     0.00  vec3d_mag
  5.05     98.16     5.48 162607114     0.00     0.00  vec3d_unit
  2.74    101.13     2.97  3800038     0.00     0.01  force_total
[...]

Call graph

granularity: each sample hit covers 2 byte(s) for 0.01% of 108.62 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     99.9    0.00  108.52                 universe_simulate [1]
                0.02  108.42  100001/100001      universe_iterate [2]
                0.07    0.01  100001/100001      universe_printstate [21]
                0.00    0.00       1/1           universe_clean [44]
-----------------------------------------------
                0.02  108.42  100001/100001      universe_simulate [1]
[2]     99.8    0.02  108.42  100001         universe_iterate [2]
               86.74    0.00 3800038/3800076     atom_enforce_pbc [3]
                0.03   21.42 3800038/3800038     atom_update_frc_analytical [4]
                0.01    0.14 3800038/3800038     atom_update_acc [17]
                0.03    0.03 3800038/3800038     atom_update_pos [22]
                0.03    0.01 3800038/3800038     atom_update_vel [23]
-----------------------------------------------
                0.00    0.00      38/3800076     universe_init [27]
               86.74    0.00 3800038/3800076     universe_iterate [2]
[3]     79.9   86.74    0.00 3800076         atom_enforce_pbc [3]
-----------------------------------------------
                0.03   21.42 3800038/3800038     universe_iterate [2]
[4]     19.7    0.03   21.42 3800038         atom_update_frc_analytical [4]
                2.97   18.44 3800038/3800038     force_total [5]
-----------------------------------------------
                2.97   18.44 3800038/3800038     atom_update_frc_analytical [4]
[5]     19.7    2.97   18.44 3800038         force_total [5]
                1.12    7.95 133801338/133801338     force_electrostatic [6]
                1.86    3.22 133801338/133801338     force_lennardjones [9]
                0.38    1.47 6800068/6800068     force_angle [10]
                0.83    0.00 140601406/140604218     atom_is_bonded [12]
                0.62    0.00 281202812/303603074     vec3d_add [13]
                0.50    0.00 140601406/430412912     vec3d_sub [11]
                0.09    0.41 6800068/6800068     force_bond [14]
[...]
Chelsea486MHz commented 4 years ago

I am bewildered.

More than 75% of the simulation time is just dealing with the periodic boundary conditions.

I'll open an issue just for this one. Parallel computing or not, this needs to be dealt with.

raresraf commented 4 years ago

Random question:

From what I see, the code currently does something like

while (universe->time < args->max_time)
    for (i=0; i<(universe->atom_nb); ++i)
         [computations] 

What do you think about changing the order of the loops, so that the code becomes:

for (i=0; i<(universe->atom_nb); ++i)
    while (universe->time < args->max_time)
         [computations] 

This way, we might be able to parallelize the problem more easily, as args->max_time >> universe->atom_nb (am I right?)

Are there any data dependencies between different iteration steps and different atoms?

I am bewildered.

More than 75% of the simulation time is just dealing with the periodic boundary conditions.

I'll open an issue just for this one. Parallel computing or not, this needs to be dealt with.

FYI: there is a well-explained tutorial on how to use gprof: https://www.thegeekstuff.com/2012/08/gprof-tutorial

Chelsea486MHz commented 4 years ago

Changing the order of the loops is impossible. The force applied to a single atom is a function of its distance to all other atoms, which changes constantly. An iteration is directly dependent on the last one.

This is one of the many reasons why the N-body problem is such a PITA. The coordinates of each atom cannot be expressed as a function of time, and numerical integration is mandatory.
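
For reference, the Velocity-Verlet step for atom i looks roughly like this:

x_i(t + dt) = x_i(t) + v_i(t) * dt + (1/2) * a_i(t) * dt^2
v_i(t + dt) = v_i(t) + (1/2) * (a_i(t) + a_i(t + dt)) * dt

with a_i = F_i / m_i, where F_i depends on the positions of every other atom at that same step. So a single atom cannot be advanced through time on its own; the whole universe has to move forward one step at a time.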

raresraf commented 4 years ago

Changing the order of the loops is impossible. The force applied to a single atom is a function of its distance to all other atoms, which changes constantly. An iteration is directly dependent on the last one.

This is one of the many reasons why the N-body problem is such a PITA. The coordinates of each atom cannot be expressed as a function of time, and numerical integration is mandatory.

Got it! 👍

Btw, I re-ran the profiling with a new input. The previous profiling results were generated by running ./senpai.bin --in examples/DES.mol --out render.xyz --density 0.001 --time 0.01

Now I tried to increase the number of atoms. Command: ./senpai.bin --in examples/DES.mol --out render.xyz --density 0.001 --copy 5 --time 0.001

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 25.01      0.35     0.35   190000     0.00     0.01  force_total
 16.79      0.59     0.24 35570000     0.00     0.00  force_lennardjones
 12.15      0.76     0.17 35570000     0.00     0.00  force_electrostatic
 10.00      0.90     0.14 108376320     0.00     0.00  vec3d_sub
 10.00      1.04     0.14 72394500     0.00     0.00  vec3d_mag
  7.14      1.14     0.10 37152960     0.00     0.00  vec3d_unit
  6.43      1.23     0.09 72940190     0.00     0.00  vec3d_add
  5.72      1.31     0.08 35981820     0.00     0.00  atom_is_bonded
  3.57      1.36     0.05 37117605     0.00     0.00  vec3d_mul
[...]
 0.00      1.40     0.00   190190     0.00     0.00  atom_enforce_pbc

Seems that periodic boundary conditions are not that terrible for performance.

raresraf commented 4 years ago

Please check:

raresraf commented 4 years ago

[attachment: time]

raresraf commented 4 years ago

[attachment: speedup]

raresraf commented 4 years ago

[attachment: dat_eff]

Chelsea486MHz commented 4 years ago

DAMN. Waking up to this on a Christmas morning!

This is fantastic work - I'll be reviewing it soon :)

Pesc0 commented 3 years ago

Hey, I just looked at your repo, supercool project indeed! I saw that one of your cool points was C99 compliance and portability, but if you want proper parallel computing I'm afraid you'll have to give up some of that...

Also, while multithreading would definitely be an improvement, I think you would see the biggest gain by using a GPU; these are quite common in desktop PCs nowadays (laptops as well), and while that means you won't be able to run the program on a roomba, you will still have decent portability in my opinion. Plus, realistically, if you want to run serious computations you will do that on a fairly decent computer, unless you want to wait years for the simulation to finish.

I'm suggesting looking into OpenCL or CUDA for GPU acceleration. Let me know what you think, maybe I can help out a bit ;)

Chelsea486MHz commented 3 years ago

Oof, I haven't touched parallel computing since 2019. I'm more than rusty; I wouldn't consider myself able to intervene in a beneficial manner.

I'm also thinking about using a GPU, just like all the big simulators do. There is no way SENPAI will be used in actual production environments without GPU computing.

However, SENPAI is FOSS and I just can't allow it to rely on CUDA and nVidia's desperate pushes to consume the entire HPC market. If it comes down to the OpenCL/CUDA choice, OpenCL it is.

Chelsea486MHz commented 3 years ago

Just a quick thought:

How about getting distributed computing (MPI support, so that it can run on entire clusters) going first, and then working on GPU computing?

The end game would be to have both: distributed computing on machines full of GPUs.

Pesc0 commented 3 years ago

Well, I doubt you'll have a cluster at your disposal to mess around with anytime soon, while it's much more likely you'll have a GPU (or more than one, see mining rigs) readily available to start computing right away. In my opinion you would get the most benefit by implementing GPU support first, but that's just my opinion. Anyway, a good starting point could be to start refactoring the code to support a "solver" module, which can then be implemented differently based on the parallel computing architecture you end up choosing. What do you think about that?

Chelsea486MHz commented 3 years ago

I actually have a cluster in my living room, a 12U rack cabinet full of servers waiting to be used aha. I have no GPU, though.

I bought the servers for this exact purpose, wanted to get GPUs too, but life happened. And the GPU shortage happened.

Chelsea486MHz commented 3 years ago

I don't mind sacrificing portability to have SENPAI run on Linux x86_64 exclusively if it means getting MPI+OpenCL support. However, I'd still like SENPAI to run on personal computers. SENPAI was a useful learning tool for me, and I'd like others to enjoy it as well. Maybe it should have a way to fall back to more classic "non-distributed, CPU-only" computing. Maybe with multithreading.

How would you describe the solver module, in a hypothetical MPI+OpenCL scenario?

Pesc0 commented 3 years ago

No joke! Well, let's go for that then! Although I have to warn you I have no MPI experience... and no way of testing things out myself, since I have only one computer.

As I said, I think the first thing to do is prepare the current codebase to be parallelized, rather sooner than later, to avoid future complications when the simulations get more complex. That would mean splitting up the code where the computations happen so that the "computation part" can be implemented in different ways, while still maintaining the same program functionality.

Basically what I'm thinking is: the code gets split. We have a "job", or a set of data that needs to be processed, and we have a solver (or multiple solvers, but you use only one at a time) that will process that job. We can have a "CPU only" solver, an OpenCL solver, and an MPI solver, and you choose which one to use. As an example, after the refactoring the CPU-only solver would be functionally the same as the program is now: just straight for loops one after the other. The other ones would each have their own parallelized implementation.

Finally, I have to remind you that parallelization is not a magic thing that goes brrr superfast. It can be applied only to specific problems which can benefit from parallel (read: independent) computations. And you have to think about how to synchronize the whole thing. For this reason I don't know if and how it would be possible to use MPI and OpenCL together. Heck, even distributing the work across multiple GPUs might be a problem.
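
To make the idea a bit more concrete, here is roughly the kind of interface I have in mind (just a sketch, every name below is made up and nothing like it exists in SENPAI today):

/* A "job" is processed by whichever solver backend the user picked. */
typedef struct solver_s
{
  const char *name;

  /* set up whatever the backend needs (thread pool, OpenCL context, MPI ranks, ...) */
  int (*init)(universe_t *universe, const args_t *args);

  /* advance the universe by one time step */
  int (*iterate)(universe_t *universe, const args_t *args);

  /* release the backend's resources */
  void (*clean)(universe_t *universe);
} solver_t;

/* Each backend fills in the same table; the main loop only talks to solver_t. */
extern const solver_t solver_cpu;    /* straight loops, what the code does today */
extern const solver_t solver_opencl; /* GPU implementation */
extern const solver_t solver_mpi;    /* distributed implementation */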