Closed JohanCarlsen closed 1 year ago
Hi @JohanCarlsen!
> Even with the Coulomb interactions, it takes forever.

Presumably you mean even *without* the Coulomb interactions?
Here are some things to consider:
Are you writing more output than you need? Note that for Problem 9 you don't need to store all the positions and velocities for the 100 particles. Trying to do so can take a significant amount of time (and disk space), depending on how you do it.
What optimisation flags are you using when compiling your code? In my case I get a 5x speedup from adding the flag `-O2`. Going to `-O3` might also help, though one should be a bit careful with too aggressive compiler optimisation.
If none of the above is sufficient, you should try to identify which part of the code is the bottleneck and see if you can write that in a way that is computationally more efficient.
For reference, when I run my 100-particle simulation for 500 microseconds with 40,000 timesteps, it takes ~10 seconds. This reduces to ~2 seconds when I add the `-O2` optimisation flag.
I'm not storing the positions (I think?). I only print the number of particles remaining after a complete simulation. What is a compiler optimization flag? Does that only speed up compiling, or also the run itself?
Also, I have built my code to solve for all of the particles at once, by creating a temporary matrix in the `evolve_RK4` method. This may also demand some memory, I suppose?
The problem was that I couldn't find out how to write the code according to the index-based examples in the project description.
You can take a look at the code here: https://github.com/Madssb/FYS3150_H22_4_pers
Navigate to the branch `johan` and then to `Project_3`. Problem 9 is solved in `main.cpp` with `penning_trap_2.cpp` as the source file.
> What is a compiler optimization flag? Does that only speed up compiling, or also the run itself?
We've briefly touched on this in the lectures, but see here for the details: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html In short, the compiler can perform various actions on the code "under the hood" when it's compiling, in order to reduce the run time. (The compilation itself typically takes longer.) The optimization flag determines how aggressive the optimizations are that we allow the compiler to perform. This matters, because many optimizations are e.g. based on making mathematical approximations that are typically OK, but which sometimes might cause problems or large errors. But as a rule of thumb, using `-O2` is almost always OK.
I haven't looked at the code in too much detail, but here are some observations that might be relevant:
About this part: https://github.com/Madssb/FYS3150_H22_4_pers/blob/713e7f3f1bd232b86395fd3bb5d1843e701191fe/Project_3/src/penning_trap_2.cpp#L207-L219
If you want to delete elements from a vector you are looping through, it's typically best to loop through it backwards. I think the current loop will not do what you expect it to, since the indices of all later elements will shift by -1 every time you delete a particle. Say you have particles `[a,b,c]`. If you now delete particle `a` when `i=0`, then the new vector of particles is `[b,c]`. But the loop then continues to index `i=1`, which is particle `c`. So particle `b` was never checked.
If-tests inside for-loops can slow code down. Often they are unavoidable, but I'd suggest trying to completely remove the if-test above for deleting particles. As long as the code for the E and B forces has a check that sets the force to 0 when r > d, it's OK to just keep all particles throughout the simulation. (If they escape the trap, they should just continue in a straight line, since all forces are 0.) Then, after the simulation, you can do a single check of how many particles have r < d.
In your code you do a lot of reading and writing of 3x100 matrices, and pass them around as function arguments. I haven't spotted a specific part that looks very slow, but in sum I think this can add up to make the code slow. One simple thing you can try is to switch from "pass by value" to "pass by const reference" when sending matrices as function arguments. (We discussed this in some lecture a while ago.) E.g. this function https://github.com/Madssb/FYS3150_H22_4_pers/blob/713e7f3f1bd232b86395fd3bb5d1843e701191fe/Project_3/src/penning_trap_2.cpp#L119 can be changed to

```cpp
arma::mat PenningTrap::total_force_external(double t, const arma::mat& R, const arma::mat& V)
```
Also, with a design such as that hinted at by the code snippet, there is no need to have the `PenningTrap` member functions communicate the particle positions and velocities through function arguments at all: all `PenningTrap` member functions can directly access any particle positions and velocities from the `PenningTrap::particles` vector. The FE / RK4 evolve functions would then need to start by making a temporary copy of all the current particles. That allows you to loop over those local copies and update their positions and velocities one by one without "disturbing" the current state of the `PenningTrap`. (Thus, the different forces can safely use the positions and velocities from `PenningTrap::particles` directly, since this info always reflects the current state of the trap.) Once you have computed all the new particle positions and velocities, you can write these to the particles in `PenningTrap::particles`.
So conceptually something like this:

```cpp
// Forward Euler
void PenningTrap::evolve_forward_Euler(double t, double dt)
{
  // Temp copy of particles
  std::vector<Particle> particles_new = particles;

  //
  // [Insert your code that loops through particles_new
  //  and updates the position and velocity of each Particle instance
  //  in that vector.]
  //

  // Loop over all particles in the trap to set the new positions and velocities
  for (std::size_t i = 0; i < particles.size(); i++)
  {
    particles[i].r = particles_new[i].r;
    particles[i].v = particles_new[i].v;
  }
}
```
Hope some of this helps!
I tried to pass by const reference, but that did not help much. Or at least, I don't think so - I do not possess that kind of patience. I will try to rewrite the entire code for Problem 9 according to the code snippets.
Assuming the two ways of writing this code (the snippets way and the way I tried) are implemented correctly, would one be faster than the other? If so, why? Due to memory needed, or another reason?
> I tried to pass by const reference, but that did not help much.
Yeah, I didn't expect it would help too much, but it was worth a try.
> Assuming the two ways of writing this code (the snippets way and the way I tried) are implemented correctly, would one be faster than the other? If so, why? Due to memory needed, or another reason?
That's a big question and a proper answer would require a more detailed analysis of your code than I have time for right now, I'm afraid. But in general I don't think it's the amount of memory needed that's the problem: you seem to only ever keep a few 3x100 matrices in memory at the same time, and regardless of the approach taken, the 300 doubles for positions and 300 doubles for velocities are anyway something that must be kept in memory. What might be the problem is the layout in memory, if it causes the machine to have to stride through a lot of memory every time an entry in one of your matrices is updated. We have some introduction to these issues here: https://anderkve.github.io/FYS3150/book/optimization/memory_is_the_bottleneck.html
I haven't thought carefully about whether the current way you're doing this with your armadillo matrices is expected to be slow or not. But one thing you can try is to take a smaller example (say only 10 particles and not too long simulation time) and see if it makes any difference if you organise your matrices as 100x3 vs 3x100. With the redesign hinted at with the code snippet, there will be no matrices involved at all, only some `std::vector<Particle>` vectors, which will be organised as just a series of `Particle` instances right next to each other in memory, so looping through them one particle at a time should not introduce much unnecessary searching through memory.
Another thing that might be worth a try before starting a complete redesign is to run a quick profile of your code, to see if it can tell you where the bottleneck is. One program you can use for this is `gprof`, see instructions here: https://sourceware.org/binutils/docs-2.17/gprof/
Very short summary of instructions:
- Add the flag `-pg` when you build your program.
- Run your program once; this produces a file called `gmon.out`.
- Run `gprof` to get info on the most-used parts of your program: `gprof <name of your executable> gmon.out`

There are explanations of what the `gprof` output means both in the output itself and via the link above, but in short it tells you which parts of the code your program spends the most time in.
Finally, if you find that attempting to speed up your code is taking too much time, it may be better to solve Problem 9 using a smaller set of particles and rather spend the time on writing the report. (You can probably get reasonable results with say only 10 particles.) Just explain in your report the reason why you had to run with fewer particles.
If I use the indices approach, how are we supposed to only pass an index? When we do that, I assume that we should do something like

```cpp
arma::vec r = particles[i].r;
```

in the functions, but these will not be updated until after `evolve_RK4` has been run? Or should we update them "as we go"?
I might misunderstand the question, but perhaps this is answered by the note at the end of Problem 7, about what steps the RK4 code should go through for a single timestep `dt`?
In a bit more detail, one suggestion along the lines of that note is to do something like this in the `evolve_RK4` function:
```cpp
// Create a temp copy of the particles contained in PenningTrap::particles
std::vector<Particle> particles_old = particles;

// Collections of arma::vec's to store the RK4 k-values for each particle
std::vector<arma::vec> k1_r(particles.size());
std::vector<arma::vec> k1_v(particles.size());
std::vector<arma::vec> k2_r(particles.size());
std::vector<arma::vec> k2_v(particles.size());
std::vector<arma::vec> k3_r(particles.size());
std::vector<arma::vec> k3_v(particles.size());
std::vector<arma::vec> k4_r(particles.size());
std::vector<arma::vec> k4_v(particles.size());

// RK step 1:
// Loop over all particles in the Penning trap. For each particle i (i.e. particles[i]):
//   - use the current velocity of particles[i] to set k1_r[i]
//   - use the current total force on particles[i] to set k1_v[i]

// RK step 2:
// Loop over all particles in the Penning trap. For each particle i:
//   - update the position of particles[i] using particles_old[i].r, k1_r[i]
//     and the appropriate factor 0.5 for step 2 of RK
//   - update the velocity of particles[i] using particles_old[i].v, k1_v[i]
//     and the appropriate factor 0.5 for step 2 of RK
// Loop over all particles in the Penning trap. For each particle i:
//   - use the current velocity of particles[i] to set the new k2_r[i]
//   - use the current total force on particles[i] to set the new k2_v[i]

// RK step 3:
// (Analogous to step 2)

// RK step 4:
// (Analogous to steps 2 and 3)

// Final step:
// Loop over all particles in the Penning trap. For each particle i:
//   - update the position of particles[i] using particles_old[i].r,
//     k1_r[i], k2_r[i], k3_r[i] and k4_r[i]
//   - update the velocity of particles[i] using particles_old[i].v,
//     k1_v[i], k2_v[i], k3_v[i] and k4_v[i]
```
This way, the state of the Particle instances in `PenningTrap::particles` always corresponds to the current whole/half timestep of the RK4 algorithm, so to get e.g. the total force at the given time, all you need is to pass an index `i` (plus the appropriate current time, once you include the time-dependent external force in Problem 9).
So I rewrote the code, and even though a test run for $t=5 \mu s$ gives the wrong result, a run for $t=500 \mu s$ still takes a really long time. I really have no idea why it takes so long, compared to the runtime you told me your code achieves.
Hmm. Did you try to do the profiling with `gprof`? And just to check, are you compiling with `-O2` or `-O3`? And have others in your group tried to run your code, just to test it on some other hardware?
If nothing seems to help, I'd advise you to just use the code you trust the results for and scale down the problem (number of particles) until you get runtimes that you can work with.
I get this output when trying to profile:
```
g++ -O2 main.cpp src/particle.cpp src/penning_trap_2.cpp -std=c++11 -I include -o main.exe -larmadillo -pg
clang: error: the clang compiler does not support -pg option on versions of OS X 10.9 and later
```
I'm compiling with `-O2`. I think I'll scale down the problem for now, so we can continue writing the report.
One final question: do you think the fact that I'm writing all the fractions of particles left in the trap to a matrix, and then opening a file with `fstream`, contributes to the code taking a long time? It would be really nice to understand what makes the code so slow, even though we might not be able to fix it.
> I get this output when trying to profile:
> `g++ -O2 main.cpp src/particle.cpp src/penning_trap_2.cpp -std=c++11 -I include -o main.exe -larmadillo -pg`
> `clang: error: the clang compiler does not support -pg option on versions of OS X 10.9 and later`
Hmm, OK, looks like that's for g++ only then. But there should be MacOS options for profiling, see e.g. here: https://rotadev.com/profiling-c-on-mac-os-x-dev/
A naive profiling can also be achieved by just adding some cout statements in the code, to try to observe whether some parts of the code seem to take a surprisingly long time.
> I think I'll scale the problem for now, so we can continue writing the report.
Sounds good.
> One final question; do you think the fact that I'm writing all the fractions of particles left in the trap to a matrix, and then opening a file with fstream contribute to the code taking long time? It would be really nice to understand what makes the code so slow, even though we might not be able to fix it.
I'd be surprised if that is the reason, since modifying this matrix is something that only happens once per frequency step. I expect the slowness is rather related to the simulation itself. But it's of course easy to check: just comment out all the code associated with this matrix and file writing and rerun the code. I agree that it would be very good to understand why it's so slow...
> This takes a very long time. Even with the Coulomb interactions, it takes forever. I'm using $dt=1.3\cdot10^{-2}$, which is the biggest one from the suggestions in the project description. This is obviously also the step with the biggest error, so I'm wondering how I should (if I should) be able to use a smaller $dt$.