Code benchmark - Githubissues

NikoOinonen commented 4 months ago

So we were planning to add a graph of the computational scaling of the code as a part of the paper (#68). I did a simple benchmark of the execution time of the GPU code by using various different super-cell sizes of a graphene sheet. The result looks something like this: The execution time here is the pure computation time without any of the IO or device array transfers.

A couple things to consider:

Should we also run this benchmark for the CPU code? It might not be as meaningful, since the CPU code spends a significant portion of the time simply reading and writing files from/to the disk.
Should we include the script and data that produces this plot as an example in the repo?

@ProkopHapala @mondracek @ondrejkrejci @yakutovicha @aureliojgc What do you guys think?

ProkopHapala commented 4 months ago

Hi Niko, perfect. It seem very nice.

If I understand it for FDBM the number of atoms is irrelevant, but number of atoms scales liberály with number of atoms (both scale in 2D xy plane while z-size is fixed) right?
This is just Forcefield generation step (which is the time limiting) or the whole calculation (including tip relaxation?
Yes I think we should measure this also on CPU to show the speedup. I would not count I\O operations ass well. It should showcase the efficiency of the parallelization. Now with OpenMP it may make sense to remove disk I\O bottleneck from CPU part as well in future.
I guess the GPU numbers will depend very much on GPU model. It would be good to show results for some low-end GPU model (like laptop) and some high end (like datacenter). I'm curious to try it on my.

NikoOinonen commented 4 months ago

* If I understand it for FDBM the number of atoms is irrelevant, but number of atoms scales liberály with number of atoms (both scale in 2D xy plane while z-size is fixed) right?

I assume you mean the number of grid points scale linearly with the number of atoms? Yes, although the match is not quite exact because the grid step is not commensurate with the lattice spacing. The number of atoms in FDBM matters less, but it's still there because you have to compute the vdW term.

* This is just Forcefield generation step (which is the time limiting) or the whole calculation (including tip relaxation?

This is the whole calculation: force-field, tip-relaxation, and convolution.

* Yes I think we should measure this also on CPU to show the speedup. I would not count I\O operations ass well. It should showcase the efficiency of the parallelization. Now with OpenMP it may make sense to remove disk I\O bottleneck from CPU part as well in future.

Ok, I can do it, but it's a bit ugly since I will have to modify all the CLI scripts to get the timings of just the computation, so this cannot be done in a single script. Should we show just single-core performance or with the OpenMP parallelization?

* I guess the GPU numbers will depend very much on GPU model. It would be good to show results for some low-end GPU model (like laptop) and some high end (like datacenter). I'm curious to try it on my.

Yeah, I can do this as well. Only factor to consider here is that the number of grid points grows so large for the biggest system size that it will run out of memory on a low-end GPU. I can upload the script and the data at some point so you can try as well.

ProkopHapala commented 4 months ago

This is the whole calculation: force-field, tip-relaxation, and convolution.

Ok, I can do it, but it's a bit ugly since I will have to modify all the CLI scripts to get the timings of just the computation, so this cannot be done in a single script

Aha, maybe if we compare just the FF generation part it would be easier and more apples-to-apples ?

single-core performance or with the OpenMP parallelization?

Good question. perhaps it would be good to see how ti scales, but perhaps not for all

NikoOinonen commented 4 months ago

Aha, maybe if we compare just the FF generation part it would be easier and more apples-to-apples ?

Well, you still have to run multiple scripts just for the force field (LJ, electrostatics, vdW separately for the FDBM) so it's maybe all the same.

What I'm wondering is should we put all of the different measurements into one graph, each separately, or somehow divided? Maybe it would make sense to have one graph like the first comment above that shows how the different methods compare to each other using one device, and then a second graph that shows the difference between different compute devices for just one FF method (e.g. FDBM).

mondracek commented 4 months ago

Regarding where to upload the source data and the script to make it public:

On the one hand, if we are going to mention it in the paper, we have to make it available to the public somewhere.
On the other, I have doubts whether it should be included as an example in the ppafm/examples directory. I mean, the primary purpose of the directory is to include examples that can be used as a kind of guide to the users, perhaps even a template to be copied and modified by users to set up their own production calculations. The time scaling test is probably not particularly instructive as a template for such purpose. But then again, when installing the code, the users may be interested in how the scaling on their computers compares to that published by us.
I am not sure how suitable Zenodo is for uploading pieces of code (scripts) together with the inputs for them. If we need to put it somewhere on GitHub anyway, I don't think it makes much sense to create a separate directory (outside examples) or even a separate repository solely for this purpose, I would put in to ppafm/examples then.

So,

either if you feel the users might be interested in running the test for themselves on their machines when they install the ppafm code,
or if it would be too much hassle to publish it on Zenodo,

I recommend placing it in the examples directory of the ppafm repo. Otherwise, I would upload it to Zenodo instead.

ondrejkrejci commented 4 months ago

Hi, sorry for my late response. Thank you @NikoOinonen for the data. When I first saw this (2) comment from referee, then I thought something in this direction: FHI-aims gpu acceleration manuscript - have a look at pages 31-33. Thus I would really suggest just compare OpenMP-full CPU vs. GPU. I also agree with @ProkopHapala - that we should time, each procedure separately in terms of CPU and sum it up for the final number, if it is not a big deal. and just to comment on it in the text. as for storing - I opt for a special script on Zenodo - we could put together, e.g. the L-J computation could be put into an new script generating script, which would have the timing in the beginning and just before the saving. I think that we can change the text in 5.2.2 (2nd paragraph accordingly to the image).

ProkopHapala commented 4 months ago

Maybe it would make sense to have one graph like the first comment above that shows how the different methods compare to each other using one device, and then a second graph that shows the difference between different compute devices for just one FF method (e.g. FDBM).

Yes, I was thinking about something like that. Two graphs would be optimal I think, but what exactly compare on each would be better decide when we have the data and see how crowded it is.

ProkopHapala commented 4 months ago

I also agree with @ProkopHapala - that we should time, each procedure separately in terms of CPU and sum it up for the final number, if it is not a big deal. and just to comment on it in the text.

On the other hand the graph @NikoOinonen plotted for GPU seems very nice, it shows rather clear simple scaling so the overhead is negligible. I would keep it. The question is if this stay the same for CPU where the (I\O) overhead is probably bigger.

NikoOinonen commented 4 months ago

benchmark_all

Here is comparison for different compute devices using the FDBM force field. The CPU is Intel i9-13900K, and the GPUs are Nvidia GTX 1650Ti Mobile, AMD RX 6700 XT, and Nvidia A100. I think this shows quite nicely the benefit of the GPU. Even the laptop GPU is at least an order of magnitude faster than the CPU at its best.

A couple of remarks:

The CPU time is just the computation time without any of the I/O, just like the GPU time.
The CPU time does not scale that well with higher number of threads because the numpy FFT part does not utilize multiple threads. Also, for this particular CPU model, the cores are heterogeneous, some of them are faster/slower than the others.
The laptop GPU is lacking the last two points because there is not enough VRAM.
The curve for the server GPU seems to be pretty flat and actually slower than the desktop GPU for small system size. Probably the bottleneck here is some other place than the GPU, and the CPUs on the servers are usually slower per-core than desktop CPUs, so this could explain it.

I thought something in this direction: FHI-aims gpu acceleration manuscript - have a look at pages 31-33.

I'm not sure what you are referring to. The PDF only has 15 pages. Figure 7?

NikoOinonen commented 4 months ago

we could put together, e.g. the L-J computation could be put into an new script generating script, which would have the timing in the beginning and just before the saving.

This is going to get a bit complicated because the saving is actually happening inside the procedure in the HighLevel.py: https://github.com/Probe-Particle/ppafm/blob/8ebe87345b87d27a3ceb6a26e68c4ee89d82c70a/ppafm/HighLevel.py#L268-L274 The way that I did it above is that I inserted a print with the timing around each of the steps, ran it for all of the different grid sizes, and then afterwards picked out the timings with a regex from the log files.

Additionally, above for the FDBM model I had to manually resample the tip density into each of the different grid sizes because the CPU code currently does not do this automatically (unlike the GPU code).

So the script would have to be quite complicated, and therefore probably would not be that useful/instructive for others. I would probably leave this out of the Zenodo for the manuscript, if it's not absolutely needed.

ondrejkrejci commented 4 months ago

I like the image very much the image, thank you @NikoOinonen . Thinking about it - I would remove the server, to omit questions about the jump from referees. From my point of view, I would put there only 1CPU, then 32 CPUs and a desktop GPU as an example. Maybe you can make an SVG and then we can very easily change things on it, based on a common agreement rather than replotting many times. Of course feel free to suggest anything else. Also @ProkopHapala - Please could you comment on the lines that we would actually plot?

By the way - this is the image I was referring to. Working from home, I am using the open manuscript:

As for the example. @NikoOinonen is right. This time we are not revealing new physics as such, just timing, so it is probably fine not to have anything. If it is easy to put just the geometries on Xenodo, then I think that it is fine, so other can do the timing for their GPUs, or for us in the future.

ProkopHapala commented 4 months ago

I like the image very much the image, thank you @NikoOinonen .

Yes, me to. Very nice.

I would remove the server, to omit questions about the jump from referees.

I don't see problem with that. I would keep it and just write what Niko suggested in the discussion. I mean:

The curve for the server GPU seems to be pretty flat and actually slower than the desktop GPU for small system size. Probably the bottleneck here is some other place than the GPU

Also @ProkopHapala - Please could you comment on the lines that we would actually plot?

I would just keep it as it is.

Just 2 minor comment about formating:

I would plot it with grid-lines plt.grid() because it helps to see actual speedup numerically (without the grid lines it is hard to follow if it is far from the axis, without using ruler).
Maybe I would also adjust x-axis so that we see $10^5$ tick label for number of grid points and $10^{-3}$ and $10^3$ [s] for execution time on the y-axis

NikoOinonen commented 4 months ago

benchmark_all

Adjusted according to @ProkopHapala's suggestion.

I'm thinking maybe this one figure is enough. We already have some comparison numbers for the different force-field models in the text, so the first figure is maybe not adding too much useful information.

Also, the relationship between the different force-field models depends quite a lot on the type of system. For example in the above figure the LJ+PC is not significantly faster than the FDBM or even slower for large system size, but this is a periodic system so there are relatively many atoms per voxel, whereas for a non-periodic system the LJ+PC would be significantly faster, so the figure is not really representative of the general case.

ProkopHapala commented 4 months ago

Adjusted according to @ProkopHapala's suggestion.

Yeh, I think it looks perfect now.

I'm thinking maybe this one figure is enough. We already have some comparison numbers for the different force-field models in the text, so the first figure is maybe not adding too much useful information.

@NikoOinonen : IMO, your previous figure carry useful information and present it better (more clearly and rigorously) than what we wrote previously in the text. Therefore I would keep both figures (with similar formating with gridlines and tick labels).

But let's see what other think. I don't insist on this.

@ondrejkrejci : Also I do not really want to shorten the "review" part text (it takes us time to consider what to remove, and I think generally all the information written there I find usefull). So I'm speculating, if we show we adress reviewer suggestion seriously in this point (rather detailed performance tests), we may be free to ignore more his suggestion to shorten the review part. Unless we are hitting some size limits?

Generally I'm inclined to give rather more information than less.

NikoOinonen commented 4 months ago

IMO, your previous figure carry useful information and present it better (more clearly and rigorously) than what we wrote previously in the text.

But I kind of think that it's not really that rigorous, since it's just for a single type of system. In order to make it actually representative, we would have to run the test for several different kinds of systems and show how the scaling is for each one of them, because the bottleneck will be in different places.

ProkopHapala commented 4 months ago

But I kind of think that it's not really that rigorous, since it's just for a single type of system. In order to make it actually representative, we would have to run the test for several different kinds of systems and show how the scaling is for each one of them, because the bottleneck will be in different places.

What would differ from system-to-system? I think the FF generation depends just on number of atoms, and/or grid size - and we show both on the x-axis. The relaxation can vary (probably) depending on system chemistry (e.g. if probe get stuck in some where, it may have problem to converge, or take more iterations), but relaxation time is rather small in comparison to FF generation. And nobody should ask us to test it for all possible systems just for that.

I think for end-user it is good to show clearly what is the total simulation time (i.e. both FF generation + relaxation) even though we know that FF generation is the bottleneck, because the total time is what user cares about, and it make it more clear show it on plot, than to explain it in the text.
- So my idea was to show the first image to give clear message to the end-user about the total GPU speed (the mesage is - it is pretty fast all things considered), and than on the 2nd picture show more rigorous testing just for the FF to compare the platforms (because it is difficult to compare the total time on different platforms due to differences in workflow and I/O overheads ). To me both informations make sense and are valuable for the user.

NikoOinonen commented 4 months ago

What would differ from system-to-system? I think the FF generation depends just on number of atoms, and/or grid size - and we show both on the x-axis.

Yes the FF generation depends on the number of atoms (N) and the number of grid points (M), but the dependence is different for the different FF models. For the Lennard-Jones the scaling is O(NM), but for the FFT-based Hartree and Pauli calculations it's O(Mlog(M)), which importantly is independent of the number of atoms.

The problem is that the number of atoms is not in general a linear function of the grid size, which it is in this case because the size of the graphene sheet grows at the same rate as the number of grid points as the cell size gets larger. For the largest cell size here if you take into account the periodic copies is actually N=~10k. But if you have a non-periodic system then you could have much less atoms (e.g. N=~100) for the same number of grid points, because you want to leave some empty space on the sides, so then the relative difference between the LJ-calculation and FFT could be much different.

ProkopHapala commented 4 months ago

Yes the FF generation depends on the number of atoms (N) and the number of grid points (M), but the dependence is different for the different FF models. For the Lennard-Jones the scaling is O(NM), but for the FFT-based Hartree and Pauli calculations it's O(Mlog(M)), which importantly is independent of the number of atoms.

yes, that is why I think it is good to show also figure 1 (to compare scaling of different methods), and explain this in the text (with big-O scalling formulas)

The problem is that the number of atoms is not in general a linear function of the grid size, which it is in this case because the size of the graphene sheet grows at the same rate as the number of grid points as the cell size gets larger. For the largest cell size here if you take into account the periodic copies is actually N=~10k. But if you have a non-periodic system then you could have much less atoms (e.g. N=~100) for the same number of grid points, because you want to leave some empty space on the sides, so then the relative difference between the LJ-calculation and FFT could be much different.

yes, I would explain this in the text, that we choose graphene as text example because number of atoms scales linearly with number of grid point, so we can use just 1 x-axis (otherwise we woul need show dependence on atoms and on grid separately)

Now I'm thinking FDBM on 32 CPU cores is just meassuring fftw library used by numpy ?
- We do not measure our OpenMP accelrated CPU code at all ?

ondrejkrejci commented 4 months ago

Hi all, I rather agree with @ProkopHapala - that both of the images are actually better than a single one. I have created a temporary image, so we can adjust the text for the reply about it. I will hopefully work about that one tonight.

It is in overleaf >images>ppafm_spead.pdf

As for the conversation that you have no. I think that the graphene example, is better than what we have in the text now:

The full simulation (force- field grid generation and tip relaxation) takes ∼ 0.025 s using the LJ+PC force field, ∼0.066 s using LJ+Hartree, and ∼0.12 s using FDBM. Of the simulation time, 0.01 s ∼ 0.02 s goes into the relaxation step, meaning that most of the time is typically spent on the force field calculation, aside from small systems using point charges. Additional time is taken loading the input files from disk and preparing arrays in the GPU memory, which actually become the bottleneck for single simulations. How- ever, these operations are amortized if a batch of simulations is run using the same grid, as is the case for example in the GUI when changing simulation parameters not related to the grid size. Here, for the phtalocyanine molecule the most time- consuming part of the simulation is the FFT-convolution, but in systems with a large number of atoms (e.g. a periodic surface slab) the Lennard-Jones calculation can become just as expen- sive or even more so, reducing the difference in computational cost between the different force field models.

So you could guys prepare some text about it, and \sout{} the old one, and put everything in {\color{red} CHANGED TEXT}.

ondrejkrejci commented 4 months ago

Ok, so the image example is here: !ppafm_speed.pdf

and this is a current version of the text:

Figure 7: Examples of scaling and speed performance of the PPAFM code on various processing units:. Both of the images (a-b) are showing execution time of the whole CO-tip AFM simulation above a graphene sheet of differ- ent sizes.) (a) Illustrates the execution time, for three diffreent levels of the- ory depending on the size of the system: Lennard-Jones potential with point- charges electrostatics (LJ + PC), Lennard-Jones potential + electrostatics from the Hartree potential (LJ+Hartree) and the full density based method (FDBM). These calculations were proceeded on AMD RX 6700 XT GPU. (b) Compari- son of the FDBM method no CPU and various GPUs, depending on the size of the graphene sheet. The used CPU is Intel i9-13900K, and the GPUs are: laptop – Nvidia GTX 1650Ti Mobile, desktop – AMD RX 6700 XT and server – Nvidia A100. The image nicely illustrates the huge speed-up of the whole procedure by the employment of GPUs. The speed of calculations on GPUs of really small systems is not depending only on the speed of the GPU, but also other factors, like memory access time. Finally the CPU is scaling is also not directly linear, as some parts of the procedure, like Fast Furrier Transform are not parallelized.

@NikoOinonen - would you be so kind to provide the version of the first image also with the grid lines, so they are of the same style?

What do you guys think about the text? -> Feel free to change.

Maybe we can make the (a) and (b) font smaller, but that is a small thing.

NikoOinonen commented 4 months ago

Now I'm thinking FDBM on 32 CPU cores is just meassuring fftw library used by numpy ? We do not measure our OpenMP accelrated CPU code at all ?

The FDBM does have the D3 vdW calculation also in it which does make use of the OpenMP acceleration. And actually it's the single most time-consuming part of the calculation in this case, which is why the execution time does go down with the number of cores. But the at 32 cores the FFT is already taking about half of the total time, so the scaling is not as good.

NikoOinonen commented 4 months ago

Here is a version with grid lines in both plots. The top plot is a little bit different from the original, because I realized that I used the wrong GPU. Now the GPU in the top plot is the same as the "GPU (desktop)" in the bottom plot.

@ondrejkrejci @ProkopHapala Are we happy with this?

benchmark_combined

ProkopHapala commented 4 months ago

The FDBM does have the D3 vdW calculation also in it which does make use of the OpenMP acceleration. And actually it's the single most time-consuming part of the calculation in this case, which is why the execution time does go down with the number of cores. But the at 32 cores the FFT is already taking about half of the total time, so the scaling is not as good.

Aha, interesting, do you know if FFT libs are using just single thread or they are themself paralezied (but not controled by our OMP flags ?)

Perhaps it would be good to see detailed partitioning of execution time between D3 and FFT part of Force-Field calclation, but perhaps just for us, not for publication.

ondrejkrejci commented 4 months ago

Great @NikoOinonen ! Updated both on the manuscript and the response. Now what about the caption:

Figure 7: Examples of scaling and speed performance of the PPAFM code on various processing units:. Both of the images (a-b) are showing execution time of the whole CO-tip AFM simulation above a graphene sheet of differ- ent sizes.) (a) Illustrates the execution time, for three diffreent levels of the- ory depending on the size of the system: Lennard-Jones potential with point- charges electrostatics (LJ + PC), Lennard-Jones potential + electrostatics from the Hartree potential (LJ+Hartree) and the full density based method (FDBM). These calculations were proceeded on AMD RX 6700 XT GPU. (b) Compari- son of the FDBM method no CPU and various GPUs, depending on the size of the graphene sheet. The used CPU is Intel i9-13900K, and the GPUs are: laptop – Nvidia GTX 1650Ti Mobile, desktop – AMD RX 6700 XT and server – Nvidia A100. The image nicely illustrates the huge speed-up of the whole procedure by the employment of GPUs. The speed of calculations on GPUs of really small systems is not depending only on the speed of the GPU, but also other factors, like memory access time. Finally the CPU is scaling is also not directly linear, as some parts of the procedure, like Fast Furrier Transform are not parallelized

And about the 5.2.2 text:

The full simulation (force- field grid generation and tip relaxation) takes ∼ 0.025 s using the LJ+PC force field, ∼0.066 s using LJ+Hartree, and ∼0.12 s using FDBM. Of the simulation time, 0.01 s ∼ 0.02 s goes into the relaxation step, meaning that most of the time is typically spent on the force field calculation, aside from small systems using point charges. Additional time is taken loading the input files from disk and preparing arrays in the GPU memory, which actually become the bottleneck for single simulations. How- ever, these operations are amortized if a batch of simulations is run using the same grid, as is the case for example in the GUI when changing simulation parameters not related to the grid size. Here, for the phtalocyanine molecule the most time- consuming part of the simulation is the FFT-convolution, but in systems with a large number of atoms (e.g. a periodic surface slab) the Lennard-Jones calculation can become just as expen- sive or even more so, reducing the difference in computational cost between the different force field models.

And for @ProkopHapala:

Perhaps it would be good to see detailed partitioning of execution time between D3 and FFT part of Force-Field calclation, ? but perhaps just for us, not for publication.

Agree with that, I would not add a huge amount of details, people can ask us, or we can write in our documentation. The speed-up and speed vs size is what I thing is the most important information for the manuscript.

NikoOinonen commented 4 months ago

Aha, interesting, do you know if FFT libs are using just single thread or they are themself paralezied (but not controled by our OMP flags ?)

The FFT execution time seemed to be constant regardless of number of threads, so not parallelized at the moment. I don't know if there is an option for that.

Perhaps it would be good to see detailed partitioning of execution time between D3 and FFT part of Force-Field calclation, but perhaps just for us, not for publication.

For the largest graphene size with 32 threads, the breakdown was like this:

Pauli: 9.224506776197813s
Electrostatics: 7.773861776200647s
D3 vdW: 20.93192961000168s
Tip relaxation: 1.220511164799973s
Total: 39.150809327200115s

I don't have the data for the other number of threads at the moment because the files got overwritten. I can rerun them if you want.

ondrejkrejci commented 4 months ago

Hi both @NikoOinonen and @ProkopHapala - I have left my notes at home today, so I have went through this task, to finish it. I have made some minor changes (grammatics, doubnle usage of words and sentences separation). Please have a look. I would personally remove this sentence:

Replacing the NumPy FFT by a different better-parallelized FFT library is a simple way to further improve performance on CPU in the future.

as none of the users would do that. It is more an inner knowledge for developers, but I do not think that it is critical in any way.

I have copied the text to the referees report as well.

If you think that this is fine, then I would close the issue.

NikoOinonen commented 4 months ago

I would personally remove this sentence:

Replacing the NumPy FFT by a different better-parallelized FFT library is a simple way to further improve performance on CPU in the future.

as none of the users would do that. It is more an inner knowledge for developers, but I do not think that it is critical in any way.

I think the point of this was just to acknowledge that there is a rather easy way to make the CPU computation faster, but we just haven't implemented it yet. But it's not a crucial part, so I'm fine either way.

ProkopHapala commented 4 months ago

Replacing the NumPy FFT by a different better-parallelized FFT library is a simple way to further improve performance on CPU in the future.

as none of the users would do that. It is more an inner knowledge for developers, but I do not think that it is critical in any way.

I did not ment users should do it. But that it is outlook for future development.

I don't insist to keep it, but why you want to remove it? Just to make it shorter, or you find it problematic?

ondrejkrejci commented 4 months ago

I don't insist to keep it, but why you want to remove it? Just to make it shorter, or you find it problematic?

No, it is just a personal preference, of that I would not put it there. I am fine with it both ways.

If you both have went through that one. I am closing this issue. If @NikoOinonen or @ProkopHapala want to use the numbers above or have something that you want to discuss in this direction, feel free to reopen it.

Probe-Particle / ppafm

Code benchmark #282