euroscipy / euroscipy_proceedings

Tools used to generate the SciPy conference proceedings

[Paper+1] "Massively parallel implementation in Python of a pseudo-spectral DNS code for turbulent flows", Mikael Mortensen #48

Closed mikaem closed 8 years ago

NelleV commented 9 years ago

Dear Mikael,

There is an image missing: weak.png. Can you please commit it?

Many thanks, N

mikaem commented 9 years ago

Added two missing figures. Hope that does the trick.

NelleV commented 9 years ago

It now builds fine! Thanks!

mjklemm commented 8 years ago

Dear Mikael,

Your paper is generally well written and easy to read.

There are a few minor editing comments I would like to make. One major comment relates to the comparison of the Python solver with the C++ implementation. I appreciate the comparison, but a better baseline would be the mentioned production-quality DNS solver (or a different, but highly tuned, DNS solver of the same grade), together with a corresponding discussion of the C++ and Python performance with respect to the production code. Without that it is hard to judge whether the C++ code delivers a reasonable level of performance.

Minor comments:

I hope this review is helpful. If there are questions, please let me know.

Kind regards, -michael

mikaem commented 8 years ago

Dear Michael,

Thank you very much for the feedback, and sorry about the late reply. I have made all the corrections you raised and updated the PR.

I appreciate your major comment about the C++ implementation. Unfortunately, I do not have access to a C++ production code that I know to be high quality and highly tuned. Nevertheless, I have actually run some comparisons with open source C++ solvers like Tarang (http://arxiv.org/pdf/1203.5301.pdf) and found that my Cython/Python and C++ codes were usually at least 10-20 % faster. I did not really find the need to make this comparison in the paper, since I do not know the people developing Tarang and there may be factors in their installation relevant to speed that I am simply not aware of. Consider Figure 2 of http://arxiv.org/pdf/1203.5301.pdf: Tarang is there running on the same computer as I used, at 50 seconds per time step for a box of size 1024^3 on 512 processors. My results in the paper do not show exactly the same case, but for the same box (1024^3) with twice as many cores (1024), my code runs at approximately 20 seconds per time step (see Fig. 2, the slab decomposition with the largest mesh). Assuming perfect scaling, that would correspond to 40 seconds on 512 cores, which is 20 % faster than Tarang. Since you did bring it up, I have now put this (admittedly weak) link into the paper for people to see, but I have not made any direct comparisons. That should be acceptable, since it leaves out the uncertainty of installation, and it is the same computer.
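
For concreteness, here is the back-of-the-envelope extrapolation as a few lines of Python (the numbers are taken from the two figures mentioned above):

    # Assuming perfect strong scaling from 1024 down to 512 cores:
    tarang_512 = 50.0                   # s/step, 1024^3 box, 512 cores (Tarang, their Fig. 2)
    mine_1024 = 20.0                    # s/step, 1024^3 box, 1024 cores (my Fig. 2, slab)
    mine_512 = mine_1024 * 1024 / 512   # extrapolated: 40 s/step on 512 cores
    print(1.0 - mine_512 / tarang_512)  # 0.2, i.e. 20 % faster than Tarang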

Best regards

Mikael

pdebuyl commented 8 years ago

@NelleV I can review this, is this needed?

NelleV commented 8 years ago

Yes please.

pdebuyl commented 8 years ago

Dear Mikael,

The paper is well written and balances the technical details well with the principles of the simulation work. My comments are thus rather focused; they do not challenge the content of the article but are mostly aimed at clarifying or adding detail.

  1. typo: page 1, "this in not simply" should read "this is not simply"
  2. typo: page 2, in section 3.1: "Python implementation may" lacks an article for Python: "The Python implementation may"
  3. missing "and": page 2, section 3.3

    The regular Python modules `numpy.fft`, `scipy.fftpack`, [pyfftw]_ all

    should read

    The regular Python modules `numpy.fft`, `scipy.fftpack` and [pyfftw]_ all
  4. Reference data from the annual International Workshop on High-Order Methods is mentioned. What datafile is being used and why is this data considered reliable?
  5. weird phrasing on page 4, section 5. The sentence

    indicating that the MPI communications are performing at the level with C++

    mixes "at the level of" and "with".

  6. A Numba implementation presented in section 3.4 is said to be 5 times faster than the NumPy version. How does it compare to the Cython version? Also, the Numba version is not used further for the scaling results. Why?
  7. Naming the NumPy-based solver "pure Python" is misleading. While the logic of the code and the communication patterns are written in Python, the use of NumPy is critical to making the present work possible. The further optimisation with Cython is clearly indicated, though.
  8. In Fig. 2, the logarithmic scale makes it hard to see the performance gap between the NumPy and the C++ versions. It would be good to quote the timing values for a low number of cores (2 or 8, for instance). Otherwise, one could almost read from the figure that the NumPy version (called pure Python in the manuscript) is already on par with the C++ one, besides its excellent scaling.
  9. The memory ordering of the data has the "x,y,z" component first. Thus, the information for a single node of the mesh is scattered across three locations. This gives better locality for the Fourier transform but worse locality for the cross product, I suppose. Could you comment on this choice? It may be obvious to someone doing DNS, which is not my situation (I do mostly particle simulations).
  10. The code to perform the simulations is available online under the GPL license. This should be mentioned in the text.

Overall, the paper is good and I am confident that it will reach a final version in a reasonable time :-)

Regards,

Pierre

mikaem commented 8 years ago

Dear Pierre,

Thank you for your comments and for your time. I address your comments below and I have also updated the pull request.

  1. Fixed
  2. Fixed
  3. Fixed
  4. The datafile is referred to in the References section. The data is considered reliable because it has been generated by a well-verified and well-documented pseudospectral solver that is very similar to the one described in this paper. From start to finish, over 20,000 time steps, the L2 error norm of the solution computed by my solver never strays more than 1e-6 from the reference solution. (A sketch of this kind of check is given after this list.)
  5. Fixed
  6. The Numba version is essentially as fast as the Cython version. It is not used for the scaling results because Numba is not installed on the Blue Gene supercomputer; I was not even able to install Numba on my own laptop, so I did not try on the Blue Gene.
  7. I agree, pure Python is somewhat misleading. However, it's not pure NumPy either, because it is just as dependent on MPI for Python and pyFFTW. Perhaps standard scientific Python would be better?
  8. Here I have to disagree. I think the figures clearly show that the optimized, non-optimized, and C++ solvers all scale well in a weak sense, and the performance loss of 30-40 % is clearly stated in the text. I can agree that it is not easy to see the 30-40 % in the figure, but that is the price we pay for using a logarithmic plot, and this type of plot is very typical for such performance tests. Figure 3 clearly shows that the standard Python solver scales poorly, something that is attributed to a for-loop in the code.
  9. The ordering is row-major (C-style), and I believe it is optimal for this code, which relies heavily on NumPy ufuncs. Consider the implementation of the cross product:

     def cross(c, a, b):
         """Regular c = a x b"""
         c[0] = a[1]*b[2] - a[2]*b[1]
         c[1] = a[2]*b[0] - a[0]*b[2]
         c[2] = a[0]*b[1] - a[1]*b[0]
         return c

    Here `a[1]*b[2]` is a NumPy ufunc expression that automatically loops over all trailing indices, `a[1, :, :, :]*b[2, :, :, :]`. What's more, the loop runs over a contiguous memory block. If we instead place the "x,y,z" component last, by creating the data as, e.g., `a = zeros([N, N, N, 3])`, then the code becomes

     def cross(c, a, b):
         """Regular c = a x b, with the component index last"""
         c[..., 0] = a[..., 1]*b[..., 2] - a[..., 2]*b[..., 1]
         c[..., 1] = a[..., 2]*b[..., 0] - a[..., 0]*b[..., 2]
         c[..., 2] = a[..., 0]*b[..., 1] - a[..., 1]*b[..., 0]
         return c

    and the NumPy ufuncs, like `a[..., 1]*b[..., 2]`, are no longer looping over contiguous memory blocks. To verify this, I have created this gist (a minimal sketch of the same timing experiment is also included after this list). A Fortran implementation would probably be faster using this last option, due to the column-major storage.

  10. Fixed.
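
Regarding point 4: a minimal sketch of the kind of verification check performed against the reference data (the names `u` and `u_ref` are hypothetical, not the actual solver code):

    import numpy as np

    def l2_error(u, u_ref):
        """One common discretisation of the L2 error norm between the
        computed solution u and the reference solution u_ref."""
        return np.sqrt(np.mean((u - u_ref)**2))

    # Over all 20,000 time steps, l2_error(u, u_ref) stays below 1e-6.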
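
Regarding point 9: the gist essentially times the two memory layouts against each other. A minimal self-contained sketch of that experiment (the mesh size N is arbitrary and kept small here for illustration):

    import numpy as np
    from timeit import timeit

    N = 64  # illustrative mesh size

    def cross_first(c, a, b):
        """Component index first: each ufunc loops over contiguous memory."""
        c[0] = a[1]*b[2] - a[2]*b[1]
        c[1] = a[2]*b[0] - a[0]*b[2]
        c[2] = a[0]*b[1] - a[1]*b[0]
        return c

    def cross_last(c, a, b):
        """Component index last: each ufunc loops over strided memory."""
        c[..., 0] = a[..., 1]*b[..., 2] - a[..., 2]*b[..., 1]
        c[..., 1] = a[..., 2]*b[..., 0] - a[..., 0]*b[..., 2]
        c[..., 2] = a[..., 0]*b[..., 1] - a[..., 1]*b[..., 0]
        return c

    a, b, c = (np.random.random((3, N, N, N)) for _ in range(3))
    a2, b2, c2 = (np.random.random((N, N, N, 3)) for _ in range(3))

    print(timeit(lambda: cross_first(c, a, b), number=100))
    print(timeit(lambda: cross_last(c2, a2, b2), number=100))

With row-major (C) storage, the first layout should come out faster, consistent with the argument above.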

pdebuyl commented 8 years ago

Dear Mikael,

Thank you for the update. Referring to the "scientific Python" solver is a good solution.

The memory ordering point is probably nitpicking on my part. With respect to points 4 and 6, I believe they should be incorporated in the paper (this is the reason I asked the questions in the first place).

Modulo these additions, I am +1 on the paper.

Regards,

Pierre

mikaem commented 8 years ago

Dear Pierre,

Thank you for your final comments. A new update has been pushed with the final additions.

Best regards

Mikael

pdebuyl commented 8 years ago

Dear Mikael,

Thanks for the fast update. @NelleV this is ok for me.