Added two missing figures. Hope that does the trick.
It now builds fine! Thanks!
Dear Mikael,
your paper is well-written in general and easy to read.
There are a few minor editing comments I would like to make. One major comment relates to the comparison of the Python solver with the C++ implementation. I appreciate the comparison, but a better baseline would be the mentioned production-quality DNS solver (or a different, but highly tuned DNS solver of the same grade), together with a discussion of the C++ and Python performance with respect to that production code. Without it, it is hard to judge whether the C++ code delivers a reasonable level of performance.
Minor comments:
- `import` statements and all definitions

I hope this review is helpful. If there are questions, please let me know.
Kind regards, -michael
Dear Michael
Thank you very much for the feedback, and sorry about the late reply. I have made all the corrections that you addressed and updated the PR.
I appreciate your major comment about the C++ implementation. Unfortunately I do not have access to a C++ production code that I know to be high quality and highly tuned. Nevertheless, I have actually run some comparisons with open source C++ solvers like Tarang (http://arxiv.org/pdf/1203.5301.pdf) and found that my Cython/Python and C++ codes were usually at least 10-20 % faster. I did not really find the need to make this comparison in the paper, since I do not know the people developing Tarang and there may be factors in their installation, relevant to speed, that I am simply not aware of.

Consider Figure 2 of http://arxiv.org/pdf/1203.5301.pdf. Tarang is there running on the same computer as I used, at 50 seconds per time step for a box of size 1024^3 on 512 processors. My results in the paper do not show exactly the same case, but for the same box (1024^3) with twice as many cores (1024), my code runs at approximately 20 seconds per time step (see Figure 2, the slab decomposition with the largest mesh). Assuming perfect scaling, that would correspond to 40 seconds on 512 cores, which is 20 % faster than Tarang. Since you did bring it up, I have now put this admittedly weak comparison into the paper for people to see, but I have not made any direct comparisons. That should be ok, since it leaves out the uncertainty of installation, and it is the same computer.
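For clarity, the 20 % figure follows from this back-of-the-envelope arithmetic; a sketch using only the numbers quoted above and the perfect-scaling assumption:

# Numbers quoted above: 20 s/step on 1024 cores for the 1024^3 box (this paper),
# 50 s/step on 512 cores for Tarang (Fig. 2 of arXiv:1203.5301).
t_1024 = 20.0
t_512_est = 2 * t_1024                # perfect strong scaling: half the cores, twice the time
t_tarang_512 = 50.0
print(1 - t_512_est / t_tarang_512)   # 0.2, i.e. roughly 20 % faster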
Best regards
Mikael
@NelleV I can review this, is this needed?
Yes please.
Dear Mikael,
The paper is well written and strikes a good balance between the technical details and the principles of the simulation work. The comments below are thus rather focused. Also, they do not challenge the content of the article but are mostly aimed at clarifying or adding detail.
missing "and": page 2, section 3.3
The regular Python modules `numpy.fft`, `scipy.fftpack`, [pyfftw]_ all
should read
The regular Python modules `numpy.fft`, `scipy.fftpack` and [pyfftw]_ all
weird phrasing on page 4, section 5. The sentence
indicating that the MPI communications are performing at the level with C++
mixes "at the level of" and "with".
Overall, the paper is good and I am confident that it will reach a final version in a reasonable time :-)
Regards,
Pierre
Dear Pierre,
Thank you for your comments and for your time. I address your comments below and I have also updated the pull request.
The ordering is row-major (C ordering) and, I believe, optimal for this code, which relies heavily on NumPy ufuncs. Consider the implementation of the cross product:
def cross(c, a, b):
    """Regular c = a x b"""
    c[0] = a[1]*b[2] - a[2]*b[1]
    c[1] = a[2]*b[0] - a[0]*b[2]
    c[2] = a[0]*b[1] - a[1]*b[0]
    return c
Here $a[1]*b[2]$ is a NumPy ufunc operation that automatically loops over all trailing indices, $a[1, :, :, :]*b[2, :, :, :]$. What's more, the loop is over a contiguous memory block. If we reverse the order and place the "x, y, z" components last, by creating the data like, e.g., $a = zeros([N, N, N, 3])$, then the code becomes
def cross(c, a, b):
    c[..., 0] = a[..., 1]*b[..., 2] - a[..., 2]*b[..., 1]
    c[..., 1] = a[..., 2]*b[..., 0] - a[..., 0]*b[..., 2]
    c[..., 2] = a[..., 0]*b[..., 1] - a[..., 1]*b[..., 0]
    return c
and the NumPy ufuncs, like $a[..., 1]*b[..., 2]$, are no longer looping over contiguous memory blocks. To verify this I have created this gist. A Fortran implementation would probably be faster using this last option, due to the column-major storage.
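Since the gist itself is not linked in this thread, here is a minimal, self-contained timing sketch of the two layouts (the array size N and the timeit harness are illustrative choices of mine, not taken from the gist):

import numpy as np
from timeit import timeit

N = 128

def cross_first(c, a, b):
    """Component axis first: a[i] is a contiguous (N, N, N) block."""
    c[0] = a[1]*b[2] - a[2]*b[1]
    c[1] = a[2]*b[0] - a[0]*b[2]
    c[2] = a[0]*b[1] - a[1]*b[0]
    return c

def cross_last(c, a, b):
    """Component axis last: a[..., i] is a strided, non-contiguous view."""
    c[..., 0] = a[..., 1]*b[..., 2] - a[..., 2]*b[..., 1]
    c[..., 1] = a[..., 2]*b[..., 0] - a[..., 0]*b[..., 2]
    c[..., 2] = a[..., 0]*b[..., 1] - a[..., 1]*b[..., 0]
    return c

a = np.random.random((3, N, N, N))
b = np.random.random((3, N, N, N))
c = np.zeros((3, N, N, N))
al = np.moveaxis(a, 0, -1).copy()   # same data, components stored last
bl = np.moveaxis(b, 0, -1).copy()
cl = np.zeros((N, N, N, 3))

print("components first:", timeit(lambda: cross_first(c, a, b), number=10))
print("components last: ", timeit(lambda: cross_last(cl, al, bl), number=10))

The exact numbers will of course depend on N and on the hardware, but the layout difference is what the gist was meant to demonstrate.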
Dear Mikael,
Thank you for the update. Referring to the "scientific Python" solver is a good solution.
The memory ordering is probably nitpicking on my part. With respect to points 4 and 6, I believe that they should be incorporated in the paper (this is the reason I actually asked the questions).
Modulo these additions, I am +1 on the paper.
Regards,
Pierre
Dear Pierre,
Thank you for your final comments. A new update has been pushed with the final additions.
Best regards
Mikael
Dear Mikael,
Thanks for the fast update. @NelleV this is ok for me.
Dear Mikael,
There is an image missing: weak.png. Can you please commit it?
Many thanks, N