Closed by GiannisA 9 years ago
Thank you for the submission!
This is the same submission as #24. Which one should we review? Thanks, N
Hello, The one submitted yesterday is the one to review. It is similar to the previous one, but with several corrections. Thank you, Giannis
I will be reviewing this paper.
Here is my review:
Well written article. It clearly exposes the problem to solve and the different approaches followed to minimize the required computational time, first using NumPy, then Cython, then parallel Cython and finally PyOpenCL, which I agree is a nice optimization path.
I also liked the experimentation with a couple of data structures (a lookup table (LUT) and a CSR matrix) to cope with the problem. Oftentimes trying different structures is overlooked, so I am glad that this is not the case here.
Some comments:
Dear Francesc, Thank you for the time spent reviewing this proceedings paper. Jerome and I have gone through all the points of your review and corrected the paper accordingly:
In section 3.4.1 (OpenMP support in Cython) the authors say that the speed-up when going from 4 threads to 8 threads is very small, but it turns out that the speed-up from 2 to 4 threads is even smaller (1.44x vs 1.37x). OTOH, I would not call a 37% or 44% speed-up 'very small', but rather 'normal' when talking about parallel speed-ups. Given this, maybe it would be fairer to say something like 'beyond 2 threads the speed-up is not too high'.
In section 6.2 (OpenMP vs OpenCL) the authors say "the OpenCL implementation outperforms the OpenMP one [probably because of the] use of single precision with Kahan summation (in OpenCL) which is more efficient than double precision arithmetic", but do not say anything about why this is. It would be worth mentioning that the cause is probably that single precision requires less data to be transmitted to the CPU. At any rate, in order to compare apples with apples, it would have been nice to implement a single precision version of the OpenMP code.
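As an aside on this single-precision point: compensated (Kahan) summation keeps single-precision storage and bandwidth while recovering most of the accuracy of a double-precision accumulator. A minimal pure-Python sketch of the technique (this is an illustration only, not the paper's OpenCL kernel; the function name and test data are mine):

```python
import numpy as np

def kahan_sum(values):
    # Compensated (Kahan) summation in float32: a second float32
    # tracks the low-order bits lost at each addition.
    total = np.float32(0.0)
    comp = np.float32(0.0)
    for v in values:
        y = np.float32(v) - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

# Summing many identical small values: the naive float32 loop drifts,
# while the compensated sum stays close to the float64 reference.
data = np.full(100_000, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for v in data:
    naive = naive + v

reference = data.astype(np.float64).sum()  # "exact" reference
print(abs(float(naive) - reference), abs(float(kahan_sum(data)) - reference))
```

The compensated error is on the order of one float32 ulp of the result, whereas the naive accumulation error grows with the number of additions.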
Table 2 shows a pretty detailed description of the hardware used. However, IMO, memory bandwidth is a very important parameter to keep in mind when measuring performance, so I think it would be useful to add it to the table as well.
In the paper there is no mention of the amount of energy consumed by the different approaches, a factor which is becoming more and more interesting in HPC. It would be nice to have an assessment of it (but I recognize that currently it is not easy to measure that for every platform).
In section 6.4 (Kernel timings), the authors say "even at 2 ms processing per image, few hard-drives are able to deliver the requested Gigabytes per second of data this represents". Maybe it is worth mentioning how many bits per pixel these images have.
A minor glitch. The last sentence of section 8 ends with a comma.
This has been corrected.
Once again, thank you for your time. Best regards, Giannis Ashiotis
Excellent. Thanks for your rapid response!
Hi,
I will review your contribution.
The paper is well written, focused on the computational efficiency of an algorithm whose context is presented clearly.
The evolution of the project between NumPy, OpenMP, Cython and OpenCL is of broad interest and the present contribution is written in a way that it will reach an audience beyond the crystallography community.
Apart from the style and typo issues that are reported on the specific lines, here are general comments to be addressed:
Dear @FrancescAlted, thank you for the review.
Dear @GiannisA and @kif, please address the remaining (minor) comments; ideally your updates should arrive by mid-October.
Regards, Pierre
Hi Pierre,
Would it be OK if we put those 2 figures one on top of the other instead of side-by-side? This is likely to make the article longer, but if you agree it would be perfect.
For the second point, you are perfectly right: our processing takes about 2 ms (from the Python call to the answer in Python, all transfers included). The detector we are referring to is described here (the link may change...): https://www.dectris.com/specifications-279.html The Pilatus3-1M has a maximum frequency of about 500 Hz and a 2 ms read-out time (i.e. from the camera to the memory of the computer). As this detector has a 20-bit dynamic range, it writes data as signed 32-bit integers into memory, which actually amounts to 2 GB/s. Would you like us to write down all those figures?
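The 2 GB/s figure follows directly from the numbers above; a back-of-the-envelope check (all values taken from this discussion, rounded to the nominal "1M" pixel count):

```python
# Data rate of a Pilatus3-1M-class detector at full speed.
pixels_per_frame = 1_000_000   # ~1 Mpixel ("1M" detector)
bytes_per_pixel = 4            # 20-bit dynamic range stored as int32
frame_rate_hz = 500            # maximum frequency of the detector

rate = pixels_per_frame * bytes_per_pixel * frame_rate_hz
print(rate / 1e9, "GB/s")  # → 2.0 GB/s
```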
Maybe our table 3 was not clear: the full treatment consists of 8 operations: 1 upload, 4 kernel executions and 3 downloads, which can be profiled independently. The total time measured by OpenCL on the device is 1.4 ms, while the time measured using "timeit" is 2 ms, which allows us to evaluate the overhead of the Python call. We did not want to go too deep into the details of what each kernel does, as it may become very technical. But if you agree, we will add a couple of sentences explaining better what is going on.
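For reference, the overhead estimate can be read off directly from the two measurements quoted here; a trivial sketch of the arithmetic (numbers from this discussion):

```python
# Device-side time is the sum of the 8 profiled OpenCL operations
# (1 upload, 4 kernel executions, 3 downloads); host-side time is
# what "timeit" measures around the Python call. The difference is
# the overhead of the Python call itself.
device_time_ms = 1.4
host_time_ms = 2.0

overhead_ms = host_time_ms - device_time_ms
print(round(overhead_ms, 1), "ms of Python-call overhead")  # → 0.6 ms
```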
Giannis will integrate the modifications when he has a bit of time... I am busy releasing pyFAI :)
Thanks to you two for the review and the comments.
Hi Jérôme,
No problem for the change in figure.
For the processing time and data throughput, I think that adding the figures is better, as it will allow curious readers to check that they have them right.
Table 3 does not need the full details but could benefit from a few: state that ai.integrate1d is the Python routine and that its timing includes the complete process, that the H and D stand for host and device (if this is indeed correct) when a memory transfer occurs, and that the rest are compute kernels. "What each kernel was doing" is not needed.
I don't know when the release is due, but on our side we expect to publish the proceedings by the end of October / beginning of November. If you want to update the manuscript to mention the release before the publication, let us know.
Pierre
Hi Pierre,
The release is planned so that it can enter Debian 8, so it will happen within the next week. But we will not mention it, because it would imply re-doing all the benchmarks, and some computers are no longer available or have had their configuration changed.
Jerome Kieffer Jerome.Kieffer@terre-adelie.org
Hello, I think Jerome and I have addressed all of your comments. The paper should be ready now. Cheers, Giannis
I will review this.
I am unable to build the PDF from GiannisA's fork.
Do you want us to provide a build? I had no issue building them with Debian Jessie.
Hi @dpsanders, in the absence of an update from you, I performed the review myself. If you want to read the paper anyway, a PDF of all submissions (still a draft) is available at http://pdebuyl.be/tmp/esp2014_draft.pdf
OK, thanks and apologies.
Dear Giannis and Jérôme,
In Fig. 3, please fix the right/left indication in the caption to bottom/top. In the caption of table 3, please fix the capitalization: "OpenCl" to "OpenCL".
Else, the paper is ok for me!
Regards,
Pierre
Hi Giannis,
I made the modifications (in my fork) but they need to be merged by you and pushed.
Greetings to all, The changes have been incorporated. Ready for merging.
Hello, I am creating a pull request for the paper for the EuroSciPy 2014 proceedings on behalf of the main author Jerome Kieffer and myself, Giannis Ashiotis. Thank you