euroscipy / euroscipy_proceedings

Tools used to generate the SciPy conference proceedings

Paper: PyFAI: a Python library for high performance azimuthal integration on GPU (J. Kieffer - G. Ashiotis) #28

Closed GiannisA closed 9 years ago

GiannisA commented 10 years ago

Hello, I am creating a pull request for the paper for the EuroSciPy 2014 proceedings, on behalf of the main author Jerome Kieffer and myself, Giannis Ashiotis. Thank you!

NelleV commented 10 years ago

Thank you for the submission!

NelleV commented 10 years ago

This is the same submission as #24 . Which one should we review? Thanks, N

GiannisA commented 10 years ago

Hello, The one submitted yesterday is the one to review. It is similar to the previous, but with several corrections. Thank you Giannis


FrancescAlted commented 10 years ago

I will be reviewing this paper.

FrancescAlted commented 10 years ago

Here is my review:

A well-written article. It clearly exposes the problem to be solved and the different approaches followed to minimize the required computational time: first NumPy, then Cython, then parallel Cython and finally PyOpenCL, which I agree is a nice optimization path.

I also liked the experimentation with a couple of data structures (a lookup table (LUT) and a CSR sparse matrix) to cope with the problem. Trying different structures is often overlooked, so I am glad that this is not the case here.
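For readers unfamiliar with the CSR formulation mentioned above, it reduces azimuthal integration to a sparse matrix-vector product: one row per radial bin, one column per pixel. A minimal NumPy/SciPy sketch on a toy detector (the geometry and all names are illustrative, not pyFAI's actual API):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy detector: distance of each pixel from an assumed beam centre
shape = (64, 64)
y, x = np.indices(shape)
r = np.hypot(x - 31.5, y - 31.5).ravel()

# Assign every pixel to one of nbins radial bins
nbins = 16
bins = np.minimum((r / r.max() * nbins).astype(np.intp), nbins - 1)

# Sparse matrix C with C[b, p] = 1 when pixel p falls in bin b
npix = r.size
C = csr_matrix((np.ones(npix), (bins, np.arange(npix))),
               shape=(nbins, npix))

# Azimuthal integration of an image = per-bin sum / per-bin pixel count
img = np.random.default_rng(0).random(npix)
I_of_q = (C @ img) / (C @ np.ones(npix))
```

pyFAI's real implementation additionally handles pixel splitting (fractional weights rather than 0/1 entries) and runs the matvec in an OpenCL kernel, but the underlying data structure is the same.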

Some comments:

GiannisA commented 10 years ago

Dear Francesc, Thank you for the time spent reviewing this proceedings paper. Jerome and I have gone through all the points of your review and corrected the paper accordingly:


In section 3.4.1 (OpenMP support in Cython) the authors say that the speed-up when going from 4 to 8 threads is very small, but it turns out that the speed-up from 2 to 4 threads is even smaller (1.44x vs 1.37x). OTOH, I would not call a 37% or 44% speed-up 'very small', but rather 'normal' when talking about parallel speed-ups. Given this, maybe it would be fairer to say something like 'beyond 2 threads the speed-up is not too high'.

You are right about the speed-ups. We have made the appropriate changes.
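For reference, the incremental speed-ups discussed above are plain ratios of wall-clock times at successive thread counts; a tiny sketch with invented timings (the millisecond values below are hypothetical, only the ratio logic matters):

```python
# Hypothetical wall-clock timings (ms) for 1, 2, 4 and 8 OpenMP threads;
# the absolute values are made up for illustration.
timings = {1: 100.0, 2: 58.0, 4: 40.3, 8: 29.4}

threads = sorted(timings)
# Speed-up of each step relative to the previous thread count
steps = {f"{a}->{b}": timings[a] / timings[b]
         for a, b in zip(threads, threads[1:])}
```

Comparing these step-wise ratios (rather than speed-up relative to a single thread) is what reveals where scaling flattens out.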

In section 6.2 (OpenMP vs OpenCL) the authors say that "the OpenCL implementation outperforms the OpenMP one [probably because of the] use of single precision with Kahan summation (in OpenCL) which is more efficient than double precision arithmetic", but they do not say why this is; it would be worth mentioning that the cause is probably that single precision requires less data to be transmitted to the CPU. At any rate, in order to compare apples with apples it would have been nice to implement a single-precision variant of the OpenMP version.

Following your advice we made tests with a single-precision OpenMP version of the code and discovered that there was no clear evidence to support that statement, so it was removed.
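Kahan (compensated) summation, referred to above, carries a small correction term so that a single-precision accumulation does not drift the way a naive float32 loop does; a minimal pure-Python/NumPy sketch (illustrative only, not pyFAI's actual OpenCL kernel):

```python
import numpy as np

def kahan_sum(values):
    """Compensated (Kahan) summation in float32: track the rounding
    error of each addition and feed it back into the next one."""
    total = np.float32(0.0)
    comp = np.float32(0.0)               # running compensation term
    for v in values:
        y = np.float32(v) - comp         # corrected next term
        t = np.float32(total + y)        # new (rounded) running sum
        comp = np.float32(t - total) - y # what was lost in the addition
        total = t
    return total

rng = np.random.default_rng(42)
data = rng.random(20_000).astype(np.float32)

# Double-precision reference for comparison
exact = float(np.sum(data, dtype=np.float64))
compensated = float(kahan_sum(data))
```

The compensated result stays within a couple of units in the last place of the float32 total, whereas a naive float32 loop accumulates rounding error that grows with the number of terms; this is what lets a single-precision kernel remain accurate.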

Table 2 shows a pretty detailed description of the hardware used. However, IMO, memory bandwidth is a very important parameter to keep in mind when measuring performance, so I think it would be useful to add it to the table as well.

We have calculated and added the theoretical memory bandwidth (in GB/s) for the devices used.

In the paper there is no mention of the amount of energy consumed by the different approaches, a factor which is becoming more and more interesting in HPC. It would be nice to have an assessment of it (though I recognize that it is currently not easy to measure on every platform).

At the time of the conference we were unable to perform such tests. Of course such tests are of interest and could be scheduled for the future, at least on the platforms where this is possible.

In section 6.4 (Kernel timings), the authors say "even at 2 ms processing per image, few hard-drives are able to deliver the requested Gigabytes per second of data this represents". Maybe it is worth mentioning how many bits per pixel these images have.

Information on that has been added at the beginning of the Benchmarks paragraph.

A minor glitch. The last sentence of section 8 ends with a comma.

This has been corrected.

Once again, thank you for your time. Best regards, Giannis Ashiotis


FrancescAlted commented 10 years ago

Excellent. Thanks for your rapid response!

pdebuyl commented 10 years ago

Hi,

I will review your contribution.

pdebuyl commented 10 years ago

The paper is well written, focused on the computational efficiency of an algorithm whose context is presented clearly.

The evolution of the project between NumPy, OpenMP, Cython and OpenCL is of broad interest and the present contribution is written in a way that it will reach an audience beyond the crystallography community.

Apart from the style and typo issues that are reported on the specific lines, here are general comments to be addressed:

pdebuyl commented 10 years ago

Dear @FrancescAlted , thank you for the review.

Dear @GiannisA and @kif , please address the remaining (minor) comments; ideally your updates should arrive by mid-October.

Regards, Pierre

kif commented 9 years ago

Hi Pierre,

Would it be OK if we put those two figures one on top of the other instead of side-by-side? This is likely to make the article longer, but if you agree it would be perfect.

For the second point, you are perfectly right: our processing takes about 2 ms (from the Python call to the answer in Python, all transfers included). The detectors we are referring to are described here (the link may change...): https://www.dectris.com/specifications-279.html The Pilatus3 1M has a maximum frequency of about 500 Hz and a 2 ms read-out time (i.e. from the camera to the memory of the computer). As this detector has a 20-bit dynamic range, it writes data as signed 32-bit integers into memory, which actually makes 2 GB/s. Would you like us to write down all those figures?
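The figures quoted above combine directly: a roughly one-megapixel detector writing 32-bit integers at 500 frames per second gives 2 GB/s. A quick sanity check (pixel count rounded to one million for illustration):

```python
pixels = 1_000_000        # Pilatus3 1M: roughly one megapixel
bytes_per_pixel = 4       # 20-bit dynamic range stored as signed int32
frame_rate_hz = 500       # maximum frame rate quoted above

# Sustained data rate the acquisition chain must absorb, in GB/s
throughput_gbs = pixels * bytes_per_pixel * frame_rate_hz / 1e9
```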

Maybe our table 3 was not clear: the full treatment consists of 8 operations (1 upload, 4 kernel executions and 3 downloads), which can be profiled independently. The total time measured by OpenCL on the device is 1.4 ms, while the time measured using "timeit" is 2 ms, which allows us to evaluate the overhead of the Python call. We did not want to go too deep into the details of what each kernel does, as it may become very technical, but if you agree we can add a couple of sentences explaining better what is going on.

Giannis will integrate the modifications as he has a bit of time... I am busy releasing pyFAI :)

Thanks to you two for the review and the comments.

pdebuyl commented 9 years ago

Hi Jérôme,

No problem for the change in figure.

For the processing time and data throughput, I think that adding the figures is better, as it will allow curious readers to check that they have them right.

Table 3 does not need the full details but could benefit from a few: state that ai.integrate1d is the Python routine and that its timing includes the complete process, that H and D stand for host and device (if that is indeed correct) when a memory transfer occurs, and that the rest are compute kernels. "What each kernel was doing" is not needed.

I don't know when the release is due, but on our side we expect to publish the proceedings by the end of October / beginning of November. If you want to update the manuscript to mention the release before publication, let us know.

Pierre

kif commented 9 years ago


Hi Pierre,

The release is planned so that it can enter Debian 8, so it will happen within the next week. But we will not mention it, because that would imply re-doing all the benchmarks, and some computers are no longer available or have had their configuration changed.

Cheers,

Jerome Kieffer Jerome.Kieffer@terre-adelie.org

GiannisA commented 9 years ago

Hello, I think Jerome and I have addressed all of your comments. The paper should be ready now. Cheers, Giannis

dpsanders commented 9 years ago

I will review this.

dpsanders commented 9 years ago

I am unable to build the PDF from GiannisA's fork.

kif commented 9 years ago

Do you want us to provide a build? I had no issue building them on Debian Jessie.

pdebuyl commented 9 years ago

Hi @dpsanders, in the absence of an update from you I performed the review myself. If you want to read the paper anyway, a PDF of all the submissions (thus still a draft) is available at http://pdebuyl.be/tmp/esp2014_draft.pdf

dpsanders commented 9 years ago


OK, thanks and apologies.

pdebuyl commented 9 years ago

Dear Giannis and Jérôme,

In Fig. 3, please fix the right/left indication in the caption to bottom/top. In the caption of table 3, please fix the capitalization: "OpenCl" to "OpenCL".

Otherwise, the paper is OK for me!

Regards,

Pierre

kif commented 9 years ago

Hi Giannis,

I made the modifications (in my fork) but they need to be merged by you and pushed.

Cheers,

Jerome Kieffer Jerome.Kieffer@terre-adelie.org

GiannisA commented 9 years ago

Greetings to all, The changes have been incorporated. Ready for merging