[Paper] Using the pyMIC Offload Module in PyFR. Michael Klemm

mjklemm commented 9 years ago

Please get in touch with me about the copyright notice at the bottom of the first page. It needs to be slightly adapted. Thanks!

FrancescAlted commented 8 years ago

This is a very well-written article, with a detailed description of what PyFR and PyMIC are and how they work. Some feedback:

Provided that PyFR has support for CUDA/OpenCL, it would have been nice to add some benchmarks using these backends on top of NVIDIA / AMD / Intel GPUs. Just comparing Xeon Phi against a regular Xeon CPU, although informative, is not bringing an idea on how it fares against other PCIe cards (its real competitor).
Although Mako is a nice templating system, it is really more meant for HTML pages, and I find PyFR's expression templates kind of difficult to read (and hence, maintain). Perhaps using something like pyparser or PLY would have been more adequate to create a DSL template that would more readable. At any rate, this has little to do with the article, and I am simply curious on why Mako was chosen.

Other than that, and as said, great article.

mjklemm commented 8 years ago

Dear all,

Thanks for the comments on our paper. Let me take the opportunity to respond and give additional insight.

On 15.11.2015, at 08:09, FrancescAlted notifications@github.com<mailto:notifications@github.com> wrote:

This is a very well-written article, with a detailed description of what PyFR and PyMIC are and how they work. Some feedback:

Provided that PyFR has support for CUDA/OpenCL, it would have been nice to add some benchmarks using these backends on top of NVIDIA / AMD / Intel GPUs. Just comparing Xeon Phi against a regular Xeon CPU, although informative, is not bringing an idea on how it fares against other PCIe cards (its real competitor).

Unfortunately, we are legally obliged to not do such comparisons at Intel and publish these results in a research paper. So, that's why we stayed away from adding such a comparison to the paper.

Although Mako is a nice templating system, it is really more meant for HTML pages, and I find PyFR's expression templates kind of difficult to read (and hence, maintain). Perhaps using something like pyparser or PLY would have been more adequate to create a DSL template that would more readable. At any rate, this has little to do with the article, and I am simply curious on why Mako was chosen.

We will look into either adding a reference to an existing paper about PyFR that contains information about this design decision that has been made.

Other than that, and as said, great article.

Glad that you liked it :-)

Kind regards -michael

Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928

FrancescAlted commented 8 years ago

Unfortunately, we are legally obliged to not do such comparisons at Intel and publish these results in a research paper. So, that's why we stayed away from adding such a comparison to the paper.

I see. Well, as a reader, I could consider that avoiding comparisons against the competitors may not be the best policy in the sense that does not show too much confidence in your own product. But I understand that there will always be the temptation of not putting the same effort in squeezing the performance of all the platforms at stake. So I think I kind of understand Intel's position.

FrancescAlted commented 8 years ago

Thinking more about your article, there is another point that intrigues me. In your benchmark section, you reported that the transfer time between the CPU master and Xeon Phi is non-negligible. Have you thought in using compression so that this transfer time can be minimized? As we are talking about binary data, I suppose something like Blosc could be helpful (specially for inbound transfers).

mjklemm commented 8 years ago

Hi,

This is certainly an interesting suggestion. In order to be effective it would be necessary to first tune the compression and decompression routines for the 61 cores of the Xeon Phi. To the best of our knowledge no off the shelf libraries -- including Blosc -- are capable of scaling out to this level.

Our opinion, therefore, is that currently the most effective technique for minimizing the impact of data transfers is to perform them asynchronously and have them overlap with useful computation. Such a technique can be applied to a variety of algorithms and unlike compression does not consume cycles. Support for fully asynchronous transfers is on the roadmap for a future version of pyMIC.

Cheers, -michael

Dr.-Ing. Michael Klemm Senior Application Engineer Software and Services Group Developer Relations Division Phone +49 89 9914 2340 Cell +49 174 2417583

From: FrancescAlted [mailto:notifications@github.com] Sent: Sunday, November 15, 2015 1:11 PM To: euroscipy/euroscipy_proceedings Cc: Klemm, Michael Subject: Re: [euroscipy_proceedings] [Paper] Using the pyMIC Offload Module in PyFR. Michael Klemm (#45)

Thinking more about your article, there is another point that intrigues me. In your benchmark section, you reported that the transfer time between the CPU master and Xeon Phi is non-negligible. Have you thought in using compression so that this transfer time can be minimized? As we are talking about binary data, I suppose something like Blosc could be helpful (specially for inbound transfers).

— Reply to this email directly or view it on GitHubhttps://github.com/euroscipy/euroscipy_proceedings/pull/45#issuecomment-156844458. Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928

FrancescAlted commented 8 years ago

2015-11-16 16:03 GMT+01:00 Michael Klemm notifications@github.com:

This is certainly an interesting suggestion. In order to be effective it would be necessary to first tune the compression and decompression routines for the 61 cores of the Xeon Phi. To the best of our knowledge no off the shelf libraries -- including Blosc -- are capable of scaling out to this level.

Well I was thinking more on sending compressed chunks (< 256 KB so that they can fit in L2) to every core and that this core would be responsible to decompress the chunk. This way you can scale out with no need to distribute work explicitly.

Our opinion, therefore, is that currently the most effective technique for minimizing the impact of data transfers is to perform them asynchronously and have them overlap with useful computation. Such a technique can be applied to a variety of algorithms and unlike compression does not consume cycles. Support for fully asynchronous transfers is on the roadmap for a future version of pyMIC.

Yes, exactly my thoughts, but introducing compression at the end of the pipe, in one of the Phi cores. This certainly would consume CPU cycles, but some of these cycles are definitely lost because they are waiting for data arrival. On the other hand, compression would reduce the data size, saving cycles too. If the tradeoff between the saved cycles in data transmission and the ones required for decompression would be positive, then compression can be used for an advantage. If the tradeoff is negative, then compression would hurt performance. What I am saying is that doing a study on this would be interesting.

Cheers,

Francesc Alted

mjklemm commented 8 years ago

Hi,

Good suggestion. I will look into this!

Cheers, -michael

Dr.-Ing. Michael Klemm Senior Application Engineer Software and Services Group Developer Relations Division Phone +49 89 9914 2340 Cell +49 174 2417583

From: FrancescAlted [mailto:notifications@github.com] Sent: Monday, November 16, 2015 10:30 AM To: euroscipy/euroscipy_proceedings Cc: Klemm, Michael Subject: Re: [euroscipy_proceedings] [Paper] Using the pyMIC Offload Module in PyFR. Michael Klemm (#45)

2015-11-16 16:03 GMT+01:00 Michael Klemm notifications@github.com:

This is certainly an interesting suggestion. In order to be effective it would be necessary to first tune the compression and decompression routines for the 61 cores of the Xeon Phi. To the best of our knowledge no off the shelf libraries -- including Blosc -- are capable of scaling out to this level.

Well I was thinking more on sending compressed chunks (< 256 KB so that they can fit in L2) to every core and that this core would be responsible to decompress the chunk. This way you can scale out with no need to distribute work explicitly.

Our opinion, therefore, is that currently the most effective technique for minimizing the impact of data transfers is to perform them asynchronously and have them overlap with useful computation. Such a technique can be applied to a variety of algorithms and unlike compression does not consume cycles. Support for fully asynchronous transfers is on the roadmap for a future version of pyMIC.

Yes, exactly my thoughts, but introducing compression at the end of the pipe, in one of the Phi cores. This certainly would consume CPU cycles, but some of these cycles are definitely lost because they are waiting for data arrival. On the other hand, compression would reduce the data size, saving cycles too. If the tradeoff between the saved cycles in data transmission and the ones required for decompression would be positive, then compression can be used for an advantage. If the tradeoff is negative, then compression would hurt performance. What I am saying is that doing a study on this would be interesting.

Cheers,

Francesc Alted

— Reply to this email directly or view it on GitHubhttps://github.com/euroscipy/euroscipy_proceedings/pull/45#issuecomment-157088836. Intel Deutschland GmbH Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de Managing Directors: Christin Eisenschmid, Christian Lamprechter Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928

aterrel commented 8 years ago

This paper gives an analysis of using the MIC architecture for the PyFR project. It does a good job of giving an overview of the architecture and porting a simple problem to evaluate the architecture. I think it to be an excellent contribution to EuroSciPy.

The only minor complaint I have, if I am reading the benchmarks correctly, is that they are comparing their specialized matrix computation for Riemann problems to a general matrix matrix multiplication. While it is always a good way to show how your specialized approach beats general an approach, it is good to state this rather than the bolder claim "It is shown by utilising pyMIC in combination with MKL how it is possible to obtain a substantial speedup for dgemm". Additionally, MKL has reported many times hitting the 1TFLOP performance for dgemm, which is substantially higher than the microbenchmarks in this report attain. Is there a reason for this discrepency? It seems you either are not computing a real dgemm or you are not using the Xeon Phi effectively for the benchmark.

Notes:

"However, this requires that Python, along with dependencies such as NumPy, be cross-compiled for the Intel coprocessor; a significant undertaking." -- We did exactly this at TACC for KNC. It was significantly faster but not really something we wanted to maintain.

Minor Issues:

"house decrete extension cards" -> "house discrete extension cards"

Additional references to consider:

This paper used a method of OpenCL generated kernels from Python to speed a massive parallel advection equation on MICs by a similar amount. http://arxiv.org/abs/1308.1472
- Relavant code: code-generator https://github.com/IgnitionProject/ignition , Riemann evaluation: https://github.com/ManyClaw/ManyClaw

mjklemm commented 8 years ago

Dear all, we have addressed most of the reviewer comments.

W.r.t to the usage of Mako: The first releases of PyFR did not have a DSL; rather each kernel was written individually for each backend with Mako simply being used as a superior C pre-processor. The decision to migrate towards a DSL -- which enables kernels to be shared between platforms -- was made during the development of v0.2.0.

In order to maintain compatibility with the existing frameworks within PyFR the DSL had to be bolted-on to the existing Mako templating engine. This results in some of the unusual syntax for defining things such as kernels and their arguments.

It is possible that a different mechanism chosen if PyFR was written with a DSL in mind from day one.

Kind regards,

Freddie & Michael

euroscipy / euroscipy_proceedings