JuliaCon / proceedings-review


[REVIEW]: Distributed Parallelization of xPU Stencil Computations in Julia #137

Open whedon opened 7 months ago

whedon commented 7 months ago

Submitting author: @omlins (Samuel Omlin)
Repository: https://github.com/omlins/ImplicitGlobalGrid.jl
Branch with paper.md (empty if default branch):
Version:
Editor: @fcdimitr
Reviewers: @mloubout, @georgebisbas
Archive: Pending

Status


Status badge code:

HTML: <a href="https://proceedings.juliacon.org/papers/d1c21de0f6f0bde752fab8428d2fa186"><img src="https://proceedings.juliacon.org/papers/d1c21de0f6f0bde752fab8428d2fa186/status.svg"></a>
Markdown: [![status](https://proceedings.juliacon.org/papers/d1c21de0f6f0bde752fab8428d2fa186/status.svg)](https://proceedings.juliacon.org/papers/d1c21de0f6f0bde752fab8428d2fa186)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@mloubout & @georgebisbas, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. If you have any questions or concerns, please let @fcdimitr know.

Please start on your review when you are able, and be sure to complete it within the next six weeks at the very latest.

Review checklist for @mloubout

- Conflict of interest
- Code of Conduct
- General checks
- Functionality
- Documentation
- Paper format
- Content

Review checklist for @georgebisbas

- Conflict of interest
- Code of Conduct
- General checks
- Functionality
- Documentation
- Paper format
- Content

whedon commented 7 months ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @mloubout, @georgebisbas, it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/JuliaCon/proceedings-review) repository. As a reviewer, you're probably currently watching this repository, which means that with GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this, do the following two things:

  1. Set yourself as 'Not watching' https://github.com/JuliaCon/proceedings-review:


  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications


For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 7 months ago

Failed to discover a Statement of need section in paper

whedon commented 7 months ago

Wordcount for paper.tex is 1110

whedon commented 7 months ago
Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.04 s (1395.4 files/s, 185772.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Julia                           24            368            277           2939
TeX                              8            244            177           2261
Markdown                         3             76              0            244
Bourne Shell                    11             78             84            132
YAML                             3              3              0             85
Ruby                             1              8              4             45
TOML                             3              4              0             27
-------------------------------------------------------------------------------
SUM:                            53            781            542           5733
-------------------------------------------------------------------------------

Statistical information for the repository '74fa0dfe549fde1b7d80b240' was
gathered on 2023/11/06.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Samuel Omlin                     3           114             57          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Samuel Omlin                 57           50.0          0.0                7.02

whedon commented 7 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 7 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/SC41405.2020.00062 is OK
- ??? is OK
- 10.21105/jcon.00068 is OK

MISSING DOIs

- 10.1137/141000671 may be a valid DOI for title: Julia: A fresh approach to numerical computing
- 10.1109/tpds.2018.2872064 may be a valid DOI for title: Effective extensible programming: unleashing Julia on GPUs

INVALID DOIs

- None

whedon commented 7 months ago

:wave: @mloubout, please update us on how your review is going (this is an automated reminder).

whedon commented 7 months ago

:wave: @georgebisbas, please update us on how your review is going (this is an automated reminder).

mloubout commented 7 months ago

Review

The abstract is written clearly and conveys the message efficiently. The motivation and design principles are laid out clearly. The results show good performance and validate the author's claims.

Some minor comments:

very minor:

With these addressed, this should be accepted for publication.

georgebisbas commented 7 months ago

It seems like the repository https://github.com/omlins/ImplicitGlobalGrid.jl does not have an active "Issues" GitHub tab. Is that expected?

omlins commented 6 months ago

> It seems like the repository https://github.com/omlins/ImplicitGlobalGrid.jl does not have an active "Issues" GitHub tab. Is that expected?

Yes, this is expected: it is a fork created for the sole purpose of publishing this paper. The original repository is https://github.com/eth-cscs/ImplicitGlobalGrid.jl.

georgebisbas commented 6 months ago

> > It seems like the repository https://github.com/omlins/ImplicitGlobalGrid.jl does not have an active "Issues" GitHub tab. Is that expected?
>
> Yes, this is expected: it is a fork created for the sole purpose of publishing this paper. The original repository is https://github.com/eth-cscs/ImplicitGlobalGrid.jl.

Thanks @omlins. Where should we open an Issue for the paper according to the instructions? At the original repo?

omlins commented 6 months ago

I do not think the instructions specify this for such a configuration. Thus, I would suggest opening issues concerning the manuscript on the fork (https://github.com/omlins/ImplicitGlobalGrid.jl) and potential issues concerning the package on the original repository (https://github.com/eth-cscs/ImplicitGlobalGrid.jl).

georgebisbas commented 6 months ago

> I do not think the instructions specify this for such a configuration. Thus, I would suggest opening issues concerning the manuscript on the fork (https://github.com/omlins/ImplicitGlobalGrid.jl) and potential issues concerning the package on the original repository (https://github.com/eth-cscs/ImplicitGlobalGrid.jl).

Now I can see an Issues tab (https://github.com/omlins/ImplicitGlobalGrid.jl/issues), which was not there before. Thanks!

omlins commented 6 months ago

> > I do not think the instructions specify this for such a configuration. Thus, I would suggest opening issues concerning the manuscript on the fork (https://github.com/omlins/ImplicitGlobalGrid.jl) and potential issues concerning the package on the original repository (https://github.com/eth-cscs/ImplicitGlobalGrid.jl).
>
> Now I can see an Issues tab (https://github.com/omlins/ImplicitGlobalGrid.jl/issues), which was not there before. Thanks!

Yes, I enabled it in response to your question.

georgebisbas commented 5 months ago

ImplicitGlobalGrid.jl offers an approach to automate distributed memory parallelism for stencil computations in Julia, leveraging a high-level symbolic interface and presenting highly competitive performance.

This work has a high overlap with https://github.com/JuliaCon/proceedings-review/issues/138

As discussed in the ParallelStencil.jl review, some changes should be applied to publish these as separate works. I understand that this work extends the optimization suite available in ParallelStencil.jl by allowing the use of automated distributed-memory parallelism. Since both packages will be published independently, mentioning that the DMP is built upon ParallelStencil.jl would not hurt.

I suggest avoiding the term xPU in the title. Given the recent rise of “exotic” accelerators like Google’s TPU, Graphcore’s IPU, or the expected QPUs, it probably would not hurt to simply use:

"Distributed Parallelization of Stencil Computations in Julia"

This is clarified at the end of the Introduction. However, I still advocate avoiding this term. It may also not help when someone searches for your work in the future.

The package's functionality is confirmed, and the installation, testing, and usage documentation is present and well-defined. The paper is well-written, clearly presenting the motivation and example usage of the package. It may be good to state explicitly that users get automated DMP by only slightly editing the source code of a solver; i.e., the text could emphasize the value of the symbolic interface a bit more.

I am interested in how the communication/computation overlap is performed. Would it be easy to add some more details about this in Section 2?

The diffusion example presented is well-written. Minor comments (not necessarily to be addressed, but possibly interesting points for future work and further discussion):

Performance evaluation:

Good to see that you are using a non-standard stencil kernel in Figure 3. Please add more information on this stencil kernel's computational and memory requirements.

I also second the comments posted by @mloubout.

Assuming the above concerns are addressed, this should be accepted for publication.

omlins commented 4 months ago

@mloubout: Thank you very much for the thorough review. We are working on addressing the reviews. Below you can find replies to two of the issues you raised. We will respond to the remaining issues along with the improvements to the manuscript.

"We observe a parallel efficiency of 93% on 2197 GPUs." this is an odd number of GPUs to use for scaling, why not use 1024 similar to the parallel efficiency

The reason for scaling up to 2197 GPUs or nodes is that 13^3 (= 2197) nodes is the largest cubic node topology that can be submitted in the normal queue of Piz Daint (up to 2400 nodes; the next cube, 14^3 = 2744, exceeds this limit; see https://user.cscs.ch/access/running/piz_daint/#slurm-batch-queues). The number 2197 may be surprising when left unexplained, but there is nothing problematic about it, as the reported parallel efficiency confirms. Thus, we would like to add an explanation while keeping the plot as is.

> Fig. 3 could use the performance comparison, since the author admits that the CUDA implementation outperforms the Julia implementation and this figure alone is slightly misleading.

Thank you for pointing this out; we will try to emphasize the performance difference.

omlins commented 4 months ago

@georgebisbas: Thank you very much for the thorough review. We are working on addressing the reviews. Below you can find replies to many of the issues you raised. We will respond to the remaining issues along with the improvements to the manuscript.

> ImplicitGlobalGrid.jl offers an approach to automate distributed memory parallelism for stencil computations in Julia, leveraging a high-level symbolic interface and presenting highly competitive performance. (...) I understand that this work extends the optimization suite available in ParallelStencil.jl by allowing the use of automated distributed-memory parallelism. Since both packages will be published independently, mentioning that the DMP is built upon ParallelStencil.jl would not hurt.

Thank you for this statement; it makes clear that we need to emphasize in the document that ImplicitGlobalGrid.jl is not built upon ParallelStencil.jl. ImplicitGlobalGrid.jl does not in any way assume or require that codes using it also use ParallelStencil.jl (and vice versa). In the documentation, there is a multi-GPU example without ParallelStencil.jl (https://github.com/eth-cscs/ImplicitGlobalGrid.jl?tab=readme-ov-file#50-lines-multi-gpu-example). In addition to pointing this out, we will also refer to this example in the documentation.

The three main reasons why we chose an example using ParallelStencil.jl are 1) to illustrate interoperability with ParallelStencil.jl, which is a natural choice in Julia for the task at hand, 2) to achieve high performance on a single node, and 3) to illustrate the following: "all data transfers are performed on non-blocking high-priority streams or queues, allowing to overlap the communication optimally with computation. ParallelStencil.jl, e.g., can do so with a simple macro call (Fig. 1, line 36)" (page 2, paragraph 1).

> It may be good to state explicitly that users get automated DMP by only slightly editing the source code of a solver; i.e., the text could emphasize the value of the symbolic interface a bit more.

The previous point showed that we had failed to point out that ImplicitGlobalGrid.jl is not built upon ParallelStencil.jl, which unfortunately surfaces again in this comment: the symbolic interface is not part of ImplicitGlobalGrid.jl but of ParallelStencil.jl, and it plays no role in the distributed parallelization (with the exception of the @hide_communication feature of ParallelStencil.jl, which can optionally be used, without ImplicitGlobalGrid.jl being aware of it or affected by it). The almost trivial distributed-memory parallelization is achieved with calls to the functions provided by ImplicitGlobalGrid.jl and is described here: "as little as three functions can be enough to transform a single xPU application into a massively scaling multi-xPU application: a first function creates the implicit global staggered grid, a second function performs a halo update on it, and a third function finalizes the global grid. Fig. 1 shows a stencil-based 3-D heat diffusion xPU solver, where distributed parallelization is achieved with these three ImplicitGlobalGrid functions (lines 23, 38 and 43) plus some additional functions to query the size of the global grid (lines 24-26; note that shared memory parallelization is performed with ParallelStencil [3])" (page 1, last paragraph).
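For illustration, here is a minimal sketch of this three-function pattern, condensed from the documentation's multi-GPU example (the grid sizes are illustrative, the stencil update itself is elided, and the CUDA backend is assumed):

```julia
using ImplicitGlobalGrid, CUDA

function diffusion3D()
    nx = ny = nz = 512                        # local (per-process) grid size
    me, dims = init_global_grid(nx, ny, nz)   # (1) create the implicit global staggered grid
    T = CUDA.zeros(Float64, nx, ny, nz)       # local field, including the overlap/halo cells
    # ... initialize T using the global grid size queried via nx_g(), ny_g(), nz_g() ...
    for it = 1:1000
        # ... perform one stencil update of T (e.g., a 3-D heat diffusion step) ...
        update_halo!(T)                       # (2) halo update with the neighboring processes
    end
    finalize_global_grid()                    # (3) finalize the global grid (and MPI)
end

diffusion3D()
```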

> I am interested in how the communication/computation overlap is performed. Would it be easy to add some more details about this in Section 2?

As noted in the previous two points, ImplicitGlobalGrid.jl does not itself perform communication/computation overlap, but it performs the communication in a way that enables other packages to do so easily and optimally: "all data transfers are performed on non-blocking high-priority streams or queues" (page 2, paragraph 1). Other packages, e.g. ParallelStencil.jl, can leverage this to overlap communication with computation: "ParallelStencil.jl, e.g., can do so with a simple macro call (Fig. 1, line 36)" (page 2, paragraph 1). Thus, what ImplicitGlobalGrid.jl does concerning communication/computation overlap is already described.
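As a hedged illustration of that macro call (the boundary-width tuple is an assumption based on typical ParallelStencil.jl usage, and diffusion3D_step! with its arguments is a hypothetical kernel):

```julia
# Overlap as enabled by ParallelStencil.jl: the boundary regions are computed first
# so that their halo update (performed by ImplicitGlobalGrid.jl on non-blocking
# high-priority streams) overlaps with the computation of the inner points.
@hide_communication (16, 2, 2) begin
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    update_halo!(T2)
end
```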

> I suggest avoiding the term xPU in the title. Given the recent rise of “exotic” accelerators like Google’s TPU, Graphcore’s IPU, or the expected QPUs, it probably would not hurt to simply use: "Distributed Parallelization of Stencil Computations in Julia"

Thank you for the suggestion. We understand your point. However, if we removed the 'xPU', many people would assume from the title that this work applies only to CPUs. The 'xPU' is important for us to emphasize the backend portability, and we consider it normal that the 'x' of 'xPU' refers only to a subset of what exists. Furthermore, ImplicitGlobalGrid.jl was designed to be easily extendable with new backends, whether for other kinds of GPUs or other kinds of xPUs (and we will certainly add new backends in the future for any hardware of interest to us). We will try to emphasize this more.

> Why are you using 2197 GPUs?

The reason for scaling up to 2197 GPUs or nodes is that 13^3 (= 2197) nodes is the largest cubic node topology that can be submitted in the normal queue of Piz Daint (up to 2400 nodes; the next cube, 14^3 = 2744, exceeds this limit; see https://user.cscs.ch/access/running/piz_daint/#slurm-batch-queues). The number 2197 may be surprising when left unexplained, but there is nothing problematic about it, as the reported parallel efficiency confirms. Thus, we would like to add an explanation while keeping the plot as is.

> Why is CPU scaling not presented since the paper claims xPUs?

Given that CPU-only supercomputers are becoming rare and given that the challenge is higher when GPUs are involved, we considered it sufficient to focus on GPUs in this short paper. We see the interest of the CPU functionality mostly for code validation and possibly prototyping.

> Why is strong scaling not presented in this work?

We do not have any use cases with a strictly fixed problem size. Thus, we considered it more important to present weak scaling.

> How good is single-node performance?

The single-node performance is very high thanks to the implementation using ParallelStencil.jl.

georgebisbas commented 4 months ago

> The single-node performance is very high thanks to the implementation using ParallelStencil.jl.

Are there any performance numbers we could look at, like Gpts/s or GCells/s, or any roofline model?

fcdimitr commented 3 months ago

@editorialbot generate pdf

editorialbot commented 3 months ago

:warning: An error happened when generating the pdf.

fcdimitr commented 3 months ago

The error might be due to missing DOIs. @lucaferranti, is my understanding correct?

@omlins could you add the missing DOIs and try to generate the PDF again?

Thank you!

omlins commented 3 months ago

@fcdimitr: please excuse the delayed reply. We hope to finalize this extended abstract within this week and the next.

I have updated the bibliography file; I think the PDF generation should now succeed.

omlins commented 3 months ago

@mloubout :

> Staggered grids can lead to asymmetrical halos (potentially send-only/receive-only); a small explanation of how this is handled would be useful.

Thank you for pointing this out. In our approach, a field will only have halos in a given dimension if the corresponding overlap between the local fields is at least two cells wide; no halos are created if the overlap is only one cell wide (redundant computation is done instead), which avoids the need to deal with asymmetrical halos. We will see how to add this to the document.
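A brief sketch of how this surfaces in the API (the overlaps keyword of init_global_grid is taken from the package documentation; the particular values and sizes here are illustrative):

```julia
using ImplicitGlobalGrid

nx = ny = nz = 64
# With the default overlaps=(2,2,2), every dimension has a halo. An overlap of 1
# in a dimension creates no halo there; the single overlapping cell layer is
# computed redundantly by both neighbors instead.
me, dims = init_global_grid(nx, ny, nz; overlaps=(2, 2, 1))
A = zeros(nx, ny, nz)
update_halo!(A)   # on multiple MPI processes: exchanges halos in x and y only;
                  # z has no halo by construction
finalize_global_grid()
```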

> How does it compare to similar DSLs for distributed stencil/PDE solving, such as Firedrake, FEniCS, or Devito?

Thank you for your comment. We will be happy to compare with related work such as the projects you mention, and with other key work in the vast fields of distributed-memory parallelization and architecture-agnostic programming, as well as, more generally, with important work that shares our objective of contributing to solving the challenge of the three “P”s -- (scalable) Performance, (performance) Portability and Productivity. However, given the large amount of work our contribution relates to, we aim to do so at a later point in a full paper submission. We did not have the space to do so in this very short extended-abstract format of "at most two pages of content including references" (see the author's guide: https://juliacon.github.io/proceedings-guide/author/), which "lays out in a concise fashion the methodology and use cases of the work presented at the conference", as opposed to a full paper submission, which, "compared to an extended abstract, (...) compares the work to other approaches taken in the field and gives some additional insights on the conference contribution".

fcdimitr commented 3 months ago

@editorialbot generate pdf

editorialbot commented 3 months ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

fcdimitr commented 4 weeks ago

Hi @omlins! I'm just checking in; what is the status of this? Thank you!