davydden closed this issue 5 years ago
* suggests starting with profiling and bottleneck analysis (for Trilinos?!), and only when bandwidth is identified as a bottleneck do we consider algorithms that trade bandwidth for flops (MF).
I guess we are fine here as well with the LIKWID results.
* argue that we do NOT intend to study scalability (i.e. focus on node-level performance)? Or add a few extra MF examples (larger/smaller) to study caching effects?
I guess we addressed that as well.
agreed. I have LIKWID results for separate stages of the operator, will plot them today and update the manuscript if they are decent...
Martin thinks that the content of the article is sufficient and interesting. On the other hand, he is missing the golden thread a little bit. We should explain our expectations a little earlier; it might be a good idea to have Figures 6 and 7 earlier as well. He suggests explaining the expectations for the three caching strategies in some more detail, i.e. count the number of elements actually stored and loaded and the number of operations per quadrature point in the inner loop (maybe also include some of the back-of-the-envelope calculations in https://github.com/CEED/Forum/issues/1).

We should also try to explain the roofline model a bit more. We might, for example, say we have indirect memory access and might also be core-bound. He agrees that running without vectorization might be interesting as well, to address one of the reviewers' questions.

Martin was quite surprised about the question regarding memory latency. He says this does not play any role since the data can be loaded perfectly; we are really memory-bandwidth-limited. Furthermore, he says that the expectation that lower polynomial degrees are better for the processor really comes from a matrix-based view. In the end, we should have sufficient data and references to clarify that high polynomial degrees are expected to give more flops.
Maybe also provide raw numbers for Figures 6 and 7. In the end, the scaling of the y-axes makes distinguishing the matrix-free results very difficult.
We also have strong support here that node-level performance is crucial. Scaling with respect to MPI is then less of an issue (and less interesting).
I guess we can address all this within the next week(s).
I know that I have not been contributing much, but after my vacation my time has been almost completely consumed by other work mandated by our boss. I've been promising @davydden that I'd get onto this soon, so you all can expect me to start reading through these changes this week.
Thanks very much to both of you for the continued and very dedicated work :-)
Course of action & distribution of work (goal deadline 1 Feb 2019)

`C:\sigma` using LIKWID on a stand-alone benchmark (@davydden) #59

Rebuttal process

resubmitted version 3602de5
Reviewer 1:

[x] `time/DoF`? @masterleinad EDIT: references added in #61

Reviewer 2:
Other ideas:
[x] rework "We are interested in the following metrics" in numerical results. This shall be adjusted once we have Roofline data. @davydden #53
[x] remove MPI at all? Nah, let's keep it.
[x] disable SIMD to see its effect? Would just link to Martin's work. @davydden #53
[x] note that we get good performance of the preconditioner (linear algebra) @davydden #53
[x] state that we use matrix-free facilities of deal.II @davydden #53
[x] plot/reference the number of (analytical?) operations for the cell-level operation (Laplace), matrix-free vs matrix-based. Link to Martin's work. @davydden #53
[x] also get LIKWID data for @masterleinad's cluster (who has slower memory access? should see that as a larger gap between MF and MB). EDIT: it should be enough to have one.
[x] we could do a breakdown of the total computation time of `vmult` (see 5.1.2 in Kronbichler 2012) into (i) vector read and write, (ii) computation of cell gradients and contractions, (iii) quadrature loop. I would say we do this for the `tensor4` strategy only. @davydden But I am more worried about our current Roofline results...
[x] along the same lines, we could also add extra LIKWID markers for those steps (similar to page 17 in https://github.com/davydden/large-strain-matrix-free/files/2746959/presentation_martin.pdf). @masterleinad what do you think? EDIT: I am on it, will add to #53
[x] need to decide whether we measure the breakdown on the same mesh or a smaller one; no advice from the LIKWID Google forum yet. EDIT: in two weeks no one bothered to reply, fuck it, go with the small one as we already do!
[x] LIKWID: report `Runtime (RDTSC)` or `Runtime unhalted` as walltime? @davydden EDIT: clarified with the LIKWID guys (G. Hager) that `Runtime (RDTSC)` is the one which should represent wallclock time. Also see https://github.com/davydden/large-strain-matrix-free/pull/57#issuecomment-459382597
[x] we should adjust the code to output not only memory for the cache (current state) but also memory in the `MatrixFree` class, which should also store inverses of the Jacobians. @davydden #54 EDIT: do not measure diagonals; Martin has in his slides "Memory per degree of freedom ≈ main memory transfer"
[x] LIKWID measurements for 3D @davydden #53
[x] LIKWID measurements without vectorization (via `DEAL_II_COMPILER_VECTORIZATION_LEVEL`, omitting `-march=native`, or modifying `../cmake/checks/check_01_cpu_features.cmake`).
[x] LIKWID: report how peak performance is obtained (with FMA?); additionally add roofs like `without vectorization` and `without FMA`? @davydden #53
[x] @masterleinad will meet Martin in person on 20.01. We should have a first draft with LIKWID results ready by then so that @masterleinad can show Martin the first reviewer's letter and our modifications and ask for advice. @davydden #53
[ ] Wolfgang Wall (TUM), Karl Ljungkvist, Dr. Georg Hager (FAU, HPC) as newly suggested referees for IJNME? Most likely the HPC reviewer was Prof. Ruede (FAU) 😞
[ ] we might express wall times as Martin did: "times in Table 2 are expressed in terms of million degrees of freedom in order to compensate for different problem sizes for different polynomial orders." This might be a good idea as sometimes we have a factor of `x4` difference in DoFs.
[x] mention xSDK (kind of evidence that `dealii` is performance oriented) and the in-preparation paper of Martin on the comparison of matrix-free frameworks in open-source software.
[x] mention that we're using an existing MatrixFree framework. EDIT: mentioned in #61
[x] List the feature set that the MF framework already incorporates EDIT: done in #53
[x] Explain that this paper explores the idea of doing solid mechanics with MF, that it's not completely optimised, and that there is still more work to do. (But we also show that even in this unoptimised form it is competitive against [and even more performant than...] MB + AMG. So here we take the first exploratory steps and identify bottlenecks and possible areas in which further research should be conducted. "This is how you could do it, and it looks promising.")
[ ] mention that we can improve the number of iterations by choosing a better coarse-grid solver (which is too expensive). Mention/reinforce that how the coarse-level preconditioner is set up will greatly affect the results of the solver.