davydden closed this issue 5 years ago
* suggests starting with profiling and bottleneck analysis (for Trilinos?!), and only when bandwidth is identified as a bottleneck do we consider algorithms that trade bandwidth for flops (MF).
I guess we are fine here as well with the LIKWID results.
* argue that we do NOT intend to study scalability (i.e. focus on node-level performance)? Or add a few extra MF examples (larger/smaller) to study caching effects?
I guess we addressed that as well.
agreed. I have LIKWID results for separate stages of the operator, will plot them today and update the manuscript if they are decent...
Martin thinks that the content of the article is sufficient and interesting. On the other hand, he is missing the golden thread a little bit. We should explain our expectations a little earlier; it might be a good idea to have Figures 6 and 7 earlier as well. He suggests explaining the expectations for the three caching strategies in some more detail, i.e. count the number of elements actually stored and loaded and the number of operations per quadrature point in the inner loop (maybe also include some of the back-of-the-envelope calculations in https://github.com/CEED/Forum/issues/1).

We should also try to explain the roofline model a bit more. We might, for example, say we have indirect memory access and might also be core-bound. He agrees that running without vectorization might be interesting as well, to address one of the reviewers' questions.

Martin was quite surprised about the question regarding memory latency. He says this does not play any role since the data can be loaded perfectly; we are really memory-bandwidth-limited. Furthermore, he says that the expectation that lower polynomial degrees are better for the processor really comes from a matrix-based view. In the end, we should have sufficient data and references to clarify that high polynomial degrees are expected to give more flops.
Maybe also provide raw numbers for Figures 6 and 7. In the end, the scaling of the y-axes makes distinguishing the matrix-free results very difficult.
We also have strong support here that node-level performance is crucial. Scaling with respect to MPI is then less of an issue (and less interesting).
I guess we can address all this within the next week(s).
I know that I have not been contributing much, but after my vacation my time has been almost completely consumed by other work mandated by our boss. I've been promising @davydden that I'd get onto this soon, so you all can expect me to start reading through these changes this week.
Thanks very much to both of you for the continued and very dedicated work :-)
Course of action & distribution of work (goal deadline 1 Feb 2019)

`C:\sigma` using LIKWID on a stand-alone benchmark (@davydden) #59

Rebuttal process

resubmitted version 3602de5
Reviewer 1:

[x] `time/DoF`? @masterleinad EDIT: references added in #61

Reviewer 2:
Other ideas:
[x] rework "We are interested in the following metrics" in numerical results. This shall be adjusted once we have Roofline data. @davydden #53
[x] remove MPI at all? Nah, let's keep it.
[x] disable SIMD to see its effect? Would just link to Martin's work. @davydden #53
[x] note that we get good performance of the preconditioner (linear algebra) @davydden #53
[x] state that we use matrix-free facilities of deal.II @davydden #53
[x] plot/reference the number of (analytical?) operations for the cell-level operation (Laplace), matrix-free vs matrix-based. Link to Martin's work. @davydden #53
[x] also get LIKWID data for @masterleinad's cluster (who has slower memory access? should see that as a larger gap between MF and MB). EDIT: it should be enough to have one.
[x] we could do a breakdown of the total computation time of `vmult` (see 5.1.2 in Kronbichler 2012) into (i) vector read and write, (ii) computation of cell gradients and contractions, (iii) quadrature loop. I would say we do this for the `tensor4` strategy only. @davydden But I am more worried about our current Roofline results...
[x] along the same lines, we could also add extra LIKWID markers for those steps (similar to page 17 in https://github.com/davydden/large-strain-matrix-free/files/2746959/presentation_martin.pdf). @masterleinad what do you think? EDIT: I am on it, will add to #53
[x] need to decide whether we measure the breakdown on the same mesh or a smaller one; no advice from the LIKWID Google forum yet. EDIT: in two weeks no one bothered to reply, fuck it, go with the small one as we already do!
[x] LIKWID: report `Runtime (RDTSC)` or `Runtime unhalted` as walltime? @davydden EDIT: clarified with the LIKWID guys (G. Hager) that `Runtime (RDTSC)` is the one which should represent wallclock time. Also see https://github.com/davydden/large-strain-matrix-free/pull/57#issuecomment-459382597
[x] we should adjust the code to output not only memory for the cache (current state) but also memory in the `MatrixFree` class, which should also store inverses of the Jacobians. @davydden #54 EDIT: do not measure diagonals; Martin has in his slides "Memory per degree of freedom ≈ main memory transfer"
[x] LIKWID measurements for 3D @davydden #53
[x] LIKWID measurements without vectorization (via `DEAL_II_COMPILER_VECTORIZATION_LEVEL`, omitting `-march=native`, or modifying `../cmake/checks/check_01_cpu_features.cmake`).
[x] LIKWID: report how peak performance is obtained (with FMA?); additionally add roofs like `without vectorization` and `without FMA`? @davydden #53
[x] @masterleinad will meet Martin in person on 20.01. We should have a first draft with LIKWID results ready by then so that @masterleinad can show Martin the first reviewer's letter and our modifications and ask for advice. @davydden #53
[ ] Wolfgang Wall (TUM), Karl Ljungkvist, Dr. Georg Hager (FAU, HPC) as newly suggested referees for IJNME? Most likely the HPC reviewer was Prof. Ruede (FAU) 😞
[ ] we might express wall times as Martin did: "times in Table 2 are expressed in terms of million degrees of freedom in order to compensate for different problem sizes for different polynomial orders." This might be a good idea as sometimes we have a factor of `x4` difference in DoFs.
[x] mention xSDK (kind of evidence that `dealii` is performance oriented) and the in-preparation paper of Martin on the comparison of matrix-free frameworks in open-source software.
[x] mention that we're using an existing MatrixFree framework. EDIT: mentioned in #61
[x] List the feature set that the MF framework already incorporates EDIT: done in #53
[x] Explain that this paper explores the idea of doing solid mechanics with MF, that it's not completely optimised, and that there is still more work to do. (But we also show that even in this unoptimised form it is competitive against [and even more performant than...] MB + AMG. So here we take the first exploratory steps and identify bottlenecks and possible areas in which further research should be conducted. "This is how you could do it, and it looks promising.")
[ ] mention that we can improve the number of iterations by choosing a better coarse-grid solver (which is too expensive). Mention/reinforce that how the coarse-level preconditioner is set up will greatly affect the results of the solver.