Closed: ranocha closed this 5 years ago
I've implemented possibility 3 from the PR description here. It seems to work fine, and I get some speedups, depending on the hardware and the problem at hand:
| Testcase | CPU | GPU |
|---|---|---|
| `induction_equation.m` | 50 % | 15 % |
| `ideal_MHD.m` | 3 % | 20 % |
From my point of view, this PR is finished and can be merged; it closes #6.
I've rebased on master to fix the merge conflicts.
I've rebased on master to fix merge conflicts and adapted the `ideal_gas_Euler` parts. Here are the new speedups (EDIT: with the new commit mentioned below):
| Testcase | CPU | GPU |
|---|---|---|
| `linear_constant_advection.m` | 0 % | 0 % |
| `linear_variable_advection.m` | 40 % | 20 % |
| `induction_equation.m` | 50 % | 15 % |
| `ideal_gas_Euler.m`, `USE_FLUX_KennedyGruber` | 30 % | 30 % |
| `ideal_gas_Euler.m`, `USE_FLUX_Chandrashekar` | 2 % | 15 % |
| `ideal_MHD.m` | 3 % | 20 % |
I've encountered a strange problem that I don't understand currently: Running `ideal_gas_Euler.m` with `USE_FLUX_Chandrashekar` and `USE_ARRAY_OF_STRUCTURES` is okay and takes 76 seconds on my GPU. Using the CPU, it takes 460 seconds and the computation blows up (`NaN`). The same problem occurs for `USE_STRUCTURE_OF_ARRAYS`.
@Kostaszki: Could you please have a general look at this PR? Do you have the same problem with `USE_FLUX_Chandrashekar`? Maybe we might ignore this problem at first, merge this PR, and investigate it later?
I've fixed the failure with `USE_FLUX_Chandrashekar` on my Intel CPU/GPU. The fix also reduced the runtime on the Nvidia GPU. The basic difference is that
REAL F = (u < (REAL)(1.0e-2)) * (1 + u * ((REAL)(1.0/3.0) + u * ((REAL)(1.0/5.0) + u * (REAL)(1.0/7.0))))
+ (u >= (REAL)(1.0e-2)) * (log(zeta) / (2*f));
is replaced with
REAL F = (u < (REAL)(1.0e-2)) ? (1 + u * ((REAL)(1.0/3.0) + u * ((REAL)(1.0/5.0) + u * (REAL)(1.0/7.0))))
: (log(zeta) / (2*f));
Works fine in general. Tested on GTX 1060, Ryzen 5 and Ryzen 7. AoS is consistently slower.
There is a small bug in `initialize.m`: `num_nodes` needs to be cast to `double`: `group_size = 2^floor(log(double(num_nodes)) / log(2));`
Additionally, numerical results for AoS and SoA seem to be inconsistent for smaller `N`.
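The `group_size` expression above picks the largest power of two that does not exceed `num_nodes`. As a cross-check that avoids floating-point logarithms altogether, the same value can be sketched in C with integer arithmetic. `largest_pow2_leq` is a hypothetical name, not a function from the repository.

```c
#include <stddef.h>

/* Largest power of two that is <= n, for n >= 1.
 * Hypothetical helper; returns 0 for n == 0 as a sentinel. */
size_t largest_pow2_leq(size_t n)
{
    if (n == 0) {
        return 0;
    }
    size_t p = 1;
    /* Double p while the doubled value still does not exceed n.
     * Comparing against n / 2 avoids overflow of p * 2. */
    while (p <= n / 2) {
        p *= 2;
    }
    return p;
}
```

For example, `largest_pow2_leq(1000)` gives 512 and `largest_pow2_leq(1024)` gives 1024, matching `2^floor(log2(n))` for those inputs.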
Thanks, I've fixed `initialize.m`.
The plot problems for small `N` should be fixed now. Thanks for testing, @philipheinisch!
Do you approve these changes, @philipheinisch? Can we (squash-) merge this PR?
This is work in progress and should not be merged now!

We want to compare "array of structures" and "structure of arrays", cf. #6. The new possibility is implemented using a `#define` and `I_Tech('memory_layout') = 'USE_STRUCTURE_OF_ARRAYS'` in MATLAB.

Up to now, we had an array of structures. The other possibility is implemented additionally and everything is encapsulated via `get_field`, `set_field` etc. Most computations on my hardware/software are fine; the only exception is the computation of norms if `num_nodes != num_nodes_pad`. Otherwise, the new memory layout causes additional errors.

Do you have any suggestions how to correct the computation of the norms, @philipheinisch @Kostaszki?
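To make the two layouts concrete, here is a minimal C sketch of how one accessor pair can hide the index arithmetic. The names `get_field`, `set_field`, and `USE_STRUCTURE_OF_ARRAYS` come from the description above, but the signatures, the `NUM_FIELDS` and `NUM_NODES_PAD_EXAMPLE` constants, and the index formulas are assumptions for illustration and may differ from the actual OpenCL code.

```c
#include <stddef.h>

#define USE_STRUCTURE_OF_ARRAYS   /* comment out to get array of structures */

#define NUM_FIELDS 3              /* assumed number of variables per node */
#define NUM_NODES_PAD_EXAMPLE 8   /* assumed padded node count */

/* Flat index of variable `field` at node `node` for the chosen layout. */
size_t field_index(size_t node, size_t field)
{
#ifdef USE_STRUCTURE_OF_ARRAYS
    /* SoA: all nodes of one variable are contiguous. */
    return field * NUM_NODES_PAD_EXAMPLE + node;
#else
    /* AoS: all variables of one node are contiguous. */
    return node * NUM_FIELDS + field;
#endif
}

/* Read one variable at one node, independent of the layout. */
double get_field(const double *u, size_t node, size_t field)
{
    return u[field_index(node, field)];
}

/* Write one variable at one node, independent of the layout. */
void set_field(double *u, size_t node, size_t field, double value)
{
    u[field_index(node, field)] = value;
}
```

Code written only against `get_field` and `set_field` stays unchanged when the `#define` is toggled; only `field_index` knows which layout is active.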
Here are some possibilities:

1. [...] `num_nodes_pad` and use only `num_nodes == NODES_X * NODES_Y * NODES_Z`.
2. [...] `norm2`, `norm_infty` etc.
3. [...] `NUM_NODES_PAD` to the OpenCL part.

Possibilities 1 and 2 are similar and could impact the performance. Nevertheless, we don't compute norms etc. in performance-critical parts up to now. I would expect that possibility 3 allows the best performance but increases the code complexity a bit (a new constant has to be defined).
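Whichever possibility is chosen, the underlying point is that a reduction must visit only the `num_nodes` real entries and skip the padding. Here is a minimal C sketch, assuming a structure-of-arrays buffer whose padded stride is passed explicitly. `norm2_squared_field`, `norm_infty_field`, and their parameters are hypothetical and are not the repository's `norm2` or `norm_infty`.

```c
#include <stddef.h>

/* Sum of squares of one variable stored in SoA layout; the discrete
 * 2-norm is the square root of this value.
 * `stride` is the padded node count; only the first `num_nodes`
 * entries of each variable are real data, the rest is padding. */
double norm2_squared_field(const double *u, size_t field,
                           size_t num_nodes, size_t stride)
{
    double sum = 0.0;
    for (size_t i = 0; i < num_nodes; ++i) {
        double v = u[field * stride + i];
        sum += v * v;
    }
    return sum;
}

/* Maximum absolute value of one variable, again skipping the padding. */
double norm_infty_field(const double *u, size_t field,
                        size_t num_nodes, size_t stride)
{
    double max_abs = 0.0;
    for (size_t i = 0; i < num_nodes; ++i) {
        double v = u[field * stride + i];
        double a = (v < 0.0) ? -v : v;
        if (a > max_abs) {
            max_abs = a;
        }
    }
    return max_abs;
}
```

Passing the stride separately from `num_nodes` keeps the loop bound and the memory layout independent, so padding values never enter the result regardless of what they contain.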