annacfs closed this issue 2 years ago.
Hi,
I've tried your codes, and the Chapel code (compiled with chpl 1.23) was more than 9x slower than the Fortran code (compiled with gfortran -O3). Here, I did not use -Ofast because it might give erroneous results, although it gives an additional 2x speedup. (Btw, I confirmed that the programs ran serially on only 1 core in both Chapel and Fortran.)
I tried several modifications to the Chapel code, and the following changes seem to give the most significant speedups:

- norm(x,p) seems rather slow, so I replaced it with sqrt(+ reduce (x ** 2)).
- I replaced pnew = pnew - pnew[2,2] with var tmp = pnew[2,2]; pnew -= tmp;
- I converted sor_method() and calc_erro() to "subroutines" (in the Fortran sense), so that they modify the dummy array argument f or errol directly, rather than declaring new local arrays inside the function and returning them to the caller every time (which is similar to allocating allocatable arrays in Fortran and returning them from functions).

With these modifications the timing dropped from 8.3 sec to ~3 sec. Furthermore, I replaced all assignments of array sections like unew[3..Nx-1,1] = - unew[3..Nx-1,2]; with explicit for loops, so as to eliminate possible overhead from the underlying parallel mechanism (used inside a serial block), which reduced the time from 3 to 1.2 sec. This last result is close to the timing with gfortran -O3 (0.9 sec). I couldn't make it any faster than that, so I gave up there :)

I think this last point (array sections) is very tricky for getting good performance... I wish there were some syntax like unew[3 : Nx-1,1] = - unew[3 : Nx-1,2]; that is guaranteed to be translated internally (by the compiler) to an explicit serial for loop like for i in 3..Nx-1 { unew[i, 1] = - unew[i, 2]; }, although the use of the colon may be problematic. I think this would be particularly useful for manipulating array sections of short length (like 1..3).
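The slice-to-loop rewrite described above can be sketched as follows (a minimal sketch; the array name and bounds are borrowed from the snippet above, and Nx is assumed to be a config const):

```chapel
config const Nx = 41;
var unew : [1..Nx, 1..4] real;

// Promoted slice assignment: concise, but it goes through the parallel
// iteration machinery even when run inside a `serial` block.
unew[3..Nx-1, 1] = -unew[3..Nx-1, 2];

// Explicit serial loop: what the poster found noticeably faster for
// short array sections.
for i in 3..Nx-1 do
  unew[i, 1] = -unew[i, 2];
```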
In my experience, I could not better Fortran by much. Sometimes I was 10% slower and sometimes I was 10% faster in serial mode.
I suggest you send an email to the folks at Polytechnique Montréal, as they have lots of experience using Chapel for CFD.
In Chapel, you use double dots (..), not a colon, for array slices.
I would look at seriously trying to simplify that algebra.
Given my experience with Chapel's vectorization issues, I would try what follows below. As I do not know the exact state of play regarding Chapel's latest vectorization enhancements, @mppf might suggest something better. The folks in Montreal may have yet another perspective.
const dxq = dx * 4.0;
const dyq = dy * 4.0;
for i in 2..Nx-1 do
{
const ref uni = un[i, ..];
const ref unip1 = un[i+1, ..];
const ref unip2 = un[i+2, ..];
const ref vnim1 = vn[i-1, ..];
const ref vni = vn[i, ..];
const ref vnip1 = vn[i+1, ..];
ref vxip1 = VISCx[i+1, ..];
ref vyi = VISCy[i, ..];
ref cxip1 = CONVx[i+1, ..];
ref cyi = CONVy[i, ..];
ref usip1 = ustar[i+1, ..];
ref vsi = vstar[i, ..];
for j in 2..Ny-1 do
{
// the following might be suboptimal under good vectorization
const unip1j = unip1[j];
const vnijp1 = vni[j+1];
const unip1jX2 = unip1j + unip1j;
const vnijp1X2 = vnijp1 + vnijp1;
const uvj = (unip1[j+1] + unip1j) * (vnip1[j+1] + vnijp1);
const vvj = vni[j] + vnijp1;
const uuj = uni[j] + unip1j;
// these next 2 statements could be numerically suspect - check!!!
const vip1j = (unip1[j+1] - unip1jX2 + unip1[j-1]) / dy2
+ (unip2[j] - unip1jX2 + uni[j]) / dx2;
const vijp1 = (vni[j+2] - vnijp1X2 + vni[j]) / dy2
            + (vnip1[j+1] - vnijp1X2 + vnim1[j+1]) / dx2;
const cip1j = (uvj - (unip1[j-1] + unip1j) * (vnip1[j] + vni[j])) / dyq
+ ((unip2[j] + unip1j) * (unip2[j] + unip1j) - uuj * uuj) / dxq;
const cijp1 = ((vni[j+2] + vnijp1)*(vni[j+2] + vnijp1) - vvj*vvj) / dyq
+ (uvj - (uni[j] + uni[j+1]) * (vnijp1 + vnim1[j+1])) / dxq;
vxip1[j] = vip1j;
vyi[j+1] = vijp1;
cxip1[j] = cip1j;
cyi[j+1] = cijp1;
usip1[j] = dt * (vkinem * vip1j - cip1j) + unip1j;
vsi[j+1] = dt * (vkinem * vijp1 - cijp1) + vnijp1;
}
}
I wrote these on a Saturday morning. They could quite possibly be wrong algebraically, so please check that I got them right!!
My quick fixes are wrong algebraically, although I think only in cip1j and cijp1. But you get the drift. Do not rely on common sub-expression optimization being done by the compiler, because it will not fix the numerical errors that exist in that code.
That said, you need to simplify the algebra yourself. Chapel's optimizer is a little more susceptible to bad algebra than Fortran's, but then again, it seems to work better than Fortran's when given clean algebra.
My advisor asked me to test Chapel and Fortran in serial, because codes in fluid mechanics are programmed in Fortran or C. The idea is to test whether Chapel could replace Fortran + MPI. That's why serial was tested first.
Testing the serial performance first is a very sound approach. It allows you to get the details of your algorithm correct without parallelism potentially obscuring issues.
There are several common subexpressions in your formulae which are not obvious until you factor some of the expressions. That factoring also highlights where you have some numerical issues in those formulae.
On the whole question of Chapel being able to replace Fortran and C/C++ for computational fluid mechanics, that is largely proven if you look at some of the papers by Eric Laurendeau, Matthieu Parenteau and others at Polytechnique Montréal. Their research shows that they can halve the number of lines of code needed to express a given algorithm, sometimes better than halve it. And the parallelism of the code is clearer, cleaner and easier to write than in C or Fortran. For more words of wisdom, I suggest you have a look at their excellent papers, some of which are on the Chapel web-site, including the one from CHIUW 2021. You could even contact them; they could help you avoid reinventing the wheel.

I am fairly sure that their CFD code is also finite volume. Their Chapel experience reflects mine, the difference being that theirs is more extensive and mine is focused on finite elements for structural analysis and fluid-structure interaction.

Their papers currently focus on the CFD aspects, which obviously is what gives their work the CFD foundation and credibility they desire. One day, they might write a paper dealing with the programming-philosophy aspects of how they attack the various underlying algorithms, an aspect in which the journals where they publish are probably less interested, even if that aspect of their work is of great(er) interest to the Chapel community. They might be presenting at CHIUW 2022.
As I mentioned, the ongoing improvements to Chapel's vectorization of code might help in your own work. But I will leave it to @mppf to elaborate on that or provide advice thereon. Maybe we might find out more about the state of play on vectorization within Chapel at CHIUW 2022.
Some background would be good:
Regarding vectorization, we haven't got everything nailed down in the compiler, but the language design is that you should use foreach on a loop that you know is order-independent (which I would imagine matches a loop that you want to have vectorized). However, at a quick glance it looks to me like both of the for loops in the OP might have loop-carried dependencies through arrays, so they shouldn't be written with foreach.
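A minimal sketch of that guidance, using hypothetical arrays: foreach is appropriate only when no iteration depends on another.

```chapel
config const n = 100;
var a, b : [1..n] real;

// Order-independent: iteration i touches only index i, so `foreach`
// (and hence vectorization) is legal.
foreach i in 1..n do
  a[i] = 2.0 * b[i];

// Loop-carried dependency: iteration i reads a[i-1], written by the
// previous iteration, so this must stay a plain `for`.
for i in 2..n do
  a[i] = a[i-1] + b[i];
```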
I could be wrong, but I think the inner loop can be written with a foreach. I rewrote all the operations seen in the inner loop as vector operations. There are 23 const vectors and 6 var vectors, the latter being entire rows of the VISC[xy], CONV[xy] and [uv]star data. Non-trivial. Once the serial evaluation is complete, the outer loop can be converted to a forall.
- Is the Fortran code to which you are comparing pre-existing, mature, optimized and/or vectorized code? = I implemented the code with my advisor, to test the numerical method. I don't know if this is the best version, but it has been checked and converges.
- Is this for a thesis or some other project? = A conference in Brazil. We are "testing" Chapel. My advisor loves Fortran.
- What sized problems are you attacking? = High Re, turbulent flow.
- Other relevant information? = Solving the 3D incompressible Navier-Stokes equations.
- It is probably a good idea to stick to English. = I will provide it.
Hi, did you work with compressible flow or did you implement some version of the code in serial?
I was just answering questions @damianmoz was asking about vectorization. I was noting what appear to be loop carried dependencies in the Chapel code in the top of this issue -- https://github.com/chapel-lang/chapel/issues/19636#issue-1205730418
The code, as provided, is incomplete. The time-integration computation is missing, and any update of un and vn over time is not visible.
Do you have the fully running program? The code segment looks 2D, but you say your problem is 3D.
When I asked what size problems you are addressing, I meant the number of finite-volume cells. What are the values of Nx and Ny in practice? Surely a 20x20-cell mesh will not model much?
Sorry!!! First I implemented 2D.
I can't upload the file in .chpl format; see the .txt format.
Thanks. 2D makes more sense.
Sorry. I forgot to look at the original .txt attachments. It makes more sense now.
By the way, did you make the changes suggested by @ty1027? As far as I can tell, those changes should have given you performance almost the same as Fortran's. Or did I misread that reply?
Which of the various passes consumes the most time? My initial assumption was that it was Pass 1, if only because that was the text you explicitly included in the issue.
Thanks for your input @mppf
If I define the following generic functions
inline proc fv(r, m, l, t, ij, p, q)
{
return (((r - m) + (l - m)) * p) + ((t - m) + (ij - m)) * q;
}
inline proc fc(z, c, m, l, t, ij, p, q)
{
return (((t - ij) * ((t + ij) + (m + m))) * p) + (c - (l + m) * z) * q;
}
I believe that I could rewrite Pass 1 using
const rqdx = quarter / dx, rdx2 = one / (dx * dx);
const rqdy = quarter / dy, rdy2 = one / (dy * dy);
for i in 2..Nx-1 do
{
const ref uni = un[i, ..];
const ref unip1 = un[i+1, ..];
const ref unip2 = un[i+2, ..];
const ref vnim1 = vn[i-1, ..];
const ref vni = vn[i, ..];
const ref vnip1 = vn[i+1, ..];
ref vxip1 = VISCx[i+1, ..];
ref vyi = VISCy[i, ..];
ref cxip1 = CONVx[i+1, ..];
ref cyi = CONVy[i, ..];
ref usip1 = ustar[i+1, ..];
ref vsi = vstar[i, ..];
foreach j in 2..Ny-1 do
{
const jp1 = j + 1, jm1 = j - 1, jp2 = j+2;
const ur = unip1[jp1], um = unip1[j], ul = unip1[jm1];
const vr = vnip1[jp1], vm = vni[jp1], vl = vnim1[jp1];
const uij = uni[j], ui2j = unip2[j], uijp = uni[jp1] + uij;
const vij = vni[j], vij2 = vni[jp2], vijp = vnip1[j] + vij;
const vx = fv(ur, um, ul, ui2j, uij, rdy2, rdx2);
const vy = fv(vr, vm, vl, vij2, vij, rdx2, rdy2);
const rm = (ur + um) * (vr + vm);
const cx = fc(vijp, rm, um, ul, ui2j, uij, rqdx, rqdy);
const cy = fc(uijp, rm, vm, vl, vij2, vij, rqdy, rqdx);
usip1[j] = um + dt * (z * vx - cx);
vsi[jp1] = vm + dt * (z * vy - cy);
vxip1[j] = vx;
vyi[jp1] = vy;
cxip1[j] = cx;
cyi[jp1] = cy;
}
However, should I now write the inner loop with a foreach, i.e.
foreach j in 2..Ny-1
On the other hand, I could rewrite the whole inner loop as array operations, i.e.
const j = 2..Ny-1, jp1 = 3..Ny, jm1 = 1..Ny-2, jp2 = 4..Ny+1;
const ur = unip1[jp1], um = unip1[j], ul = unip1[jm1];
const vr = vnip1[jp1], vm = vni[jp1], vl = vnim1[jp1];
const uij = uni[j], ui2j = unip2[j], uijp = uni[jp1] + uij;
const vij = vni[j], vij2 = vni[jp2], vijp = vnip1[j] + vij;
const vx = fv(ur, um, ul, ui2j, uij, rdy2, rdx2);
const vy = fv(vr, vm, vl, vij2, vij, rdx2, rdy2);
const rm = (ur + um) * (vr + vm);
const cx = fc(vijp, rm, um, ul, ui2j, uij, rqdx, rqdy);
const cy = fc(uijp, rm, vm, vl, vij2, vij, rqdy, rqdx);
usip1[j] = um + dt * (z * vx - cx);
vsi[jp1] = vm + dt * (z * vy - cy);
vxip1[j] = vx;
vyi[jp1] = vy;
cxip1[j] = cx;
cyi[jp1] = cy;
Which is better, whatever that word means? What is likely to be faster, if not today, at least in the future?
I looked at the OP and was able to bring the time from 2.78 down to 1.05 seconds. Some problems are tricky; some are more obvious. I don't know Fortran and haven't tried it myself. Your execution times in the OP are also super fast, and I would try to avoid making comparisons at the microsecond scale.
The "final" code that has some extra timers for profiling is here.
Quick summary:
version | seconds |
---|---|
base | 2.78 |
pass 3: rank change to slice | 2.53 |
pass 3: drop extra assignment | 2.50 |
calc_erro: drop extra init | 2.48 |
pass 3: use -= for promoted expr | 1.76 |
use custom frobeniusNorm | 1.09 |
frobeniusNorm: consume/reset the argument | 1.05 |
diff --git a/cavity.chpl b/cavity.chpl
index 7627f3a..abc1321 100644
--- a/cavity.chpl
+++ b/cavity.chpl
@@ -131,10 +131,10 @@
while (errop >= tol) do{
timers[6].start();
- pnew[1,2..Ny-1] = pnew[2,2..Ny-1];
- pnew[2..Nx-1,1] = pnew[2..Nx-1,2];
- pnew[Nx,2..Ny-1] = pnew[Nx-1,2..Ny-1];
- pnew[2..Nx-1,Ny] = pnew[2..Nx-1,Ny-1];
+ pnew[1..1, 2..Ny-1] = pnew[2..2, 2..Ny-1];
+ pnew[2..Nx-1, 1..1] = pnew[2..Nx-1, 2..2];
+ pnew[Nx..Nx, 2..Ny-1] = pnew[Nx-1..Nx-1, 2..Ny-1];
+ pnew[2..Nx-1, Ny..Ny] = pnew[2..Nx-1,Ny-1..Ny-1];
timers[6].stop();
timers[7].start();
This is a place where we as Chapel could do better. Workaround is relatively easy. This brought execution time to 2.53 seconds.
diff --git a/cavity.chpl b/cavity.chpl
index abc1321..3757e2e 100644
--- a/cavity.chpl
+++ b/cavity.chpl
@@ -138,7 +138,7 @@
timers[6].stop();
timers[7].start();
- pnew = sor_method(pnew, rhs, Nx, Ny, dx2, beta, w); // metodo SOR
+ sor_method(pnew, rhs, Nx, Ny, dx2, beta, w); // metodo SOR
erro = calc_erro(pnew, rhs, Nx, Ny, dx2, dy2); // residuo
timers[8].start();
errop = norm(erro, p); // norma do residuo
sor_method already modifies its argument in place; you don't need the assignment here. We're at 2.50 seconds.
diff --git a/cavity.chpl b/cavity.chpl
index 3757e2e..0ff9aa0 100644
--- a/cavity.chpl
+++ b/cavity.chpl
@@ -286,7 +286,7 @@ pp.close();
proc calc_erro(f, tf, imax: int, jmax: int, dx2: real, dy2: real){
var errol : [1..Nx,1..Ny] real;
- errol = 0.0;
+ /*errol = 0.0;*/
for i in 2..imax-1 do{
for j in 2..jmax-1 do{
You don't need that assignment here; this is in the critical path with a lot of repetition. Arrays inherit their element type's default value, here 0.0. 2.48 seconds.
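As a small illustration of that point (hypothetical sizes): a freshly declared Chapel array of real already holds 0.0 everywhere, so an explicit zeroing pass is pure overhead.

```chapel
var errol : [1..4, 1..4] real;  // elements are default-initialized to 0.0
writeln(errol[1, 1]);           // prints 0.0 -- no `errol = 0.0;` needed
```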
diff --git a/cavity.chpl b/cavity.chpl
index 0ff9aa0..af5fb45 100644
--- a/cavity.chpl
+++ b/cavity.chpl
@@ -144,7 +144,7 @@
errop = norm(erro, p); // norma do residuo
timers[8].stop();
itera = itera + 1;
- pnew = pnew - pnew[2,2]; // normalização da pressao
+ pnew -= pnew[2,2]; // normalização da pressao
timers[7].stop();
//writef('%i %6.12dr\n', itera, errop);
This one's critical. The old code was doing a full-array subtraction, then a full-array assignment. Arguably, we can do better here, but for now you should use -= because it is a single operation. I can argue that it is better style, anyway. This was a big one; we are at 1.76 seconds.
Edit: My theory above was wrong. There won't be 2 full-array ops, but just one promotion. But that promotion will use 2D zippering and will suffer from https://github.com/chapel-lang/chapel/issues/13147. See my comment below.
diff --git a/cavity.chpl b/cavity.chpl
index af5fb45..230e8e1 100644
--- a/cavity.chpl
+++ b/cavity.chpl
@@ -141,7 +141,8 @@
sor_method(pnew, rhs, Nx, Ny, dx2, beta, w); // metodo SOR
erro = calc_erro(pnew, rhs, Nx, Ny, dx2, dy2); // residuo
timers[8].start();
- errop = norm(erro, p); // norma do residuo
+ /*errop = norm(erro, p); // norma do residuo*/
+ errop = frobeniusNorm(erro); // norma do residuo
timers[8].stop();
itera = itera + 1;
pnew -= pnew[2,2]; // normalização da pressao
@@ -152,6 +153,13 @@
}
timers[2].stop();
+
+ proc frobeniusNorm(arr) {
+ var sum = 0.0;
+ for a in arr do
+ sum += abs(a)*abs(a);
+ return sqrt(sum);
+ }
This adds a straightforward Frobenius norm without any parallelism and uses it instead of LinearAlgebra's norm. Things could get better here; we can improve how the bodies of serial statements are implemented. I see serial blocks maybe twice a year, so it'd be difficult to prioritize that, especially because it is probably not very easy to do. Anyway, with this function we are at 1.09 seconds.
I am getting into the weeds of algorithmic improvements here.
diff --git a/cavity.chpl b/cavity.chpl
index 6dc5b25..01f6cdf 100644
--- a/cavity.chpl
+++ b/cavity.chpl
@@ -140,11 +140,11 @@
timers[7].start();
timers[9].start();
sor_method(pnew, rhs, Nx, Ny, dx2, beta, w); // metodo SOR
- erro = calc_erro(pnew, rhs, Nx, Ny, dx2, dy2); // residuo
+ calc_erro(erro, pnew, rhs, Nx, Ny, dx2, dy2); // residuo
timers[9].stop();
timers[8].start();
/*errop = norm(erro, p); // norma do residuo*/
- errop = frobeniusNorm(erro); // norma do residuo
+ errop = frobeniusNormAndConsume(erro); // norma do residuo
timers[8].stop();
timers[10].start();
itera = itera + 1;
@@ -158,10 +158,12 @@
}
timers[2].stop();
- proc frobeniusNorm(arr) {
+ proc frobeniusNormAndConsume(arr) {
var sum = 0.0;
- for a in arr do
+ for a in arr {
sum += abs(a)*abs(a);
+ a = 0;
+ }
return sqrt(sum);
}
@@ -292,9 +294,9 @@ pp.close();
}
}
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////
- proc calc_erro(f, tf, imax: int, jmax: int, dx2: real, dy2: real){
+ proc calc_erro(errol, f, tf, imax: int, jmax: int, dx2: real, dy2: real){
- var errol : [1..Nx,1..Ny] real;
+ /*var errol : [1..Nx,1..Ny] real; */
/*errol = 0.0;*/
for i in 2..imax-1 do{
@@ -305,7 +307,7 @@ pp.close();
}
}
- return errol;
+ /*return errol;*/
}
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////
You don't need a new array for calc_erro; it can just use/set erro in place. But you then need to reset it back to 0. This patch taps the custom frobeniusNorm to do the resetting for us. Probably not what you want for maintainable code, but the idea is that, on a quick look, the next things to try are such "algorithmic" changes where you reuse existing allocations/values. This brings me down to 1.05 seconds.
Wow @e-kayrakli. I learnt a lot today. I was blown away by the first one. That's a 9% saving. It would be nice if Chapel can handle that better because even my own code is riddled with statements like
[const] ref slice = array[i, j, ...];
Amazed that I have never seen a reference to this before in nearly 10 years.
!!! There is a dire need to highlight to Fortran users that they need to refactor code to use +=, -=, *= and /=.
Should we use a foreach for code like that custom Frobenius norm?
I abhor the concept of using any sort of default initialization except to some undefined value. Does code like
var erro1 : [1..Nx, 1..Ny] real = 0.0;
incur any overhead in Chapel?
Thanks again @e-kayrakli for such fabulous insight.
@annacfs, as I mentioned, there are also many places where you can avoid divisions and multiplications throughout your code.
It would be nice if Chapel can handle that better
I agree. I don't have a good guess as to what the root cause of the difference is. Some missing optimization? An inherent challenge with rank-change views? Something else?
Note that, you can also do away with slices altogether and write the loops yourself. Even with slices, there'll be some overhead there. We have been wanting to cut that overhead for a while by strength-reducing the slice operation there. But it hasn't made to the top of our priority list.
!!! There is a dire need to highlight to Fortran users that they need to refactor code to use +=, -=, *= and /=.
I don't disagree. At times we think about a "Performance Guide in Chapel"; we may include some gotchas like this there. However, this may be due to a missing implementation in our compiler, and I may have mischaracterized the issue in my original comment (will update after posting this). I've tried to reconvince myself and failed.
I think the root cause is that the array in question is two-dimensional. The compiler will turn
A = A + 3; // A is an array
into
forall (x,y) in zip(A, A) {
x = y + 3;
}
and this will bump into https://github.com/chapel-lang/chapel/issues/13147. We discussed this today in an internal meeting. I came away optimistic that we can cover straightforward cases like these rather easily. I cannot tell whether we'll be able to prioritize it at the moment.
Should we use a foreach for code like that custom Frobenius norm?
In the future, yes. That loop reduces into the outer variable sum. To express that, we need vector-lane-private variables in foreach via foreach intents, which we don't support today. Something like:
var sum = 0.0;
foreach a in arr with (+ reduce sum) do
sum += abs(a)*abs(a);
See https://github.com/chapel-lang/chapel/issues/18500 for more on that.
I abhor the concept of using any sort of default initialization except to some undefined value. Does code like
var erro1 : [1..Nx, 1..Ny] real = 0.0;
incur any overhead in Chapel?
I am not aware of an optimization that checks whether the rhs of that statement is actually the default value of the array's element type. So my guess is "yes": that will be an array allocation, followed by default initialization to 0.0, followed by your initialization to 0.0. But, based on your comment, you might be interested in the noinit keyword.
For the case like the one above, we can think of adding that optimization at least for cases where the rhs is param/immediate. If it isn't, I wouldn't be as supportive for that optimization because it would require a dynamic check and code cloning, which are not cost-free.
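A sketch of the noinit keyword mentioned above (hypothetical array): it suppresses default initialization, leaving the elements undefined until you assign them.

```chapel
config const n = 1_000_000;
var a : [1..n] real = noinit;  // allocation only; elements are NOT zeroed
forall i in 1..n do            // so every element must be written before any read
  a[i] = i : real;
```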
@annacfs, as I mentioned, there are also many places where you can avoid divisions and multiplications throughout your code.
FWIW, I tried to hoist the denominator in sor_method outside the nested loops. The improvement was within the margin of error. There must be some more complex optimizations than that, as Damian is alluding to, though.
For now, we just need some rules for how to slice optimally. That said, it really seems strange that, more than a decade after Chapel arrived on the scene,
const ref ari = a[i, ..];
is not the optimal way to refer to the contiguous chunk of memory that is a row of a 2D array. It just makes the most sense; I have been reading technical papers that slice that way for 30+ years. Are you saying that the optimal way to swap rows k and j of an array a is
a[k..k, ..] <=> a[j..j, ..];
I like the idea of a performance guide for Chapel. Writing such a document will also help highlight things that are "not right" and help prioritize their being addressed. Forcing people to refactor their code to use what you call lightweight operators is not really too onerous, in my humble opinion. That is, as long as they are told to do it beforehand, with examples.
On the concept of initialization, if I could be selfish, I would prefer noinit to be applied everywhere and it to be regarded as an error when a variable is used before it is defined. That problem was fixed in the 70s; then C came up with the idea of initializing everything to zero, and that approach sadly won the battle, and too many lazy people prefer it. I/we do not allow implicit initialization in our place and use code reviews to enforce that.

People migrating from Fortran are accustomed to a language which says: "The value of every variable and array element is undefined at the start of program execution unless it has been initialised in a DATA statement. Undefined values may be used in executable statements only in ways which cause them to become defined." And there is also
-finit-real=<zero|inf|-inf|nan|snan>
available to gfortran users, which might be useful to implement at some stage for chpl to make Chapel more attractive to Fortran users who want to migrate. The only Fortran programmers I have ever met who wanted a way to initialize all variables to zero were really, really bad programmers. I have never understood why people want to initialise to zero, because so few of my variables get given that value initially anyway.
Also, what of the three following is best
a[1..n] = b[1..n] * c[1..n];
foreach i in 1..n do a[i] = b[i] * c[i];
foreach (p, q, r) in zip(a[1..n], b[1..n], c[1..n]) do p = q * r;
Once somebody tells me that Chapel is really doing vectorization, I will phrase that question differently.
@annacfs, maybe I have misunderstood, but it seems that in sor_method, you
- pass pnew as a ref and use it as f within the loops,
- return f as the value of the function, and
- assign the result back to pnew.
Why not just use a void function:
pnew[1,2..Ny-1] = pnew[2,2..Ny-1];
pnew[2..Nx-1,1] = pnew[2..Nx-1,2];
pnew[Nx,2..Ny-1] = pnew[Nx-1,2..Ny-1];
pnew[2..Nx-1,Ny] = pnew[2..Nx-1,Ny-1];
sor_method(pnew, rhs, Nx, Ny, dx2, beta, w); // metodo SOR
which avoids the copy. Then again, the optimizer might do that for you.
Also, you store the residual in erro and then proceed to compute the Frobenius norm from that same array erro. But then you never use erro again. Why not compute that norm in place, i.e. during calc_erro, when you compute each residual? That way, you never need the erro array. Or have I misunderstood?
Just FYI,
Why not just use a void function
This is item 2 in my list.
Also, you store the residual in erro and then proceed to compute the Frobenius norm from that same array erro. But then you never use erro again. Why not compute that norm in place, i.e. during calc_erro, when you compute each residual? That way, you never need the erro array. Or have I misunderstood?
This is more or less number 6 in my list.
For now, we just need some rules for how to slice optimally. That said, it really seems strange that, more than a decade after Chapel arrived on the scene,
const ref ari = a[i, ..];
is not the optimal way to refer to the contiguous chunk of memory that is a row of a 2D array.
Note that the overhead is constant and hopefully negligible if you have large enough data. I'll also add that "contiguous chunk of memory" is somewhat nebulous in a language with distributed arrays, which is always an important part of the equation when you think about cutting down some overheads.
Also, what of the three following is best
a[1..n] = b[1..n] * c[1..n]; foreach i in 1..n do a[i] = b[i] * c[i]; foreach (p, q, r) in zip(a[1..n], b[1..n], c[1..n]) do p = q * r;
Without context, the top one. It is the most portable: it'll run and scale fine for distributed arrays. When I see foreach without context, the question is always "why not forall?". foreach is not parallel like forall (it neither uses multiple tasks nor cares about locality); it just signals that the loop body is order-independent. forall does all of that, too, and more. Given that what you are iterating over are ranges or array slices, using forall here will also guarantee order independence and vectorization.
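To make the distinction concrete (hypothetical arrays), here are the three forms side by side:

```chapel
config const n = 1000;
var a, b, c : [1..n] real;

// Promotion: parallel, portable, and scales to distributed arrays.
a = b * c;

// forall: multiple tasks, locality-aware, and asserts order independence.
forall i in 1..n do
  a[i] = b[i] * c[i];

// foreach: a single task, but the iterations are declared order-independent,
// so the backend is free to vectorize them.
foreach i in 1..n do
  a[i] = b[i] * c[i];
```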
My apologies for repeating your Item 2 and Item 6. That will teach me to look at something late on a Friday evening after a long hard week.
I will also wait until @annacfs posts her updated source code which incorporates all the suggested changes to that code.
While not directly related to the work of @annacfs, several points arise from the above.
Firstly, could the last dimension of an N-dimensional array ever be spread across multiple processors?
What is the difference between a[i..i, j..k] and a[i, j..k]? Both refer to a single row of the array a. Both are mathematically a 1x(k-j+1) array, although the first has rank 2 computationally.
The reason why I use foreach is to ensure a loop is never parallel, because I know that the foreach is highly likely to itself be inside a forall loop. I always thought a forall inside another forall was a big no-no, because you would get the processors fighting with each other. Or is the compiler smart enough not to parallelize a forall it sees within another forall? Mind you, that is impossible if the internal forall is hidden inside an external routine.
I had a deeper look at the SOR algorithm and got reminded that this algorithm cannot be written in parallel in a cache-friendly way in Chapel. Even the small amount of vectorization possible has too much branching overhead except for tiny problems.
@e-kayrakli, thanks for the insight.
@annacfs, your work is very instructive and will make an ideal example for the Chapel Performance Guide.
As I said, it would be good if you could publish the code with all the changes suggested to date, and if not too much work, comments in English rather than Portuguese.
In my enhanced version of your Suggestion 6, I meant that you drop calc_erro and the use of any norm routine, and compute errop directly from pnew and rhs. That avoids 2*Nx*Ny memory accesses and the same number of absolute-value computations, and is far more cache-hit friendly. So:
errop = residual_error(pnew, rhs, one/dx2, one/dy2);
where one has the same type as dx2 and dy2 so is something like
param one = 1:real;
Note that I have assumed a future release of the compiler where the slice definitions for Chapel
x[i..i, j..k] AND x[i, j..k]
produce identical results and where the 2nd is at least as optimal as the 1st, if not more so.
This probably allows me to write something like the following (undebugged) code:
proc residual_error(f : [?D] ?R, tf : [D] R, rdx2 : R, rdy2 : R)
{
  // serial sum of squares; make the loop a foreach so it vectorizes
  // (no, I do NOT mean Chapel's norm routine)
  proc sumsq(v : [?vD] R)
  {
    var s = 0:R;
    for t in v do s += t * t;
    return s;
  }
  const (r, c) = D.dims();
  const nx = r.high, ny = c.high;
  const rjm1 = 1..ny-2, rj = 2..ny-1, rjp1 = 3..ny;
  var erow : [2..nx-1] R;
  for i in 2..nx-1 // this should be a forall when you want parallel performance
  {
    const ref fim1 = f[i-1, ..];
    const ref fi   = f[i,   ..];
    const ref fip1 = f[i+1, ..];
    const ref tfi  = tf[i,  ..];
    const ref vjm1 = fi[rjm1];
    const ref vj   = fi[rj];
    const ref vjp1 = fi[rjp1];
    const t = tfi - ((fim1 - fi) + (fip1 - fi)) * rdx2;
    const s = t[rj] - ((vjm1 - vj) + (vjp1 - vj)) * rdy2;
    erow[i] = sumsq(s); // sumsq squares each element, so no s * s here
  }
  return sqrt(+ reduce erow); // each erow[i] already holds a sum of squares
}
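For completeness, a hypothetical call site for residual_error (array names and sizes invented for illustration; this assumes the procedure above is in scope and compiles as written):

```chapel
config const Nx = 128, Ny = 128;
const dx2 = (1.0 / (Nx - 1))**2, dy2 = (1.0 / (Ny - 1))**2;
param one = 1.0;
var pnew, rhs : [1..Nx, 1..Ny] real;  // stand-ins for the solver's arrays

// errop is the l2 norm of the discrete Poisson residual
const errop = residual_error(pnew, rhs, one / dx2, one / dy2);
writeln("residual = ", errop);
```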
I am assuming that those 2 vector statements to compute t and s are the optimal expression of that mathematics, and that individual foreach statements would not be faster. I have no/too-little/some/lots experience with 1.26/1.25.1/1.22/1.18, so my statement could be totally without any firm foundation in terms of the state of the Chapel compiler today.
Note that I have assumed a future release of the compiler where the slice definitions for Chapel
x[i..i, j..k] AND x[i, j..k]
produce identical results and where the 2nd is at least as optimal as the 1st, if not more so.
I agree that we should have good performance for those two expressions. However, as a somewhat pedantic comment: these are not identical semantically. The first one is a "slice view" whereas the other is a "rank-change view". The implication is that the first one is a 2D view: if you assign it to a var x, that x will be a 2D array (where one dimension is of size 1), and to access x you'll have to use 2D indices. The second one is a true 1D view, and if you assign it to a var, it'll be a 1D array.
So, it is not totally unimaginable for the two to have slightly different performance. But I don't think we have spent enough time tuning those to claim that the difference you see today is because of that inherent difference.
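A tiny (untested) snippet that shows the semantic difference described above:

```chapel
var A : [1..4, 1..4] real;

var S = A[2..2, 1..4];  // slice: S is a 2D array, one dimension of size 1
var R = A[2, 1..4];     // rank-change: R is a true 1D array

writeln(S.rank, " ", S.domain);  // 2 {2..2, 1..4}
writeln(R.rank, " ", R.domain);  // 1 {1..4}
```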
Thanks for the explanation. Very clear.
My use of the word slice is the more general one, which comes from Algol68 in the early 70s and permeated into vector Fortran of the mid 70s before it became standardized in Fortran 9X. In mathematics, you can partition a matrix into sub-matrices, or into individual rows or columns which are treated as vectors. This corresponds, in Algol68, to slicing (say) a 2D array into a 2D sub-array, or slicing it into a 1D array. Algol68 was not alone in using the same concept for both types of operations; PL/1 did the same, but called them cross-sections. In all cases, such slicing (or cross-sectioning) worked for N-dimensional matrices, and the result depended on which index you held constant (or on how you wrote it if you wanted a sub-matrix, i.e. a 2D array).
Sorry, old habits die hard. I will try and be more precise in the future.
The penny dropped after 10 years of using Chapel. Chapel has views into an array, not slices.
(conventional) trimmed slice = (chapel) sliced view
where a trimmer is what Chapel calls being indexed by a range i..j (in Chapelese) and
(conventional) indexed slice = (chapel) rank-change view
where an index in conventional-speak is just an integral value.
But then reading the 1.26 documentation it seems like it is more like
(conventional) trimmed slice = (chapel) slicing
where slicing is done by ranges in all dimensions, and
(conventional) indexed slice = (chapel) slicing with rank-change
where slicing in at least one dimension is an integral value.
Arrrgh. I am starting to get lost in the semantics.
@annacfs, has any of this discussion so far been of use?
In case what we are suggesting is not useful, can you share both your project scope and project plan with us? Do you have any intermediate milestone reports that you can share so that others can benefit from your learning experience with Chapel? I am personally curious at the problem size, the number of cells/elements/volumes and even the problem you are trying to address? Is it atmospheric flow?
What percentage of the overall task is going to be spent parallelising the problem? Do you have a plan for which parallel SOR algorithm you will use, or will you use some alternative algorithm? Is that the chunk of code that will consume the most time? Do you expect the time spent accessing memory to dominate that spent on floating point operations? Also, why the level of complexity in computing the error during the solution of the Poisson equation? Are you trying to do some prediction of the next step to avoid that last iteration prior to convergence? Why does the difference of successive values of pnew not suffice? Just curious, as that would save time.
Do you have any words of wisdom for others who are considering translating a Fortran program into Chapel? Do you already have a Fortran solution to the 3D problem?
Thanks.
Hi guys, sorry for the delay. I checked my code with MMS (the Method of Manufactured Solutions, https://asmedigitalcollection.asme.org/fluidsengineering/article-abstract/124/1/4/462791/Code-Verification-by-the-Method-of-Manufactured?redirectedFrom=fulltext). It's correct! My question was about the time of the CONV and VISC loop. Just that.
I did a new implementation using domSpace
I was not questioning the correctness of the code. By the way, the 2nd link does not resolve. It gets an error.
Could you post the latest version of your code please?
You mentioned
My question was about the time of the CONV and VISC loop. Just that.
Are you saying that your questions about Chapel are related solely to that loop?
I assume you have moved on from your serial comparison.
You can simplify the CONV and VISC loop algebraically, and that then allows you to have a simpler code which should ultimately vectorize (when Chapel better supports vectorization). Even so, it should run faster even without vectorization.
That also means the outer for loop involving CONV & VISC can be parallelised as a forall.
Can you post your reworked code? Do you mean you used dmapped space?
Given that @annacfs has been quiet about our points here for a while now, I think we should close this issue.
@damianmoz do you agree? @annacfs should feel free to reopen, if they intend to continue the conversation and respond to our comments.
I agree to close this issue. It has highlighted and resolved quite a few of the issues that people moving from Fortran face. Quite a useful and helpful issue. Your input was key to those resolutions, Engin. Thanks.
Summary
I'm implementing a finite volume code to solve the Navier-Stokes equations. The objective of this work is to test the Chapel language on a fluid mechanics solution.
Array loops consume a lot of simulation time compared to Fortran. Is there an efficient method in Chapel?
For example, in Chapel:
Time: 1.0900E-04
In Fortran (program cavity2D):
Time: 1.9000E-05
to be continued...
In Chapel ("Viscoso e convectivo" = viscous and convective):
CPU TIME CHAPEL = 1.1000E-05 Viscoso e convectivo
CPU TIME CHAPEL = 1.0220E-03 Viscoso e convectivo
CPU TIME CHAPEL = 2.0370E-03 Viscoso e convectivo
CPU TIME CHAPEL = 3.0460E-03 Viscoso e convectivo
CPU TIME CHAPEL = 4.0910E-03 Viscoso e convectivo
CPU TIME CHAPEL = 5.0920E-03 Viscoso e convectivo
CPU TIME CHAPEL = 6.1020E-03 Viscoso e convectivo
CPU TIME CHAPEL = 7.1030E-03 Viscoso e convectivo
CPU TIME CHAPEL = 8.1160E-03 Viscoso e convectivo
CPU TIME CHAPEL = 9.1330E-03 Viscoso e convectivo
CPU TIME CHAPEL = 1.0133E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.1143E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.2148E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.3149E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.4147E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.5160E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.6157E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.7152E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.8165E-02 Viscoso e convectivo
CPU TIME CHAPEL = 1.9168E-02 Viscoso e convectivo
CPU TIME CHAPEL = 2.0166E-02 Viscoso e convectivo
CPU TIME CHAPEL = 2.1163E-02 Viscoso e convectivo
CPU TIME CHAPEL = 2.2166E-02 Viscoso e convectivo
CPU TIME CHAPEL = 2.3161E-02 Viscoso e convectivo
CPU TIME CHAPEL = 2.4160E-02 Viscoso e convectivo
CPU TIME CHAPEL = 2.5158E-02 Viscoso e convectivo
CPU TIME CHAPEL = 2.6156E-02 Viscoso e convectivo
In Fortran:
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 4.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
CPU TIME FORTRAN = 3.0000E-06 Viscoso e convectivo
Compile commands:
source chplenv.sh
chpl --fast cavity2D.chpl
gfortran -o post cavity2D.f90
Execution commands:
./cavity2D
./post
Configuration Information
chplenv.sh
Files can be download from:
https://github.com/annacfs/RansChapel/tree/annacfs-cavity2d
cavity2D_chpl.txt cavity2D_f90.txt chplenv_sh.txt
@thomasrolinger @buddha314 @ty1027