CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License

Should we make z the fast index? #470

Closed ali-ramadhan closed 4 years ago

ali-ramadhan commented 4 years ago

I think for Oceananigans it may not matter much, but it seems to matter a lot for the atmosphere, so it may be good to match the atmosphere's convention so that we can share operators.

Here's @thabbott's argument for making z the fast index for an atmospheric LES model (taken from https://github.com/thabbott/JULES.jl/pull/16):

I think that there are three strong arguments to be made in favor of making z the fast index for this model:

  1. Radiative transfer operates in columns
  2. Rain falls downward relative to the air it's embedded in
  3. Vertical momentum is solved implicitly within columns on the acoustic time step

This means that these three calculations can operate completely independently over different columns and (at least for calculations using a CPU) it's more efficient to store variables in the same column close to each other in memory. I don't think we have to change the function signatures relative to Oceananigans to make z the fast index, though---we just have to change the indexing order in differences.jl and interpolation.jl.

I think @vchuravy said that this might change with GPUs and/or shared memory, so maybe it's not super clear whether the change is worth making, although for CPUs it will probably help a lot.
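To make the memory-layout point concrete, here is a small sketch (array sizes are arbitrary and none of this is Oceananigans code): Julia arrays are column-major, so whichever dimension comes first in storage is the one that is contiguous in memory.

```julia
# Julia arrays are column-major: the first index varies fastest in memory.
Nx, Ny, Nz = 32, 32, 16

a_xfast = zeros(Nx, Ny, Nz)   # usual field[i, j, k] layout: x is contiguous
a_zfast = zeros(Nz, Nx, Ny)   # storage reordered so z is contiguous

# Memory distance (in elements) between vertically adjacent grid points:
strides(a_xfast)[3]   # Nx * Ny = 1024 -> a column is scattered through memory
strides(a_zfast)[1]   # 1              -> a column is contiguous
```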

glwagner commented 4 years ago

The only thing we need to do to achieve this is to make a wrapper array type with a new getindex that permutes indices the way we want.

We can even use a different permutation on the CPU and GPU.

For loop ordering, we should write a macro that prints the loop statements. That way we can globally change the loop ordering by changing one line if we want to.
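A minimal sketch of the wrapper-array idea (the type name below is hypothetical, not Oceananigans code): the data are stored z-fast, but getindex still takes (i, j, k) in the usual order.

```julia
# Hypothetical sketch: data stored with z first in memory,
# but indexed as field[i, j, k] like every other field.
struct ZFastArray{T, A <: AbstractArray{T, 3}} <: AbstractArray{T, 3}
    data :: A   # stored with size (Nz, Nx, Ny)
end

Base.size(a::ZFastArray) = (size(a.data, 2), size(a.data, 3), size(a.data, 1))
Base.@propagate_inbounds Base.getindex(a::ZFastArray, i::Int, j::Int, k::Int) = a.data[k, i, j]
Base.@propagate_inbounds Base.setindex!(a::ZFastArray, v, i::Int, j::Int, k::Int) = (a.data[k, i, j] = v)

# Usage: allocate z-fast storage, index x, y, z as usual.
u = ZFastArray(zeros(16, 32, 32))   # Nz = 16, Nx = Ny = 32
u[1, 1, 1] = 1.0
size(u)                             # (32, 32, 16)
```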

christophernhill commented 4 years ago

Hmmm - I am skeptical about that from a computer-architecture point of view? In an implicit algorithm and/or a radiative transfer algorithm, the next step often depends on the result of the previous steps.

As every CCE graduate student learns, computers take many tens of cycles to evaluate an operation like an add or a multiply. The operation is sequential and carried out in a multi-stage pipeline in the heart of a CPU (or GPU). So unless the compiler has something else for the processor to do, the processor will have to wait for one step to make it through the pipeline before the next step?

I think the normal way to do this is to have some inner horizontal blocking that is flexible (and can be 1×1) and then iterate over levels with some intermediate stores? The horizontal block can be sized to some fraction of the inner cache or of the GPU's per-processor shared memory.

The math doesn't quite look this way because it assumes that a+b and/or a*b etc. happen "instantaneously". It does not take into account that the answer from a*b might take 5-10 clock cycles to pass through the CPU floating-point unit.

I think that is fairly generally true? Functional-style code in Julia should make it possible to express this in a fairly clean way, but with the flexibility to change blocking for different target architectures.
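A rough sketch of the blocking described above (the function, the computation, and the tile sizes are made up for illustration): sweep over small horizontal tiles, and inside each tile march over levels, so the level-to-level dependence in any one column can overlap with work on the other columns in the tile.

```julia
# Illustrative only: a column-wise running sum (e.g. a vertical integral)
# computed tile by tile. Within a tile the columns are independent, so the
# processor can keep its pipeline full even though each column is sequential in k.
function columnwise_accumulate!(dst, src; bx = 8, by = 1)
    Nx, Ny, Nz = size(src)
    for j0 in 1:by:Ny, i0 in 1:bx:Nx                      # loop over horizontal tiles
        is, js = i0:min(i0 + bx - 1, Nx), j0:min(j0 + by - 1, Ny)
        for j in js, i in is                              # initialize the bottom level
            dst[i, j, 1] = src[i, j, 1]
        end
        for k in 2:Nz                                     # sequential in the vertical
            for j in js, i in is                          # independent columns in the tile
                dst[i, j, k] = dst[i, j, k-1] + src[i, j, k]
            end
        end
    end
    return dst
end
```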

christophernhill commented 4 years ago

P.S. that was a comment on the @ali-ramadhan comment, not the @glwagner comment. Greg is correct that some code generation can help, although sometimes it's cleaner just to write elegant code than to get carried away with meta-programming (e.g. see Steve J's meta-programming regrets talk https://www.youtube.com/watch?v=mSgXWpvQEHE).

thabbott commented 4 years ago

In hindsight I think that @christophernhill is probably right about pipeline stalls becoming a bottleneck if we use z as the fast index. (Which leaves me unsure why it's common in the atmospheric models I've used.... Maybe just prioritizing simplicity over performance?)


glwagner commented 4 years ago

@christophernhill the macro I am suggesting is just simple copy-and-paste (no manipulation of ASTs, etc.) --- for the purpose of reducing the large amount of boilerplate associated with our loop / kernel definitions. If we do that, we can change the ordering of 3D loops by changing the macro, rather than by changing every kernel function.
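A sketch of the kind of macro being described (the macro name and grid fields below are hypothetical, not Oceananigans code): the loop nesting order lives in one definition, so switching from x-fast to z-fast traversal means editing one line rather than every kernel.

```julia
# Hypothetical loop macro: the nesting order is defined once, here.
# Making k the innermost loop instead would change the traversal order for
# every kernel that uses the macro, without touching the kernels themselves.
macro loop_over(grid, body)
    esc(quote
        for k in 1:$(grid).Nz, j in 1:$(grid).Ny, i in 1:$(grid).Nx   # i is innermost (fast)
            $body
        end
    end)
end

# Usage: i, j, k are provided by the macro.
grid = (Nx = 4, Ny = 4, Nz = 8)                  # stand-in for a grid object
a = zeros(grid.Nx, grid.Ny, grid.Nz)

@loop_over grid begin
    a[i, j, k] = i + 10*j + 100*k
end
```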

But I hear you. I wonder what you think about PR #463...

glwagner commented 4 years ago

Functional-style code in Julia should make it possible to express this in a fairly clean way, but with the flexibility to change blocking for different target architectures.

Right --- we can use a wrapper that redefines getindex (and that is potentially also associated with a certain loop ordering). Looks like there's a Julia Base type that does this:

https://github.com/JuliaLang/julia/blob/master/base/permuteddimsarray.jl

I doubt it would be very hard to figure out how to abstract away loop ordering so it can be flexibly changed for a target architecture.

One question still remains: which is the most intuitive convention for getindex calls? I personally like i, j, k for x, y, z.
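For reference, a small usage sketch of Base's PermutedDimsArray with the i, j, k for x, y, z convention (sizes are arbitrary): the parent is stored z-fast, but the wrapper exposes the usual index order without copying.

```julia
# The parent array is stored with z as the first (fast) dimension; the
# PermutedDimsArray view exposes (i, j, k) -> (x, y, z) indexing with no copy.
parent_data = zeros(16, 32, 32)                  # (Nz, Nx, Ny) in memory
u = PermutedDimsArray(parent_data, (2, 3, 1))    # exposes size (Nx, Ny, Nz) = (32, 32, 16)

parent_data[5, 3, 4] = 1.0
u[3, 4, 5] == 1.0                                # same element, same storage
```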

christophernhill commented 4 years ago

@thabbott I am not sure about other atmosphere codes either. In the past, when long vectors were the norm, I think they must have had to do some kind of blocking? These days, in the GPU and AVX-512-and-beyond world, it seems like that would be good.

#463 looks like a good use of some macros.