IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs
BSD 2-Clause "Simplified" License

More stencil examples: 3D Finite Difference #97

Open thoth291 opened 8 years ago

thoth291 commented 8 years ago

There are two interesting articles on Intel's blog:

I like this one, as it addresses the most common optimizations for such algorithms: https://software.intel.com/en-us/articles/eight-optimizations-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso

This one is also good and worth considering: https://software.intel.com/en-us/articles/understanding-numa-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso

I was wondering whether your team plans to work on these examples and show some comparisons on Xeon and Xeon Phi architectures.

P.S. Just comparing the readability of the initial version of the code in those articles with the final one, I wonder whether ParallelAccelerator could deliver better readability while keeping the performance more or less optimal.
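For instance, the readable starting point in those articles is essentially just a plain loop nest. Schematically, in Julia (a simplified 7-point sketch, not the actual ISO-3DFD kernel from the article):

```julia
# Schematic only: the naive, readable form of a 3D finite-difference sweep
# is a plain triple loop over the interior points.
function fd3d_naive!(out, u)
    nx, ny, nz = size(u)
    for k in 2:nz-1, j in 2:ny-1, i in 2:nx-1
        out[i, j, k] = u[i-1, j, k] + u[i+1, j, k] +
                       u[i, j-1, k] + u[i, j+1, k] +
                       u[i, j, k-1] + u[i, j, k+1] - 6 * u[i, j, k]
    end
    return out
end
```

The final optimized version in the article, by contrast, is buried under blocking, intrinsics, and prefetching, and it is hard to see the stencil in it anymore.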

Thanks in advance!

ninegua commented 8 years ago

Thank you for the suggestion, @thoth291. Indeed, we were well aware of the links you gave, and I share the belief that much of the heavy lifting in optimizing stencils can be automated by a compiler like ParallelAccelerator. That being said, we just haven't had the time to do a deep dive into stencils. The most obvious opportunity we can easily reap is cacheline blocking.

Let me give you a brief intro to how stencils are implemented within ParallelAccelerator. We provide a sequential implementation that describes the semantics of the stencil DSL as part of api-stencil.jl. In the same file, we also provide a macro translation that is a faster sequential implementation. The macro translation is used if you have @acc but run the program with the environment setting PROSPECT_MODE=none. Our parallel translation is not much different from the macro one, except that it builds a typed AST and uses other internal facilities from ParallelAccelerator and CompilerTools.
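To make that concrete, a sweep written against the DSL looks roughly like this (an illustrative sketch, not code taken from our examples; a 7-point 3D stencil using runStencil):

```julia
using ParallelAccelerator

# Illustrative sketch: a 7-point 3D stencil in the runStencil DSL.
# Relative indices address neighbors of the current point, and :oob_skip
# simply skips points whose neighbors fall outside the array.
@acc function sweep3d!(out, u, iterations)
    runStencil(out, u, iterations, :oob_skip) do b, a
        b[0,0,0] = a[-1,0,0] + a[1,0,0] +
                   a[0,-1,0] + a[0,1,0] +
                   a[0,0,-1] + a[0,0,1] - 6 * a[0,0,0]
        # returning (a, b) swaps the two buffers between iterations
        return a, b
    end
end
```

With @acc this goes through the parallel translation; with PROSPECT_MODE=none it falls back to the macro (sequential) translation described above.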

So yes, I'd say there is some low-hanging fruit here. But in general, there exist many different stencil optimization techniques, and only a few were covered in the articles you quoted. We have not decided on a direction yet, mostly because our development is driven by workloads. You are also welcome to contribute if there is a stencil workload that you are interested in optimizing.
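To make the cacheline-blocking remark above concrete, here is a rough hand-written sketch in plain Julia of what a blocked loop nest looks like (tile sizes are arbitrary); this is the kind of transformation we would want the compiler to derive from the DSL automatically:

```julia
# Hand-blocked 7-point 3D stencil, plain Julia, purely to illustrate the
# transformation.  bj and bk are tile sizes chosen so that the working set
# of the inner loops stays in cache; good values are machine dependent.
function sweep3d_blocked!(out, u; bj = 16, bk = 16)
    nx, ny, nz = size(u)
    for k0 in 2:bk:nz-1, j0 in 2:bj:ny-1
        for k in k0:min(k0 + bk - 1, nz - 1), j in j0:min(j0 + bj - 1, ny - 1)
            @inbounds @simd for i in 2:nx-1   # stream along the contiguous dimension
                out[i, j, k] = u[i-1, j, k] + u[i+1, j, k] +
                               u[i, j-1, k] + u[i, j+1, k] +
                               u[i, j, k-1] + u[i, j, k+1] - 6 * u[i, j, k]
            end
        end
    end
    return out
end
```

The point of the DSL is that users should never have to write this second version by hand.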

thoth291 commented 8 years ago

I think I'll need to know a lot more about Julia in order to make a somewhat reasonable contribution. Most of the stencils I deal with come from finite differences or wavelet transforms.

So having a Julia package that could simplify prototyping some really weird FD or WT codes, without performance suffering for the sake of readability, would be a big thing for me.

I'm still looking for a framework for all my experiments, but ParallelAccelerator is the closest option out of everything I have looked at so far...

Thanks for your reply. Once I have some free time, I'll implement a few examples in C and Julia to see the comparison... Then there will be something in particular to discuss.

P.S. But please consider improving the stencil API, as I believe it is really one of the most promising features.

ChrisRackauckas commented 7 years ago

I'd be interested in optimized general nD Laplacian stencils (for DifferentialEquations.jl). I don't know if I'd like to have the entirety of ParallelAccelerator as a dependency for that though.
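For concreteness, the kind of thing I mean is a small dependency-free generic kernel, something like this rough sketch (laplacian! is just a placeholder name; interior points only, unit grid spacing):

```julia
# Rough sketch of a generic N-dimensional Laplacian over interior points,
# plain Julia with no ParallelAccelerator dependency.
function laplacian!(out::AbstractArray{T,N}, u::AbstractArray{T,N}) where {T,N}
    inner = CartesianIndices(ntuple(d -> 2:size(u, d) - 1, N))
    for I in inner
        s = -2N * u[I]
        for d in 1:N
            e = CartesianIndex(ntuple(i -> i == d ? 1 : 0, N))
            s += u[I + e] + u[I - e]
        end
        out[I] = s
    end
    return out
end
```

plus whatever blocking and threading make it fast, without pulling in a whole compiler.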

timholy commented 7 years ago

@thoth291, there are some examples of this in ImageFiltering, including a generic multithreaded implementation for separable kernels in N dimensions.

However, for @ChrisRackauckas's purposes, the Laplacian is treated as a special case, and I don't think I've implemented parallel computation for it yet (it would not be hard, though).
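Roughly (from memory, so the exact calls may need checking against the current ImageFiltering API):

```julia
using ImageFiltering

A = rand(Float32, 64, 64, 64)

# Separable 3D Gaussian: the factored kernel is applied one dimension at a
# time, which is the code path with the multithreaded implementation.
smoothed = imfilter(A, KernelFactors.gaussian((1.0, 1.0, 1.0)))

# The Laplacian is handled as its own (non-separable) kernel type.
lap = imfilter(A, Kernel.Laplacian((true, true, true)))
```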