hvdieren commented 8 years ago

Hi,

We have been seeing slow performance with Cilkplus and clang for tight loops using hyperobjects. Apparently, the Intel compiler manages to hoist __cilkrts_hyper_lookup() out of the critical loops while clang leaves the call in the inner loop, causing significant performance degradation (we have observed up to 2x).

Will you be looking into this?

Kind regards, Hans Vandierendonck

andreybokhanko commented 8 years ago

Hi Hans,

I don't think this particular optimization will be on our radar in the near future.

But patches are welcome (hint, hint)!

Yours,

Andrey

Software Engineer, Intel Compiler Team

In reply to Hans Vandierendonck's message of Wed, Jul 13, 2016 at 2:11 AM.

hvdieren commented 8 years ago

Andrey,

Thanks for the quick response. Hint taken. Is it possible to discuss, perhaps off-line, on what basis the Intel compiler decides to hoist the call out of loops?

Kind regards, Hans.

mikerice1969 commented 8 years ago

Hi Hans,

Please take a look at section 9.10 on page 11 of https://software.intel.com/sites/products/cilk-plus/cilk_plus_abi.pdf.

Does that answer your question?

Mike

phalpern commented 8 years ago

The key section of the ABI document that Mike referred to is this:

> 9.10
> void __cilkrts_hyper_create(__cilkrts_hyperobject_base *key);
> void __cilkrts_hyper_destroy(__cilkrts_hyperobject_base *key);
> void* __cilkrts_hyper_lookup(__cilkrts_hyperobject_base *key);
>
> These functions are called by the reducer library to implement reducers. These are normal function calls, from the standpoint of calling conventions. However, the compiler writer should be aware that __cilkrts_hyper_lookup() will return the same value each time it is called with the same key until the next spawn, sync, or call to __cilkrts_hyper_destroy() for that key. This fact allows the compiler to lift the lookup call out of serial loops, etc., in order to avoid excessive lookup overhead.
>
> Also, it is not possible for two different keys to return the same value from lookup. Thus, if the compiler can determine that two key pointers are distinct, then it can also assume that the results of calling lookup on the key pointers are also distinct.

In other words, a hyper-object lookup can be hoisted out of any loop that does not contain a spawn or sync. I'd be happy to discuss further over email, if you want. It'd be great if this optimization got implemented in clang/LLVM.
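To make the hoisting concrete, here is a minimal sketch in plain C++ that does not use the real Cilk Plus runtime. `mock_hyperobject_base`, `mock_hyper_lookup`, and the call counter are hypothetical stand-ins invented for illustration; the only property they model is the ABI guarantee that a lookup on the same key returns the same pointer as long as no spawn, sync, or destroy intervenes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for __cilkrts_hyperobject_base (NOT the real type):
// each mock key owns exactly one view.
struct mock_hyperobject_base {
    long view;
};

// Counts how many times the mock lookup is called.
static std::size_t g_lookup_calls = 0;

// Hypothetical stand-in for __cilkrts_hyper_lookup(): for a given key it
// always returns the same pointer, mirroring the ABI guarantee that holds
// between spawns and syncs.
void* mock_hyper_lookup(mock_hyperobject_base* key) {
    assert(key != nullptr);
    ++g_lookup_calls;
    return &key->view;
}

// Unhoisted form: the lookup is repeated on every iteration.
long sum_lookup_in_loop(mock_hyperobject_base* key, const int* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        long* view = static_cast<long*>(mock_hyper_lookup(key));
        *view += data[i];
    }
    return key->view;
}

// Hoisted form: the loop body contains no spawn, sync, or destroy,
// so one lookup before the loop yields the same pointer every time.
long sum_lookup_hoisted(mock_hyperobject_base* key, const int* data, std::size_t n) {
    long* view = static_cast<long*>(mock_hyper_lookup(key));
    for (std::size_t i = 0; i < n; ++i) {
        *view += data[i];
    }
    return key->view;
}
```

Both functions compute the same sum; the difference is that the first performs `n` lookups and the second performs one. In the real runtime the lookup is an opaque call, so a compiler presumably needs to be told about the guarantee above before it can make this transformation on its own.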

cilkplus / clang

Slow execution on tight loops with hyperobjects #22
