How much auto-inlining should the compiler do, if any

jabraham17 commented 1 year ago

Currently the only time the Chapel compiler will inline code is if the function is prefixed with inline. Can/should we support automatic inlining on top of whatever inlining the backend does?

Pros

Better optimization opportunities at call site
Potential better optimization for GPU code
- Offline discussion on using inlining to facilitate better LICM (cray/chapel-private#4737)

Cons

decreased compilation time
bigger generated executables
- I am not sure we will see significant issues here given the backends already do some inlining, but it is possible

If we decide we want to support this in the compiler, what should our criteria be for it? As a start, we could specify an inline threshold. Any function smaller than that threshold would be automatically inlined. This could be different values for CPU vs GPU. Perhaps also using the number of calls to influence it as well.

bradcray commented 1 year ago

I don't think we're at all philosophically opposed to having our compiler do more automated inlining and that it's the "what would our criteria be?" question that has prevented us from doing it before now (combined with a sense that if it's sufficiently obvious that something should be inlined, the back-end compiler would probably do it).

Any function smaller than that threshold

Smaller by what metric? (e.g., if "Chapel source lines", one Chapel source line can expand to a ton of code—e.g., a scalar routine promoted by a distributed array).

This also seems like one of those optimizations that might lead to wanting "optimize for execution time" / "optimize for compilation time / binary size"-style compiler flags.

jabraham17 commented 1 year ago

Smaller by what metric?

Maybe in terms of AST nodes, although I am not sure that is a good measure of the size of the code generation. Different nodes could have different weights to them to help mitigate this, but I am not sure that is a perfect solution either.

This also seems like one of those optimizations that might lead to wanting "optimize for execution time" / "optimize for compilation time / binary size"-style compiler flags.

I thought I put this in the OP, but I guess it slipped my mind. I would envision supporting something akin to the LLVM flag -inline-threshold instead. This can be set lower by default and increased for --fast. This takes care of "optimize for execution time" / "optimize for compilation time", but not "optimize for binary size".

e-kayrakli commented 1 year ago

I support exploring this direction in general. We have seen this in different contexts from GPU side of things:

Clang does more aggressive inlining for GPU code paths. Function calls and in general control-flow changes in GPU are more costly. (Mentioned in https://github.com/Cray/chapel-private/issues/4743)
We've observed that our LICM improves GPU execution noticeably when it fires for array metadata. However, it is not very easy to make it work for (Mentioned in https://github.com/Cray/chapel-private/issues/4737#issuecomment-1548693073):

on here.gpus[0] {
  var A: [1..n] int;
  foreach i in 1..10 {
    foo(A,i);
  }
}

proc foo(x,i) { 
  // do stuff with x[i]
}

It is because the function strictly needs an array to be passed, which implies that A still has to be passed to the kernel, and the metadata needs to be dug through in foo. If we could inline that function before LICM fires (inlining happens before LICM today), it could deliver better performance both because of inlining and better LICM prospects. From that perspective a starting heuristic would be to inline functions that take array arguments.

If we decide we want to support this in the compiler, what should our criteria be for it? As a start, we could specify an inline threshold. Any function smaller than that threshold would be automatically inlined. This could be different values for CPU vs GPU. Perhaps also using the number of calls to influence it as well.

Related to the GPU case, but also this: an auto-inlining strategy could very well inline functions into some call sites and leave them un-inlined for others. If you see a call in a tight loop, probably you can be more aggressive while inlining the function in there. It shouldn't mean you should inline the function at every call-site because of that, though.

Relation to what I say above is that function calls can be inlined more aggressively into GPU kernels, similar to the tight loop example above.

mppf commented 1 year ago

(I haven't read all of the above, but here are my 2¢). I think that we should leave the inlining to LLVM optimizations. If we want to do LICM of array metadata after inlining, I think we should create an LLVM optimization that LICMs our array metadata and plug that into the LLVM optimization pipeline. This will necessarily involve finding a way to propagate certain information through the LLVM IR to the optimization (which could be as simple as relying on type names, but might be custom attributes/metadata, or our own intrinsics). The short story here is that I think the LLVM IR is a more reasonable representation for optimization and analysis (including things like alias analysis) than the Chapel AST.

chapel-lang / chapel

How much auto-inlining should the compiler do, if any #22346