jacobhinkle commented 5 months ago

I'd like to facilitate a discussion on how to represent hierarchical programs in our IR. This topic has come up repeatedly in various contexts so there might be an opportunity to introduce a nice abstraction for it.

Current status

Our IR currently has IrContainer which is the base class for Fusion. IrContainer owns all the Statements in the "fusion", and the Fusion class mostly manages inputs and outputs; this includes updating val->uses() when expressions are registered, since this can change the reachability of each statement. Any other child classes of IrContainer are themselves children of Fusion (namely, kir::Kernel).

`Expr`s are the simplest subfusions

In some sense, the core function of a Fusion (managing inputs/outputs) overlaps that of Expr. Namely, both represent some computation that takes in a collection of Vals and outputs some other Vals. This is the only sense in which we currently support "sub-fusions"; i.e. since a Fusion contains multiple Exprs, we can think of it as a hierarchy of subfusions.

The purpose of this proposal is to represent subfusions at intermediate granularities between "entire fusion" and "single expression".

Use cases

Partitioning `Fusion`s into tasks

Partitioning the graph and executing segments separately from one another. This is important for

Segmentation. Currently segmentation requires a specialized container called SegmentedFusion which tracks collections of edges and groups of expressions representing segments.
Tasks for multidevice programs
Task groups for warp specialization #210

During segmentation, to analyze each segment we have a guard class that swaps the fusion inputs and outputs so that analysis can use our standard traversal utilities. Instead, it might be useful to represent segments during segmentation as subfusions that overlay the base fusion. This idea could be generalized to handle task parallelism

`ExpressionEvaluator` patterns (to replace composite IR nodes)

Recently, we introduced IR nodes for matmul patterns (#2175, #2240, #2294). These have greatly simplified our ATen fallback evaluation mode, replacing complex pattern matching code. However, fusion of these patterns still requires decomposition (#2236) into smaller primitives. If we had a working subfusion system, we might do both at once at program definition. For example, our linear function might create BroadcastOp nodes as well as a MmaOp node, add bias with BinaryOp then cast to reduced precision using UnaryOp. All these ops would then be grouped into a subfusion that is identified as a "Linear" pattern so that evaluation of its output could easily be identified and computed using ATen.

Lambdas for generalizing reduction/scan

When discussing #622 recently, @naoyam pointed out that representing lambdas directly in the tensor IR tends to clutter and complicate the fusion (see #2307). With the right abstraction, we could potentially maintain a lambda function as a subfusion that is disconnected from the main Fusion inputs and outputs, so that we would no longer need special IterTypes and we could use a single op like FoldOp to represent a reduction or scan. Note that we could still schedule the lambda as desired but we would not need to take as much care with inlining and allocation concerns if it is disconnected and represented in this way.

Possible approach

One approach would be to keep the current system intact, but add a new Task type like this

class Task : PolymorphicBase {
 public:
  const std::vector<Val*>& inputs() const;
  const std::vector<Val*>& outputs() const;
  // compute uses of Val* within this Task
  const std::vector<Expr*>& uses(Val*);
  // ...
};

Fusion could own these Tasks just like it currently owns Statements. Task could potentially be made a subclass of Statement to facilitate cloning.

We could generalize Val::isFusionInput() and Val::isFusionOutput() with the following:

class Val : public Statement {
 public:
  // ...

  // compute tasks having val as an input
  std::vector<Task*> taskUses() const;
  // compute tasks having val as an output. Note that multiple Tasks might output the same
  // Val as they may be nested.
  std::vector<Task*> taskDefinitions() const;
};

Does this address the use cases above?

No translation would be required for the matmul use case since we would create the full decomposed fusion at definition, and also create a matmul Task at that time. The ExpressionEvaluator could use this to check for supported fallback patterns by first checking for supported val->taskDefinitions() then checking val->definition().
Likewise, we could have a class Segment : public Task {}; so that creating and modifying segments doesn't alter the fusion but just modifies Tasks. Then checking whether a Val is a segmentation edge just means finding a task in val->taskDefinitions() or val->taskUses() that isA<Segment>().
For fold/scan, e could create a Task to use as a lambda, then just attach it to our FoldOp or ScanOp as an attribute.
I am not sure yet how this would impact multidevice scheduling or how it might help us implement warp specialization.

### Tasks
- [ ] Come up with a better name than "Task"
- [ ] Remove inheritance of `IrContainer` from `Fusion`. Instead, `Fusion` will hold a `shared_ptr<IrContainer>`, accessible via `IrContainer* container() const`.
- [ ] Create "`Task`" with example of a task-specific traversal
- [ ] Identify issues and address...
- [ ] Adapt segmentation to use "tasks"
- [ ] Adapt MatmulOp etc. to use tasks

jacobhinkle commented 5 months ago

cc @naoyam @zasdfgbnm @kevinstephano

zasdfgbnm commented 5 months ago

Is Task a subclass of PolymorphicValue, or PolymorphicBase? Can it be a subclass of Expr, or the other way (Expr is a subclass of Task?) Or, can Fusion be a subclass of Expr or Task?

jacobhinkle commented 5 months ago

Is Task a subclass of PolymorphicValue, or PolymorphicBase?

Thanks for noticing the typo! PolymorphicBase so that we can easily check what type of Task we're dealing with, as we do with Expr.

Can it be a subclass of Expr, or the other way (Expr is a subclass of Task?)

I think we could potentially make Expr a specific kind of Task. That is a big change though, as much depends on val->definition().

Or, can Fusion be a subclass of Expr or Task?

Yeah I would think our current notion of "Fusion" would be a type of task, maybe renamed as "Program" since fusion of CUDA kernels only happens within Segments. Anyway, aside from naming, I think it would be clean to unify the concepts. The only reason I didn't propose that straight away is that it's such a big change. Same for Expr, which resides at the other end of the spectrum.

csarofeen commented 5 months ago

Having nested containers of IR seems critical for so many applications! I used to think we should:

Disconnect IRContainer from Fusion. Fusion should have an IR Container; it shouldn't be one.
Fusions should be able to share the same IRContainers, so they can refer to eachothers nodes.
Fusions should be able to contain other Fusions as a node.

What this could achieve: I thought with this set of behavior then you could use all the tools as they are, but you could arbitrarily view the entire program as multiple hierarchical views that could be linked because they could share the same nodes for inputs/outputs of expressions.

I figured this way we could view segmented Fusions as a view of the original fusion you're segmenting. That way you could more easily traverse from one view to another view.

I also thought this could be useful to have "reversible" transformations, as the new transformation could avoid destroying the original.

Challenges: A few interesting challenges would be to be able to garbage collect (we could start generating a stupid number of nodes and may want to cleanup based on a set of fusions). Or at least cleanup nodes that aren't associated with any fusion that's still alive.

Consistently modifying the IR, if one fusion is changed but has references to another, are those connections updated automatically somehow? Is there any consistency enforced in the hierarchical scheme?

NVIDIA / Fuser

Task formalism in our IR [Inspired by Online Softmax] #2329

Current status

`Expr`s are the simplest subfusions

Use cases

Partitioning `Fusion`s into tasks

`ExpressionEvaluator` patterns (to replace composite IR nodes)

Lambdas for generalizing reduction/scan

Possible approach

NVIDIA / Fuser

Task formalism in our IR [Inspired by Online Softmax] #2329

Current status

Exprs are the simplest subfusions

Use cases

Partitioning Fusions into tasks

ExpressionEvaluator patterns (to replace composite IR nodes)

Lambdas for generalizing reduction/scan

Possible approach

`Expr`s are the simplest subfusions

Partitioning `Fusion`s into tasks

`ExpressionEvaluator` patterns (to replace composite IR nodes)