Static Groups and Control

rachitnigam commented 1 year ago

This proposal follows up on https://github.com/cucapra/calyx/discussions/1334 to add static groups to the language to enable reasoning about timing. The core abstraction is using a "relative clock" to trigger actions. It additionally deprecates the @static syntax in favor of a static keyword.

Static groups

A static group is declared using the new static keyword:

static group foo {    // Starts at time `t`
    r.in = 10;
    r.write_en = %0;  // Asserted at `t+0`
    foo[done] = %1;   // Asserted at `t+1`
}

The %n syntax is a relative clock. The %0 is the current cycle, %1 is the next cycle, and so on. The %n syntax is only allowed in the static group. The well-formedness constraint for a static group is that its done condition must be a relative clock and all actions within are assumed to complete before the done condition is asserted.

Chaining operations is straightforward:

static group two {
    r.in = 20;
    r.write_en = %0;
    d.in = 10;
    d.write_en = %1;
    two[done] = %2;     // Must be %n
}

This group looks suspiciously similar to a seq of two groups that write values into the registers r and d. This is intentional.

Static Control

Static control operators simply use the static keyword in front of the standard control operator. The well-formedness condition requires that they only contain other static control programs within them. Group enable of a static group is the smallest static control program. Furthermore, they obey all the constraints and guarantees from https://github.com/cucapra/calyx/discussions/1331. The composition principle for static control programs is that they can be compiled into static groups.

Sequences

static group one { r.in = 10; r.write_en = %0; one[done] = %1; }
static group two { a.in = 20; a.write_en = %0; two[done] = %1; }
control { static seq { one; two } /* Both groups must be `static` */ }

The semantics of static control operators are given in terms of a relative clock, i.e., one starts at %0 and two starts at %0 + %(one's delay). This means we can compile the static control program into a static group:

static group compiled {
    r.in = 10;
    r.write_en = %0;
    a.in = 20;
    a.write_en = %1;
    compiled[done] = %2;
}
control { compiled }

In general, the compiled group looks like:

static group compiled {
    one[go] = %[0:one's delay]; // This has concrete numbers in it
    two[go] = %[(one's delay):(one's delay + two's delay)];
    compiled[done] = %(one's delay + two's delay);
}

The special syntax %[n:m] is a signal that is asserted between n and m cycles. During compilation, one's delay etc. are going to be concrete numbers.

Parallel

static group one { r.in = 10; r.write_en = %0; one[done] = %1; }
static group two { a.in = 20; a.write_en = %0; two[done] = %2; }
control { static par { one; two } // Both groups must be `static` }

Semantics is provided the following group:

static group compiled {
    r.in = 10;
    r.write_en = %0;
    a.in = 20;
    a.write_en = %0;
    compiled[done] = %1;
}
control { compiled }

In general the compiled group looks like:

static group compiled {
    one[go] = %[0:one's delay];
    two[go] = %[0:two's delay];
    ...
    compiled[done] = %(max(one's delay, two's delay));
}

Conditional

static if port { tru } else { fal }

Conditionals execute their branches in the same cycle when they begin:

static group compiled {
    // Save the value from the port
    cond.in = port;
    cond.write_en = %0;
    // Use a wire to forward the right value
    w.in = %0 ? port : cond.out;
    // Execute the branches using the wire value
    tru[go] = w.out ? %0;
    fal[go] = !w.out ? %0;
    compiled[done] = %(max(tru's delay, fal's delay));
}

A couple of things are interesting about this encoding:

We need to save the value from the port in the first cycle because we don't if the branches will modify its value.
- This can probably be relaxed when we know that neither branch modifies the port
The end time for the group is the max of the two branches which implicitly encodes the fact that we balance if branches before compiling them.
This encoding does not match how we lower @static if in https://github.com/cucapra/calyx/pull/1338 which allocates different states for each branch and transitions to one or the other path based on the condition. Perhaps this distinction doesn't matter?

Loops

Static loops need to necessaily be bounded. I'm going to define a new control operator repeat n to reflect the fact that static loop without a @bound is meaningless. Maybe we want to eventually actually define this as a control operator (in which case we'd need to handle the case where the body is not static):

@bound(n) static while port { body }

same as:

static repeat n { body }

This probably gets compiled to:

static group compiled {
    body[go] = %[0:n*body's delay];
    compiled[done] = %(n*body's delay);
}

I'm not quite sure if this is correct so would have to think harder.

Lowering to Dynamic Groups

In order to compile a Calyx program, we first lower all static groups to normal groups by reifing the relative clock. For example, the following static group:

static group foo {
    r.in = 10;
    r.write_en = %0;
    foo[done] = %1;
}

Compiles to:

cells {
    fsm = std_reg(4);
}
group foo {
    r.in = 10;
    r.write_en = fsm.out == 0;
    foo[done] = fsm.out == 1;
}
fsm.in = fsm.out == 1 ? 0;
fsm.out = fsm.out != 1 : fsm.out + 1;

The last two assignments are continuous since we need to be able to reset the FSM that provides the relative clock. In a fully static program, ideally we'd generate one static FSM per par thread.

sampsyo commented 1 year ago

Hell yes; this seems great. I really like the introduction of the explicit "relative time" bubble within each group; it is a great way to reflect the philosophy of group isolation while transposing it into a statically-timed domain.

To state something explicitly that is implicit in the above… since every static group has a line like this:

group[done] = %N;

…you can easily get the group's latency (what we currently call @static(N)): it's N, of course. :smiley:

One other obvious thing to make explicit: the time instant %4 is sugar for the time range %[4:4].

On static repeat n loops: maybe this was what you were getting at already, but it seems like we'd want to copy-n-paste the body's assignment n times? That is, keeping it active for longer doesn't seem the same as repeating its effects multiple times.

mikeurbach commented 1 year ago

This is really interesting. I think it will certainly help accurately and precisely capture the kinds of pipelines we've Calyx for in CIRCT!

One question that wasn't immediately obvious:

The proposed <lhs> = %[n:m] construct says to assert <lhs> from relative cycle n through m, correct?

I was initially confused, because normally the right hand side of an assign can be constants, ports, etc. (I forget the exact rules). But this is a pretty specific case (a constant 1 at certain time steps).

Is the idea this special case is the only limited way this can be used? If so I guess there is a rule that the <lhs> must be a single bit? Are there other restrictions?

Also, what is the value of <lhs> outside the timesteps n:m? Is it 0/X? I guess if it is just this specific case, the compiler knows it is 0 outside the time steps where it is 1.

sampsyo commented 1 year ago

Good questions, @mikeurbach. @rachitnigam is of course the authority here, but I can give you my gut reactions and get corrected down the line:

Yep, the new %[n:m] construct would be a new kind of thing that can go on the RHS of assignments, adding to constants and ports. It is the new invention here to cope with static timing.
Yeah, I suppose the LHS would have to be a single bit. (In Calyx terms, %[n:m] expressions would have the type 1, which means they can only be assigned into 1-bit ports.)
Super good question about undefinedness. I would argue that we need it to be 1 for the range n–m and 0 otherwise to make most things make sense. (It would only be undefined when the entire group is inactive.)

mikeurbach commented 1 year ago

Thanks for the explanation @sampsyo, that makes sense to me. The part about the type of %[n:m] expressions really clears it up. Looking forward to seeing where this goes, and getting CIRCT to support the new representation so we can target it from our all-static flow.

EclecticGriffin commented 1 year ago

Misc syntax note from the meeting. The done assignment could possibly be folded into the declaration rather than the assignment syntax

static group foo {    // Starts at time `t`
    ...
    foo[done] = %1;   // Asserted at `t+1`
}

becomes something like (syntax entirely made up)

static group foo<1> {    // Starts at time `t`
    ...
}

sampsyo commented 1 year ago

Indeed; there is something nice about @EclecticGriffin's proposed syntax (or, like, static(1) group foo, which is very similar to our current annotation system) because it doesn't imply that there is an actual signal required to tell the group to stop executing (as there is in a "normal" group).

Two minor things struck me after our synchronous conversation yesterday:

Dynamic Within Static

v1.0 of this system will allow nesting static "islands" within dynamic control programs, which is definitely the most useful thing (it's an enactment of the "globally asynchronous, locally synchronous" (GALS) principle). We don't allow the other way around, for obvious reasons. But @andrewb1999's questions made me think about whether we could, in an extension, do something to allow the "backward" nesting.

Here is a very silly strawperson construct to achieve this. Introduce a limit control construct:

static seq {
  my_static_group;
  static limit(25) {
    my_dynamic_group;
  }
  my_other_static_group;
}

Here, my_dynamic_group can take any amount of time by itself. Wrapping it in limit(25) makes it take a deterministic 25 cycles. It does this by just aborting the execution altogether if it takes any longer than 25 cycles, and idling if it takes less than 25 cycles. Sort of a brute-force thing, but you can kinda imagine semi-contrived situations where you would want this: for example, when you know (but can't prove to the compiler) that accessing a memory takes either 3 or 25 cycles (on a cache hit or a cache miss).

Dynamic `while` Around Static Body

A thread that comes up often, again raised by @andrewb1999, is how we waste a cycle in traditional Calyx while loops:

while cond {
  body;
}

If body is a dynamic group that takes 5 cycles, then the overall loop takes 6 cycles per iteration. We waste one cycle to "realize" that the body has finished executing. This seems sad.

It should be a design goal for this proposal that we don't waste a cycle, i.e., the same loop takes only 5 cycles per iteration, when body is a static group. (Of course, while is always dynamic—the only kind of static loop that makes sense is a repeat loop.) This goal implies that we cannot compile the loop by first "wrapping" body in a dynamic interface and then compiling as normal (which is the normal/universal approach to compiling dynamic islands nested within dynamic control). Instead, we would want our dynamic control compilation pass to have a special case for while loops that wrap static bodies.

rachitnigam commented 1 year ago

@sampsyo one note on a previous comment:

On static repeat n loops: maybe this was what you were getting at already, but it seems like we'd want to copy-n-paste the body's assignment n times? That is, keeping it active for longer doesn't seem the same as repeating its effects multiple times.

Not sure if we want to duplicate the body because that tends to blow up the FSM by a lot. The TDST implement (#1338) is explicitly designed to not duplicate the body for this reason

sampsyo commented 1 year ago

Ah, yes, good point. Just copy-and-paste also seems wrong for that reason! At least for how to actually compile stuff; I hope it works OK as a trivial/inefficient way to compile it (i.e., a "reference semantics").

Stepping back a little to the thing I was reacting to originally: this proposed compilation for repeat n { body } still looks a little weird to me:

static group compiled {
    body[go] = %[0:n*body's delay];
    compiled[done] = %(n*body's delay);
}

Partly because I'm not sure exactly what assigning to body[go] really does when body is a static group, and partly because it doesn't say much about how to repeatedly re-start the relative clock for body.

To state the perhaps-obvious, the way we'd actually want to compile this is to create a counter register. Call it ctr, and it has ceil(log(n)) bits. This is separate from the "relative clock" FSM register for body—call that fsm. Whenever fsm hits body's delay, we increment ctr and reset fsm back to 0. When ctr hits n, the loop is done.

The trouble is that this compilation result seems hard to express as Calyx code. (Whereas the trivial copy-n-paste approach is of course very easy to express.)

mikeurbach commented 1 year ago

This occurred to me today during the CIRCT ODM, but have you looked at TL-X? CIRCT hosted a talk early last year. One thing that stuck out then is the "alignment" operator, which seems spiritually similar to the %[n:m] operator here. The syntax in TL-X has changed through revisions, but it's the >> and << in Section 16.4 here. I think the construct in TL-X is a more general way to access any value across pipeline stages, but seems spiritually similar in the notion of relative time offset. I guess I'm curious if you can draw any inspiration from TL-X as you are thinking about IR and languages for pipelining.

sampsyo commented 1 year ago

Thanks for the pointer; I had not seen this particular feature in TL-*!

rachitnigam commented 1 year ago

This has been implemented in spirit.

calyxir / calyx