calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

Zero-cycle transitions from dynamic to static control #1828

Closed: rachitnigam closed this 3 months ago

rachitnigam commented 10 months ago

To slightly expand on this:

Some of the groups we were initially trying to have (a blocking group, a servicing group, an assert group, and a deassert group) had to be combined due to the time it takes to transition between groups.

I actually think this is a super interesting problem that has implications for static/dynamic Calyx assignments. Ideally, we would like to be able to manage an AXI transaction with a control program like this:

while <more transactions to do> {
  seq {
    assert_ready;  // tell the remote side we are ready for their data
    wait_for_valid;  // wait for them to send us stuff
    par {
    do_transfer;  // actually do things with the data on the bus
      deassert_ready;  // tell them to wait while we process the data
    }
  }
}

The problem is that AXI, of course, is super cycle-timing-sensitive. Remember that one of the core invariants is:

On any cycle when both ready=1 and valid=1, a transaction occurs.

So the remote side can schedule these transaction cycles back-to-back, i.e., 3 transactions can happen in exactly 3 cycles. Most saliently for us, say the remote side just keeps its valid signal high all the time and relies on us to assert/deassert ready to do transfers. Then, if we need to take some time to process a transfer, we have to be absolutely sure we only assert ready for one cycle at a time.
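
For concreteness, here is a hypothetical trace (purely illustrative) where the remote side holds valid high the whole time and we leave ready high too long; a transfer happens on every cycle where both are 1:

cycle:  1  2  3  4  5
valid:  1  1  1  1  1
ready:  0  1  1  1  0
xfer:      *  *  *

Three back-to-back transfers occur in cycles 2-4.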

The problem with dynamic Calyx is that it offers no guarantees about how long it takes to go from wait_for_valid above to deassert_ready. And in practice, it actually does waste at least one cycle, meaning that the remote side sees us do two transfers when we only wanted one.
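
Concretely, suppose we want exactly one transfer and then need a cycle to process it. A hypothetical trace of the failure:

cycle:   1  2  3  4
valid:   1  1  1  1
ready:   0  1  1  0   <- deassert_ready lands one cycle late
xfer:       *  *      <- two transfers where we wanted one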

The big upshot of static Calyx is that it does offer these guarantees, so we would like to be able to just sprinkle static everywhere to get the timing we want. There is a critical problem, however: the wait_for_valid group is fundamentally dynamic. Our "dynamic islands" approach does not support sequencing a dynamic group with a static one with a zero-cycle transition latency.

So the big design question that I think this opens up is: is there some rule we can make up that would allow this kind of sequencing? That is, to allow some version of seq { a ; b }, where a is dynamic and b is static, such that we can guarantee b runs in the same cycle in which a says it is done. It seems hard to do in general, but maybe there is something we can do for restricted cases?

Originally posted by @sampsyo in https://github.com/calyxir/calyx/issues/1820#issuecomment-1866308023

rachitnigam commented 10 months ago

That is, to allow some version of seq { a ; b }, where a is dynamic and b is static, such that we can guarantee b runs in the same cycle in which a says it is done.

The two relevant issues here both have to do with the early-transitions flag for the dynamic compilation pass (tdcc): #662 and #621.

The high-level problem is that, in general, we cannot transition from a to b in the same cycle when generating dynamic FSMs because the done condition of b could be the same as a's. However, note that this problem cannot occur if b is static because it does not have a done condition. This means that it might be possible to always guarantee that transitions from a dynamic group into a static island take zero cycles. @calebmkim & @paili0628 anything I'm missing here?
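
To make the aliasing concrete, here is a hypothetical pair of dynamic groups whose done conditions are literally the same signal (both groups write the same register r):

group a {
  r.in = 32'd1;
  r.write_en = 1'd1;
  a[done] = r.done;
}
group b {
  r.in = 32'd2;
  r.write_en = 1'd1;
  b[done] = r.done;  // same signal as a[done]
}

If the FSM started b in the very cycle that a[done] goes high, it would also see b[done] high and incorrectly treat b as already finished.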

Another question worth asking: can transitions out of static islands be zero-cycle as well? My guess is that the current answer is no because we generate a dynamic wrapper for static islands, which means we adopt the 1-cycle delay of dynamic-to-dynamic transitions.

calebmkim commented 10 months ago

The high-level problem is that, in general, we cannot transition from a to b in the same cycle when generating dynamic FSMs because the done condition of b could be the same as a's. However, note that this problem cannot occur if b is static because it does not have a done condition. This means that it might be possible to always guarantee that transitions from a dynamic group into a static island take zero cycles. @calebmkim & @paili0628 anything I'm missing here?

Yeah, I think this is right. Although to rephrase slightly, we are technically giving b a done condition "under the hood" by adding a wrapper. We just know b's done signal won't be the same as the done signal of a, since we are the ones creating the register for the done signal.
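
As a sketch, the wrapper for a static<n> island s might look something like this (hypothetical names; fsm and sig_reg stand in for compiler-generated cells):

group s_wrapper {
  s[go] = 1'd1;                                // run the static island
  sig_reg.in = 1'd1;
  sig_reg.write_en = fsm.out == n - 1 ? 1'd1;  // set on the island's final cycle
  s_wrapper[done] = sig_reg.out;               // fresh register, so it cannot alias another group's done
}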

Another question worth asking: can transitions out of static islands be zero-cycle as well? My guess is that the current answer is no because we generate a dynamic wrapper for static islands, which means we adopt the 1-cycle delay of dynamic-to-dynamic transitions.

I think this goes back to the previous comment. Suppose we have seq {a; b;} but this time a is static and b is dynamic. We already know b's done condition, and we know what a's done condition will be when we compile it: so if we can guarantee they'll be different, then is it possible to do an early transition?
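
As a sketch, assuming a is static<n> so its finishing cycle is known at compile time, the generated FSM could do something like this (hypothetical fsm register):

// a occupies the FSM while fsm.out < n; start b during a's final cycle
b[go] = fsm.out == n - 1 ? 1'd1;
// b's done condition, which we know differs from a's, then advances the FSM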

One thing I'm realizing is that, if we do this, maybe we should move the wrapping logic from compile-static to tdcc... or somehow have some way for the tdcc pass to know which groups were static islands and which were plain dynamic groups.

rachitnigam commented 10 months ago

so if we can guarantee they'll be different, then is it possible to do an early transition?

This was my thinking too, but this is a fact about dynamic -> dynamic transitions as well. It seems that maybe we can unconditionally guarantee that transitions from dynamic to static will be zero-cycle, which might be enough for what the YXI team needs.

One thing I'm realizing is that, if we do this, maybe we should move the wrapping logic

Yeah, this makes sense. There are a lot of unexplored optimizations here around wrapper generation and transitions. I think these could be high-impact because FSMs are likely the worst offenders in terms of resource blow-up.

sampsyo commented 10 months ago

Really good points here. To summarize, we believe it is possible for the compiler to provide zero-cycle transitions in seq { a ; b } when it can be certain that a[done] & b[done] is UNSAT. There are two things we can do with this information:

  1. Write a freaky pass that tries to prove this theorem and, when it can, somehow gets TDCC to provide such zero-cycle transitions in the FSM logic. This is purely opportunistic and provides no language-level guarantees, so it is not helpful for #1820, but it is still a cool optimization.
  2. Exploit the observation that a[done] & b[done] is guaranteed to be UNSAT when b is a static group to provide some kind of language-level guarantee.

I think one question we'd have to answer for option 2 would be: what does the surface-level, syntactic rule look like that implies this guarantee? It could just be this:

In any dynamic seq, consider every adjacent pair of child statements. (They need not be group activations.) If the first statement in the pair is dynamic and the second is static, the second is guaranteed to start executing during the cycle in which the first becomes done.

I don't like this very much because it means that even a dynamic seq can tie the compiler's hands w/r/t scheduling. We can't do schedule compaction on any seq that contains this adjacent pairing, for example.

An alternative (which would be pretty heavyweight) would be to introduce a new, special control operator just for this purpose.

rachitnigam commented 10 months ago

The opportunistic stuff is an optimization so I won't say much about it. You're right about the trade-offs: no semantic guarantees and therefore not particularly useful for timing-precise reasoning.

I was thinking of a rule of the form "entering a static control program is guaranteed to be zero-cycle". However, I agree with you that this puts constraints on the way dynamic scheduling works. Unfortunately, I don't see a way out of it: if we want to provide guarantees about the timing interactions between dynamic and static control, we'll have to make statements about the scheduling policies for dynamic operators. We can try adding attributes, but that doesn't really address the problem.

sampsyo commented 10 months ago

I suppose my high-level point about the dilemma here is that we probably want to support both modes. That is, seq { a ; b } should keep its current guarantees and remain schedulable, for the general case where the code is not timing-sensitive and you just want the computation done as quickly as possible. But perhaps we want to invent something else that is more in the vein of static seq, in the sense that it is useful for carefully orchestrating time, but that does not require all its children to be static.

rachitnigam commented 10 months ago

How do you feel about making the "slow transition" behavior opt-in, in the same way we've been discussing adding a new attribute to make static control programs compactable? That is, in the normal case, we will generate fast transitions from dynamic to static sub-programs, but if you give us an attribute to say "I don't care", then we are allowed to generate slow transitions?

I'm hesitant about introducing a new control operator with these semantics because then we would have three different kinds of sequencing, and I don't know what the guideline for a "good default" would be. If we already know that the good default is "fast transition", then let's make seq default to that?

sampsyo commented 10 months ago

It could work! Yeah, I think what's confusing here is that I also don't know what the right default is: schedulable vs. save-a-cycle. There is a reasonable argument to be made that "schedulable" is the right default because it reflects "normal" computations that don't need to interface with the outside world. Especially if paired with a best-effort optimization that saves a cycle in most cases anyway?

rachitnigam commented 10 months ago

I think defining which computations potentially interact with the outside world will become increasingly difficult, especially in the presence of ref cells. If that is the case, then we should make the precise behavior ("save a cycle") the default and allow programs to opt out of it.

I would also reframe "save a cycle" to something like "dynamic to static transition guarantee" to imply that this isn't just an optimization; it's a guarantee of the compiler that user-level programs can rely on.

calebmkim commented 10 months ago

During a synchronous discussion, we were trying to generalize the problem expressed in the original post: in other words, come up with other situations in which we need cycle-precise control behavior across the static/dynamic boundary.

For example, one scenario we were thinking of was the following. If we had something like:

while port { 
  seq { 
    // static child 
    // dynamic child
  }
} 

What if we need the backwards edge from the dynamic child to the static child to be a 0-cycle transition?

@nathanielnrn can you think of any other scenarios in the yxi project where you would need 0-cycle transition behavior, besides the scenario we have already identified?

nathanielnrn commented 10 months ago

It seems to me like the dynamic -> static description covers the cases we are encountering in AXI.

rachitnigam commented 10 months ago

@nathanielnrn I think Caleb is asking if there are specific, concrete examples you can think of. For example, one possible example is a burst transfer where, once the valid signal is asserted, the component needs to start reading values on the next cycle, for n cycles, where n is the burst size:

seq {
  wait_for_valid;
  repeat N { 
    par { incr_idx; write_inp }
  }
}

nathanielnrn commented 10 months ago

@calebmkim and I spoke synchronously about this. The scenarios we came up with are dynamic -> static transitions like those discussed above, as well as a likely-useful optimization for the transition from the end of a while loop's body back to its start.

A reduced example would be something like

while perform_writes.out {
  block_transfer;
  perform_write_transfer; // a group like this would require a 0-cycle dynamic -> static transition
}

If we had some guarantee on the transition from the end of perform_write_transfer to the start of block_transfer between loop iterations, we could likely save on some boilerplate assignments/groups that exist in the current implementation to guarantee that write_transfer handshakes happen for a single cycle. This would almost certainly increase the throughput of our interface down the road.

sampsyo commented 10 months ago

This is great; thanks for the synopsis. And interesting that this example contains both static->dynamic and dynamic->static transitions!

rachitnigam commented 9 months ago

Another bump on this issue because I'd like to make progress on it. @calebmkim thoughts on taking the next open slot in the Calyx meeting to discuss this issue and maybe propose a concrete implementation strategy?

calebmkim commented 9 months ago

Just took the next open slot

calebmkim commented 9 months ago

Based on the synchronous discussion today, it seems like we probably do not want to introduce a zero-cycle transition as a guarantee of the language, as it would infect the scheduling of seqs and would prevent things like compaction.

The option that we had in mind was to add a minimally expressive extension as an attribute: we can call it @qt for quick-transition. It can be attached to a dynamic seq and it will guarantee zero-cycle transitions between each child of the seq. To support this guarantee, we require that any @qt seqs alternate between static and dynamic children. In other words, if we have

@qt seq {
  A; 
  B; 
  C; 
  D; 
} 

A->B->C->D must either be Static->Dynamic->Static->Dynamic or Dynamic->Static->Dynamic->Static.

From the synchronous discussion, it seems like the specificity is the point. We normally don't want to use this, but if you have specific timing constraints that need to be met, we offer this special attribute.

@rachitnigam do you have thoughts?

anshumanmohan commented 9 months ago

We are limiting this to the alternating style because any non-alternating case can just be crafted using the alternating style (above) plus calls to appropriate control operators.

Say you need:

@qt seq {
  s1;
  s2;
  d1;
  d2;
  s3
}

where s*s are static and d*s are dynamic.

Too bad, you must write:

@qt seq {
  static seq { s1; s2; };
  seq { d1; d2; };
  s3
}

This is neat because it crisps up what the promise of @qt is: my immediate children will enjoy quick transitions; I don't know what will happen below them.

anshumanmohan commented 9 months ago

Perhaps silly, but do we need to do anything to prevent:

@qt seq {
  d1;
  empty_static_group_that_takes_time_0;
  d2
}

?

Are static groups with latency 0 even allowed?

rachitnigam commented 9 months ago

Zero-cycle groups, static or otherwise, are not allowed. If a program provides a zero-cycle group, it leads to undefined behavior (miscompilation, specifically).

rachitnigam commented 9 months ago

I remain iffy about attributes as a mechanism to add correctness guarantees to the program. Specifically, if the compiler does not guarantee that it'll preserve or even respect attributes, I feel that the default behavior should be the one that preserves correctness.

it would infect the scheduling of seqs and would prevent things like compaction

I'm not convinced about this yet. Compaction only works for @promote blocks (using the new parlance from #1896). Can we write down a concrete program that exemplifies this? If we can't, I will continue advocating for language-level semantics for early transitions.

rachitnigam commented 9 months ago

CC @sampsyo on this last comment because I think we disagree on this point and maybe I'm missing something obvious

calebmkim commented 9 months ago

@sampsyo can also give his own answer, but here is a minimal example of where I think the language-level guarantee would affect our compaction abilities.

Suppose we have the following seq:

@promotable(n + m) seq {
  @promotable(n) A; 
  static<m> par {.. } 
} 

In this case, compaction would not be allowed to parallelize the execution of A and the static<m> par.

If the seq had more children, then we would be able to compact other parts of it, but we'd have to think carefully about what the language-level guarantees are and what types of compaction they allow.

rachitnigam commented 9 months ago

In this case, compaction would not be allowed to parallelize the execution of A and the static par.

So this is what I don't understand. If A does not have any dependency on par { ... }, the zero-cycle transition semantics should allow for compaction. IMO the rule is: seq guarantees that the static child will observe the dynamic child within zero cycles. If compaction can occur, then that means the static child cannot observe the dynamic predecessor anyway, because there is no dependence.

Does that make sense? The rule is only meaningful when there is a dependency between children. If there is no dependency, the rule does not matter.
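
For example (a hypothetical program; assume the par reads nothing that A writes):

@promotable(n + m) seq {
  @promotable(n) A;    // writes only r1
  static<m> par {..}   // reads only r2, so it can never observe A
}

Since the static child can never observe A, compacting the two to run in parallel cannot violate the zero-cycle rule; the guarantee is vacuously satisfied.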

calebmkim commented 9 months ago

Ah, I see what you're saying.

I still think there are some tricky cases here, though. For example, suppose we have:

@promotable(20) seq {
  @promotable(10) A; 
  @promotable(5) B; 
  static<5> par {..}; // par depends on *both* A and B
} 

In this case, compaction would schedule them like this:

@static<15> par {
  static<10> par {A; B;}
  static<5> par {..};  // par should start after a 10-cycle delay.
} 

This would break the 0-cycle transition: B would execute during [0, 5), while the par would not start until cycle 10.

One "solution" would be to treat the static-dynamic pair as a "package" which must be thought of as a single entity when performing compaction, but a) this still limits compaction a bit and b) we still need to think about what the language-level guarantee would be.

Another "solution" could be that the 0-cycle transition only occurs if the length of the seq is 2, but this seems like a weird rule to have.

rachitnigam commented 9 months ago

Ah, this is a good example! Note that an as-late-as-possible (ALAP) schedule would still be able to satisfy the requirement of a zero-cycle transition from B -> par { ... }, but we could not pick any other start time for B. Now we need to decide whether that's an acceptable restriction.

The relevant question is: do all schedules (ASAP, ALAP, etc.) have about the same hardware cost when performing compaction? If yes, then maybe the freedom of scheduling choices is irrelevant; if not, let's see why.

sampsyo commented 9 months ago

Truly awesome; thanks for making this concrete, @calebmkim. That issue where the compiler would like to add latency between B and the static par is exactly the situation I was thinking of, but I never succeeded in writing it down this explicitly.

@rachitnigam, in this example, you're right that an ALAP schedule could still respect 0-cycle transitions from dynamic to static. I think, however, that there is a complementary counter-example for ALAP if we want to support the opposite direction too (0-cycle transitions from static to dynamic). Slightly modifying @calebmkim's example:

@promotable(20) seq {
  static<5> par {..}; // both A and B depend on this `par`
  @promotable(10) A; 
  @promotable(5) B; 
} 

Now ALAP violates the 0-cycle guarantee and you want ASAP in order to respect it. To make things worse, consider a static/dynamic/static "sandwich":

@promotable(25) seq {
  static<5> par {..}; // both A and B depend on this `par`
  @promotable(10) A; 
  @promotable(5) B; 
  static<5> par {..}; // this other `par` depends on both A and B
} 

Now, I think, there is no choice of when to run B that suffices to respect both directions of 0-cycle transition (assuming you want to parallelize A & B).
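
Spelling out the sandwich (latencies 5, 10, 5, 5, with A and B compacted to run in parallel):

// par1 runs [0, 5); A depends on par1, so A runs [5, 15)
// par2 depends on A, so the earliest par2 can run is [15, 20)
// 0-cycle out of par1: every dependent must start at cycle 5  => B runs [5, 10)
// 0-cycle into par2: every dependency must end at cycle 15    => B runs [10, 15)
// no single start time for B satisfies both, so one transition must be slow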

To return from specifics to generalities: it seems like it is very valuable for a rescheduling optimizer to have lots of freedom to insert delays between group activations, not just to delete them. Inserting them can sometimes (paradoxically?) let you reduce delays elsewhere, and it can certainly allow area/latency trade-offs by reducing parallelism. So it would be a shame to have seq restrict this ability by default.

rachitnigam commented 9 months ago

@sampsyo okay, these examples are a nail in the coffin of the whole "language-level guarantee" pitch I was trying to go for. I think the attribute approach is fine for now and we can eventually consider a fast seq (fast as the new keyword) that attempts to give the guarantees.

sampsyo commented 9 months ago

Cool. That's a good plan… I share your aesthetic discomfort with semantics-bearing attributes, so I dig the idea of starting with one just because it's simpler to implement and then probably turning that into a proper qualifier. (I like the spelling fast!)

rachitnigam commented 8 months ago

Okay, I'm going to mark this as "available" then! The plan is to implement the @fast attribute for now and eventually promote it to the fast keyword!

ethanuppal commented 5 months ago

Who should I discuss the syntax/semantics of @fast with before I start working on it?

rachitnigam commented 5 months ago

The issue describes the high-level stuff. Are there specific questions you have?

calebmkim commented 5 months ago

Hi Ethan-- thanks for picking this up.

The most relevant comments regarding implementation are my comment on Feb 5 and Anshuman's comment on Feb 6.

You will probably have to edit the tdcc (top_down_compile_control.rs) pass to implement this. It's quite a big file-- I've found it useful to run passes individually on files to help get a high-level understanding of what's going on in a pass. So for you, it might be helpful to run cargo run -- <filename> -p tdcc on some random Calyx design to get a better idea of what's going on in the pass (tdcc might expect you to have run some other passes before it, like compile-invoke, so it might be something like cargo run -- <filename> -p compile-invoke -p any-other-passes -p tdcc).

ethanuppal commented 5 months ago

Thanks, Caleb! Are you free today to discuss this?

calebmkim commented 5 months ago

Sure-- I'm pretty free this afternoon; we can DM to find a time.