That is, to allow some version of seq { a ; b } where a is dynamic and b is static, where we can guarantee that b runs in the same cycle where a says it is done.
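Concretely, the shape in question is something like the following (hypothetical group names, with the static/dynamic split marked in comments in the same style as the examples later in this thread):

seq {
  recv;     // "a": a dynamic group, done whenever its done signal fires
  respond;  // "b": a static group, which should begin in the same cycle that recv signals done
}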
The two relevant issues here both have to do with the early-transitions flag for the dynamic compilation pass (tdcc): #662 and #621.
The high-level problem is that, in general, we cannot transition from a to b in the same cycle when generating dynamic FSMs because the done condition of b could be the same as a's. However, note that this problem cannot occur if b is static because it does not have a done condition. This means that it might be possible to always guarantee that transitions from a dynamic group into a static island take zero cycles. @calebmkim & @paili0628 anything I'm missing here?
Another question worth asking is whether transitions out of static islands can be zero-cycle as well. My guess is that the current answer is no, because we generate a dynamic wrapper for static islands, which means we inherit the 1-cycle delay of dynamic-to-dynamic transitions.
The high-level problem is that, in general, we cannot transition from a to b in the same cycle when generating dynamic FSMs because the done condition of b could be the same as a's. However, note that this problem cannot occur if b is static because it does not have a done condition. This means that it might be possible to always guarantee that transitions from a dynamic group into a static island take zero cycles. @calebmkim & @paili0628 anything I'm missing here?
Yeah, I think this is right. Although to rephrase slightly, we are technically giving b a done condition "under the hood" by adding a wrapper. We just know b's done signal won't be the same as the done signal of a, since we are the ones creating the register for the done signal.
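As a rough, hand-written sketch of that idea (this is not the literal output of compile-static; the register b_done_reg and the signal b_finished are made up for illustration):

group b_wrapper {
  // keep driving the static island until the compiler-created register is set
  b[go] = !b_done_reg.out ? 1'd1;
  // b_finished stands in for however the static FSM signals that b's latency has elapsed
  b_done_reg.in = b_finished ? 1'd1;
  b_done_reg.write_en = b_finished ? 1'd1;
  // the wrapper's done comes from a register the compiler itself created,
  // so it can never be the same signal as a's done
  b_wrapper[done] = b_done_reg.out;
}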
Another question worth asking is whether transitions out of static islands can be zero-cycle as well. My guess is that the current answer is no, because we generate a dynamic wrapper for static islands, which means we inherit the 1-cycle delay of dynamic-to-dynamic transitions.
I think this goes back to the previous comment. Suppose we have seq {a; b;} but this time a is static and b is dynamic. We already know b's done condition, and we know what a's done condition will be when we compile it: so if we can guarantee they'll be different, then is it possible to do an early transition?
One thing I'm realizing is that, if we do this, maybe we should move the wrapping logic from compile-static to tdcc... or somehow have some way for the tdcc pass to know which groups were static islands and which were plain dynamic groups.
so if we can guarantee they'll be different, then is it possible to do an early transition?
This was my thinking too, but this is a fact about dynamic -> dynamic transitions as well. It seems that maybe we can unconditionally guarantee that transitions from dynamic to static will be zero-cycle, which might be enough for what the YXI team needs.
One thing I'm realizing is that, if we do this, maybe we should move the wrapping logic
Yeah, this makes sense. There are a lot of optimization opportunities here around wrapper generation and transitions that we haven't explored. I think these could be high-impact because FSMs are likely the worst offenders in terms of resource blow-up.
Really good points here. To summarize, we believe it is possible for the compiler to provide zero-cycle transitions in seq { a ; b } when it can be certain that a[done] & b[done] is UNSAT. There are two things we can do with this information:
1. Opportunistically detect when a[done] & b[done] is UNSAT and perform the early transition as an optimization.
2. Use the fact that a[done] & b[done] is guaranteed to be UNSAT when b is a static group to provide some kind of language-level guarantee.
I think one question we'd have to answer for option 2 would be: what does the surface-level, syntactic rule look like that implies this guarantee? It could just be this:
In any dynamic seq, consider every adjacent pair of child statements. (They need not be group activations.) If the first statement in the pair is dynamic and the second is static, it is guaranteed to start executing during the cycle that the first statement becomes done.
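For instance (hypothetical groups, with the static/dynamic split marked in comments), the rule would promise the first transition below but say nothing about the second:

seq {
  recv;     // dynamic
  respond;  // static: guaranteed to start in the cycle that recv becomes done
  cleanup;  // dynamic: no guarantee about when it starts relative to respond
}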
I don't like this very much because it means that even a dynamic seq can tie the compiler's hands w/r/t scheduling. We can't do schedule compaction on any seq that contains this adjacent pairing, for example.
An alternative (which would be pretty heavyweight) would be to introduce a new, special control operator just for this purpose.
The opportunistic stuff is an optimization so I won't say much about it. You're right about the trade-offs: no semantic guarantees and therefore not particularly useful for timing-precise reasoning.
I was thinking of a rule of the form "entering a static control program is guaranteed to be zero-cycle". However, I agree with you that this is putting constraints on the way dynamic scheduling works. Unfortunately, I don't see a way out of it: if we want to provide guarantees about the timing interactions between dynamic and static control, we'll have to make statements about the scheduling policies for dynamic operators. We can try adding attributes but that doesn't really address the problem.
I suppose my high-level point about the dilemma here is that we probably want to support both modes. That is, seq { a ; b } should keep its current guarantees and remain schedulable, for the general case where the code is not timing-sensitive and you just want the computation done as quickly as possible. But perhaps we want to invent something else that is more in the vein of static seq, in the sense that it is useful for carefully orchestrating time, but that does not require all its children to be static.
How do you feel about making the "slow transition" behavior opt-in, in the same way we've been discussing adding a new attribute to make static control programs compactable? That is, in the normal case, we will generate fast transitions from dynamic to static sub-programs, but if you give us an attribute to say "I don't care", then we are allowed to generate slow transitions?
I'm hesitant to introduce a new control operator with these semantics because then we have three different kinds of sequencing, and I don't know what the guideline for picking the "good default" will be. If we already know that the good default is "fast transition", then let's make seq default to that?
It could work! Yeah, I think what's confusing here is that I also don't know what the right default is: schedulable vs. save-a-cycle. There is a reasonable argument to be made that "schedulable" is the right default because it reflects "normal" computations that don't need to interface with the outside world. Especially if paired with a best-effort optimization that saves a cycle in most cases anyway?
I think defining which computations potentially interact with the outside world will become increasingly difficult, especially in the presence of ref cells. If that is the case, then we should make the precise behavior ("save a cycle") the default and allow programs to opt out of it.
I would also reframe "save a cycle" to something like "dynamic to static transition guarantee" to imply that this isn't just an optimization; it's a guarantee of the compiler that user-level programs can rely on.
During a synchronous discussion, we were trying to generalize the problem expressed in the original post: in other words, come up with other situations in which we need cycle-precise behavior for control between the static/dynamic boundary.
For example, one scenario we were thinking of was the following. If we had something like:
while port {
  seq {
    // static child
    // dynamic child
  }
}
What if we need the backwards edge from dynamic child -> static child to be a 0-cycle transition?
@nathanielnrn can you think of any other scenarios in the yxi project where you would need 0 cycle transition behavior, besides the scenario we have already identified?
It seems to me like the dynamic -> static description covers the cases we are encountering in AXI.
@nathanielnrn I think Caleb is asking if there are specific, concrete examples you can think of. For example, a burst transfer where, once the valid signal is asserted, the component needs to start reading values starting the next cycle for n cycles, where n is the burst size:
seq {
  wait_for_valid;
  repeat N {
    par { incr_idx; write_inp }
  }
}
@calebmkim and I spoke synchronously about this. The scenarios we came up with are dynamic -> static transitions like those discussed above, as well as a likely-useful optimization for the transition from the end of a while loop body back to its start.
A reduced example would be something like
while perform_writes.out {
  block_transfer;
  perform_write_transfer; // a group like this would require a 0-cycle dynamic -> static transition
}
If we had some guarantee on the transition from the end of perform_write_transfer to the start of block_transfer between while iterations, we could likely save on some boilerplate assignments/groups that exist in the current implementation to guarantee that write_transfer handshakes happen for a single cycle. This would almost definitely increase the throughput of our interface down the road.
This is great; thanks for the synopsis. And interesting that this example contains both static->dynamic and dynamic->static transitions!
Another bump because I'd like to make progress on this. @calebmkim, thoughts on taking the next open slot in the Calyx meeting to discuss this issue and maybe propose a concrete implementation strategy?
Just took the next open slot
Based on the synchronous discussion today, it seems like we probably do not want to introduce a zero-cycle transition as a guarantee of the language, as it would infect the scheduling of seqs and would prevent things like compaction.
The option that we had in mind was to add a minimally expressive extension as an attribute: call it @qt, for quick-transition. It can be attached to a dynamic seq, and it will guarantee zero-cycle transitions between consecutive children of the seq. To support this guarantee, we require that the children of any @qt seq alternate between static and dynamic. In other words, if we have
@qt seq {
  A;
  B;
  C;
  D;
}
A->B->C->D must either be Static->Dynamic->Static->Dynamic or Dynamic->Static->Dynamic->Static.
From the synchronous discussion, it seems like the specificity is the point. We normally don't want to use this, but in case you have some specific timing constraints that need to be met, we will offer this special attribute to you.
@rachitnigam do you have thoughts?
We are limiting this to the alternating style because any non-alternating case can just be crafted using the alternating style (above) plus calls to appropriate control operators.
Say you need:
@qt seq {
  s1;
  s2;
  d1;
  d2;
  s3
}
where the s*s are static and the d*s are dynamic.
Too bad, you must write:
@qt seq {
  static seq { s1; s2; };
  seq { d1; d2; };
  s3
}
This is neat because it crisps up what the promise of @qt is: my immediate children will enjoy quick transitions; I don't know what will happen below them.
Perhaps silly, but do we need to do anything to prevent:
@qt seq {
  d1;
  empty_static_group_that_takes_time_0;
  d2
}
?
Are static groups with latency 0 even allowed?
Zero-cycle groups, static or otherwise, are not allowed. If a program provides a zero-cycle group, it leads to undefined behavior (miscompilation, specifically).
I remain iffy about attributes as a mechanism to add correctness guarantees to the program. Specifically, if the compiler does not guarantee that it'll preserve or even respect attributes, I feel that the default behavior should be the one that preserves correctness.
it would infect the scheduling of seqs and would prevent things like compaction
I'm not convinced about this yet. Compaction only works for @promote blocks (using the new parlance from #1896). Can we write down a concrete program that exemplifies this? If we can't, I will continue advocating for language-level semantics for early transitions.
CC @sampsyo on this last comment because I think we disagree on this point and maybe I'm missing something obvious
@sampsyo can also give his own answer, but here is a minimal example of where I think the language-level guarantee would affect our compaction abilities. Suppose we have the following seq:
@promotable(n + m) seq {
  @promotable(n) A;
  static<m> par {..}
}
In this case, compaction would not be allowed to parallelize the execution of A and the static<m> par.
If the seq had more children, then we would be able to compact other parts of it, but we'd have to think carefully about what the language-level guarantees are and what types of compaction they allow.
In this case, compaction would not be allowed to parallelize the execution of A and the static par.
So this is what I don't understand. If A does not have any dependency on par { ... }, the zero-cycle transition semantics should allow for compaction. IMO the rule is: seq guarantees that the static child will observe the dynamic child within zero cycles. If compaction can occur, then that means the static child cannot observe the dynamic predecessor anyway, because there is no dependence.
Does that make sense? The rule is only meaningful when there is a dependency between children. If there is no dependency, the rule does not matter.
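For example (hypothetical groups, using the attribute style from the example above): if A and the static island touch disjoint state, compacting them is indistinguishable from a zero-cycle transition, so the rule would not constrain the schedule at all:

@promotable(n + m) seq {
  @promotable(n) A;     // only writes register r0
  static<m> par {..}    // only reads register r1, so it can never observe A's effects
}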
Ah, I see what you're saying.
I still think there are some tricky cases here, though. For example, suppose we have:
@promotable(20) seq {
  @promotable(10) A;
  @promotable(5) B;
  static<5> par {..}; // par depends on *both* A and B
}
In this case, compaction would schedule them like this:
@static<15> par {
  static<10> par {A; B;}
  static<5> par {..}; // par should start after a 10 cycle delay.
}
Which would break the 0-cycle transition. B would execute from [0,5) and the par would start on cycle 10.
One "solution" would be to treat the static-dynamic pair as a "package" which must be thought of as a single entity when performing compaction, but a) this still limits compaction a bit and b) we still need to think about what the language-level guarantee would be.
Another "solution" could be that the 0-cycle transition only occurs if the length of the seq
is 2, but this seems like a weird rule to have.
Ah, this is a good example! Note that an as-late-as-possible (ALAP) schedule will still be able to satisfy the requirement of a zero-cycle transition from B -> par { ... }, but we cannot pick any other start time for B anymore. Now we need to decide if that's an okay restriction or not.
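Spelling that out in the same informal notation as the compacted schedule above (the intended start times live in the comments, not in real syntax), the ALAP version would be something like:

@static<15> par {
  A;                   // runs [0, 10)
  B;                   // ALAP: delayed to run [5, 10) instead of [0, 5)
  static<5> par {..};  // starts at cycle 10, immediately after B, so B -> par is a 0-cycle transition
}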
The relevant question is: are all schedules (ASAP, ALAP, etc.) about the same hardware cost when performing compaction? If yes, then maybe the freedom of scheduling choices is irrelevant; if not, let's see why.
Truly awesome; thanks for making this concrete, @calebmkim. That issue where the compiler would like to add latency between B and the static par is exactly the situation I was thinking of, but I never succeeded in writing it down this explicitly.
@rachitnigam, in this example, you're right that an ALAP schedule could still respect 0-cycle transitions from dynamic to static. I think, however, that there is a complementary counter-example for ALAP if we want to support the opposite direction too (0-cycle transitions from static to dynamic). Slightly modifying @calebmkim's example:
@promotable(20) seq {
  static<5> par {..}; // both A and B depend on this `par`
  @promotable(10) A;
  @promotable(5) B;
}
Now ALAP violates the 0-cycle guarantee and you want ASAP in order to respect it. To make things worse, consider a static/dynamic/static "sandwich":
@promotable(25) seq {
  static<5> par {..}; // both A and B depend on this `par`
  @promotable(10) A;
  @promotable(5) B;
  static<5> par {..}; // this other `par` depends on both A and B
}
Now, I think, there is no choice of when to run B that suffices to respect both directions of 0-cycle transition (assuming you want to parallelize A & B).
To return from specifics to generalities: it seems like it is very valuable for a rescheduling optimizer to have lots of freedom to insert delays between group activations, not just to delete them. Inserting them can sometimes (paradoxically?) let you reduce delays elsewhere, and it can certainly allow area/latency trade-offs by reducing parallelism. So it would be a shame to have seq restrict this ability by default.
@sampsyo okay, these examples are a nail in the coffin of the whole "language-level guarantee" pitch I was trying to go for. I think the attribute approach is fine for now and we can eventually consider a fast seq (fast as the new keyword) that attempts to give the guarantees.
Cool. That's a good plan… I share your aesthetic discomfort with semantics-bearing attributes, so I dig the idea of starting with one just because it's simpler to implement and then probably turning that into a proper qualifier. (I like the spelling fast!)
Okay, I'm going to mark this as "available" then! The plan is to implement the @fast attribute for now and eventually promote it to the fast keyword!
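For reference, the end state would presumably look something like this (a sketch of the proposed surface syntax, not something the compiler accepts today; the group names are hypothetical):

// near term: attribute form on a dynamic seq
@fast seq {
  wait_for_valid;  // dynamic
  respond;         // static
}

// eventually: a dedicated qualifier
fast seq {
  wait_for_valid;
  respond;
}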
Who should I discuss the syntax/semantics of @fast with before I start working on it?
The issue describes the high-level stuff. Are there specific questions you have?
Hi Ethan-- thanks for picking this up.
The most relevant comments regarding implementation are my comment on Feb 5 and Anshuman's comment on Feb 6.
You will probably have to edit the tdcc pass (top_down_compile_control.rs) to implement this.
It's quite a big file-- I've found it useful to individually run passes on files to help get a high-level understanding of what's going on in the pass. So for you, it might be helpful to run cargo run <filename> -p tdcc on some random Calyx design to get a better idea of what's going on in the pass (tdcc might expect you to have run some other passes before it, like compile-invoke, so it might be something like cargo run <filename> -p compile-invoke -p any-other-passes -p tdcc).
Thanks, Caleb! Are you free today to discuss this?
Sure-- I'm pretty free this afternoon, we can DM to find a time
To slightly expand on this:
I actually think this is a super interesting problem that has implications for static/dynamic Calyx assignments. Artistically, we would like to be able to manage an AXI transaction with a control program like this:
The problem is that AXI, of course, is super cycle-timing-sensitive. Remember that one of the core invariants is:
So the remote side can schedule these transaction cycles back-to-back, i.e., 3 transactions can happen in exactly 3 cycles. Most saliently for us, say the remote side just keeps its valid signal high all the time and relies on us to assert/deassert ready to do transfers. Then, if we need to take some time to process a transfer, we have to be absolutely sure we only assert ready for one cycle at a time.

The problem with dynamic Calyx is that it offers no guarantees about how long it takes to go from wait_for_valid above to deassert_ready. And in practice, it actually does waste at least one cycle, meaning that the remote side sees us do two transfers when we only wanted one.

The big upshot of static Calyx is that it does offer these guarantees, so we would like to be able to just sprinkle static everywhere and get the timing we want. There is a critical problem, however: the wait_for_valid group is fundamentally dynamic. Our "dynamic islands" approach does not support sequencing a dynamic group with a static one with a zero-cycle transition latency.

So the big design question that I think this opens up is: is there some rule we can make up that would allow this kind of sequencing? That is, to allow some version of seq { a ; b } where a is dynamic and b is static, where we can guarantee that b runs in the same cycle where a says it is done. It seems hard to do in general but maybe there is something we can do for restricted cases?

Originally posted by @sampsyo in https://github.com/calyxir/calyx/issues/1820#issuecomment-1866308023