afilinski closed this issue 6 years ago
In the latest update, I've added work/step cost to the streaming interpreter (although there is still some inaccuracy or inconsistency in this model). Regarding your last small bullet:
maybe should schedule all non-blocked procs for execution in same step?
As we've discussed before, this means that we'll measure the cost on a MIMD machine. Do you think we then also need to modify the high-level cost model we already have for SNESL, as well as the low-level one for the fully-eager interpreter, so that all of these models are consistent and comparable?
No, I think that, for now, we can leave the high-level and non-streaming interpreters alone, and maybe revisit them once we have recursion working completely in the streaming interpreter, and also have some accounting for space usage. (Because in the high-level model, the expected space usage of e_1+e_2 also depends on whether the e_i are nominally evaluated in parallel or sequentially, if they both use a lot of temporary space, e.g., if they are both reducePlus(&n_i) for some large n_1 and n_2.) As for the eager interpreter, I expect that it can eventually be completely subsumed by the streaming one, where one lets the buffer size be unbounded (or sets it to some ridiculously high value, such as 10^9).
Even if you implement the two-phase execution model in the streaming interpreter (i.e., in each round, first identify all non-blocked processes, and then run them all), you still have the choice of whether to count the individual process invocations in the second phase as separate steps (SIMD) or one step (MIMD) for the cost model, possibly determined by a boolean flag. I'm just thinking that the two-phase model may be preferable because (1) it's easier to adapt to a practically realistic, non-polling scheduler, where you don't need to re-check all processes on every scheduling round, but have them automatically marked as runnable as soon as some other process modifies the buffer they are blocked on; and more importantly (2) it doesn't depend on some arbitrary scheduling order, so it may be easier to analyze, especially in regards to the emptiness test in WithCtrl. (Note that the "stealing policy", i.e., how to break a non-fatal deadlock, will still introduce some arbitrariness, though.)
One other thing: you are currently calculating the ratios between the high-level and low-level costs. One more number that would be useful to have is S_L / (W_H/bs + S_H). If we take the buffer size to represent the number of processors (so each low-level step can happen in constant time), this compares the actual running time of the compiled code on bs processors against the running time predicted by the high-level model and Brent's lemma. (In fact, maybe you should simply report this number instead of the current S_L / S_H, because that one is meaningless when bs is not near-infinite.)
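For concreteness, the suggested ratio could be computed as follows (a sketch only; the names wH, sH, sL, bs are mine, not from the repo, and stand for the measured high-level work and steps, the low-level steps, and the buffer size):

```haskell
-- Brent's lemma predicts a running time of W_H/bs + S_H on bs processors.
-- brentRatio compares the measured low-level step count S_L against it.
brentRatio :: Double  -- high-level work W_H
           -> Double  -- high-level steps S_H
           -> Double  -- low-level steps S_L
           -> Double  -- buffer size bs, taken as the processor count
           -> Double
brentRatio wH sH sL bs = sL / (wH / bs + sH)

-- Round to one digit after the decimal point for reporting, as suggested.
report1dp :: Double -> Double
report1dp x = fromIntegral (round (10 * x) :: Int) / 10
```

For example, with W_H = 100, S_H = 10, S_L = 60 and bs = 10, the predicted time is 100/10 + 10 = 20 and the reported ratio is 60/20 = 3.0.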
And in any case, it may be useful to report the ratios with one digit after the decimal point, e.g., 4.0 instead of just 4; but more than one digit is probably not relevant, since the number seems to generally fall in the range 3-6 or so.
I'm just thinking that the two-phase model may be preferable because (1) it's easier to adapt to a practically realistic, non-polling scheduler, ... (2) it doesn't depend on some arbitrary scheduling order, so it may be easier to analyze, especially in regards to the emptiness test in WithCtrl.
I think I understand the basic meaning, but, based on my current knowledge, I can't quite picture the whole expected streaming interpreter (or the entire execution model of SNESL). For example, why is it easier to adapt to a practically realistic, non-polling scheduler, and what do you mean by a practically realistic scheduler? Regarding a real implementation environment for SNESL, I have no idea how much work we can do with scheduling, or what kinds of problems/limitations/possibilities there are for the scheduler.
So could you explain this in more detail, or show me some clearer goals for both the short and the long run (maybe in this week's meeting)?
Btw, as I understand it, the scheduling strategy Frederik used, which he called loop scheduling in his PhD thesis, looks generally similar to the one I just implemented.
Loop scheduling executes the transducers in succession from the first to the last, starting over until the network has reached completion. The transducers fire regardless of how full their buffers are. Consequently, a transducer may fire on almost empty input, which is sub-optimal in theory. In particular, it breaks the step part of the work/step cost model.
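For comparison, loop scheduling as described could be sketched roughly like this (a toy abstraction with hypothetical fire/done operations, not Frederik's actual implementation):

```haskell
-- Loop scheduling over an abstract network: fire each not-yet-finished
-- transducer in order, from first to last, and start over until the
-- whole network has reached completion.  `fire` is invoked regardless
-- of buffer fullness, mirroring the behavior described above.
loopSchedule :: (t -> t) -> (t -> Bool) -> [t] -> [t]
loopSchedule fire done ts
  | all done ts = ts
  | otherwise   = loopSchedule fire done
                    [ if done t then t else fire t | t <- ts ]
```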
Yes, let's talk about this tomorrow.
As of the latest update (6e3ab620f262b9ac7ad3316f3665dc8fba5ad876), two-phase scheduling has been added to the streaming interpreter. Here are the main related tasks.
Levels
Ctrl and EmptyCtrl have the lowest level, 0; every other node's level is 1 greater than its supplier's level. (If a node has more than one supplier, its level is 1 plus the highest level among its suppliers.)
type Levels = [[SId]] -- level list data structure
For example, for the expression {1,2}, its DAG is as follows:
It's clear that S1, S2 and S3 are independent, so they can be runnable in the same step, and the final level list we'll get is [[0],[1,2,3],[4,5],[6,7]].
This Levels is my solution to the last small bullet (see further explanation in point 3 below):
maybe should schedule all non-blocked procs for execution in same step?
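The level assignment described above can be sketched as a small function over a supplier map (the map representation, nodeLevels and toLevels are my own illustration, not the repo's actual code):

```haskell
import qualified Data.Map as M

type SId    = Int
type Levels = [[SId]]

-- Each node's level is 1 + the maximum level of its suppliers;
-- nodes with no suppliers (Ctrl, EmptyCtrl) get level 0.
-- Assumes the supplier relation forms a DAG; laziness ties the knot.
nodeLevels :: M.Map SId [SId] -> M.Map SId Int
nodeLevels sups = levels
  where
    levels = M.map levelOf sups
    levelOf [] = 0
    levelOf ss = 1 + maximum [ levels M.! s | s <- ss ]

-- Group the SIds by level into the Levels list structure.
toLevels :: M.Map SId Int -> Levels
toLevels lvls =
  [ [ s | (s, l) <- M.toList lvls, l == n ] | n <- [0 .. maxLvl] ]
  where maxLvl = maximum (0 : M.elems lvls)
```

On a DAG shaped like the example (one level-0 control node feeding S1-S3, etc.), this yields [[0],[1,2,3],[4,5],[6,7]].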
BList
To keep track of blocked Procs (e.g., a Pin proc in the Filling state) during the execution, I added the following type BList as a state component to the SvcodeP monad.
-- blocking list
-- (s1, Just s2): s1 is blocked on reading from s2
-- (s1, Nothing): s1 is runnable
type BList = [(SId, Maybe SId)]
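With this representation, the runnable test and the blocking update used by the scheduler can be sketched as pure functions over the list (a sketch only; in the interpreter these would live inside the SvcodeP monad, and the function names here are hypothetical):

```haskell
type SId = Int

-- blocking list, as above
type BList = [(SId, Maybe SId)]

-- A stream is runnable unless the list records it as blocked on another.
isRunnableB :: BList -> SId -> Bool
isRunnableB bl s = case lookup s bl of
                     Just (Just _) -> False
                     _             -> True

-- Record the new blocking status of s (Nothing = runnable again),
-- replacing any previous entry for s.
setBlockB :: (SId, Maybe SId) -> BList -> BList
setBlockB (s, b) bl = (s, b) : filter ((/= s) . fst) bl
```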
Two-phase scheduling
In the implementation of two-phase scheduling, Procs are executed from the lowest level to the highest. The runnable Procs of each level are picked according to the current BList. Also, the cost of these Procs can be counted as 1 step if the MIMD model is specified. See the details in the code:
-- `SInstr`: all stream definitions; `Int`: bufSize; `Levels`: level list; `Bool`: MIMD/SIMD flag
twoPhaSche :: [SInstr] -> Int -> Levels -> Bool -> SvcodeP ()
twoPhaSche _ _ [] _ = return ()
twoPhaSche ss bufSize (level:ls) mflag =
do runnable <- filterM isRunnable level
defs <- mapM (getInstr ss) runnable -- pick out all the runnable instrs
(_,s0) <- getCost
bls <- mapM (sInstrInterp bufSize) defs -- and run them
(w1,s1) <- getCost
mapM_ setBlock $ zip runnable bls -- update BList
if mflag -- MIMD flag: fix the step cost
then (if s1 > s0 then setCost (w1,s0+1) else setCost (w1,s1))
else return ()
-- check if it's necessary to run the rest levels
if all (\b -> case b of
Just _ -> True
_ -> False)
bls
then return ()
else twoPhaSche ss bufSize ls mflag -- loop to run the next level
The last if-then-else check is not a must-have: it is possible that even if all the SIds of the current level are blocked, some SIds of the next level are still runnable, so preventing them from executing seems unfair. My consideration is that the case where all or most of them are not runnable seems more probable, and the overhead of invoking a kernel for each of these streams (only for it to do nothing) would be expensive.
WithCtrl
An interesting observation is that the instruction WithCtrl ctrl code retSids no longer needs to be interpreted independently (i.e., there is no sInstrInterp _ (WithCtrl _ _ _) = ...) in the two-phase scheduling streaming interpreter, for the following reasons:
- The interpreter no longer executes the SInstrs (which include WithCtrl _ _ _) one by one, so there is no need to switch ctrl back and forth during runtime. (So the ctrl component in the monad should also be useless now.)
- There is no need for WithCtrl to skip counting the cost of code (as well as executing code) when ctrl is empty, because we no longer count each SDef sid sexp as 1 step. In the current cost model of the streaming interpreter, one step is only added when a buffer switches from the state Filling [] to Draining _ and back to Filling [] (or Eos). So the SIds inside code will cost nothing (or just constant work, 0 steps) if they don't produce anything at all. Also, there is no need to skip these SIds, as their buffer states will become Eos immediately.
Cost model
- work: charged at each Pin and Pout
- step: one step is added when a buffer switches from Filling [] to Draining _ and back to Filling [] (or Eos)
So the step part is different from the requirement:
when buffers are 1 long, should be same as work
but it would be the same as the eager interpreter's when buffers are infinitely long. Maybe we should discuss this tomorrow.
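As a concrete reference point for that discussion, the transition that currently triggers a step charge can be sketched as follows (a toy model: only the constructor names come from the interpreter, while the Int payload and the stepCharge function are my own illustration):

```haskell
-- Toy buffer states mirroring the constructors named above.
data BufState = Filling [Int] | Draining [Int] | Eos
  deriving (Eq, Show)

-- One step is charged when a buffer completes a round trip
-- Filling [] -> Draining _ -> Filling [] (or Eos), i.e., on the
-- transition out of Draining; all other transitions are free.
stepCharge :: BufState -> BufState -> Int
stepCharge (Draining _) (Filling []) = 1
stepCharge (Draining _) Eos          = 1
stepCharge _            _            = 0
```

With one-element buffers, every produced element completes such a round trip, which is why the step count then tracks the work rather than matching the eager interpreter's step count.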