Ginko-X / Streaming_NESL


add accounting for work/step costs #19

Closed: afilinski closed this 6 years ago

Ginko-X commented 7 years ago

In the latest update, I've added work/step costs to the streaming interpreter (although there is still some inaccuracy or inconsistency in this model). Regarding your last small bullet:

maybe should schedule all non-blocked procs for execution in same step?

As we've discussed before, this means that we'll measure the cost on a MIMD machine. In that case, do you think we also need to modify the high-level cost model we already have for SNESL, as well as the low-level one for the fully-eager interpreter, so that all of these models remain consistent and comparable?

afilinski commented 7 years ago

No, I think that, for now, we can leave the high-level and non-streaming interpreters alone, and maybe revisit them once we have recursion working completely in the streaming interpreter, and also have some accounting for space usage. (Because in the high-level model, the expected space usage of e_1+e_2 also depends on whether the e_i are nominally evaluated in parallel or sequentially, if they both use a lot of temporary space, e.g., if they are both reducePlus(&n_i) for some large n_1 and n_2.) As for the eager interpreter, I expect that it can eventually be completely subsumed by the streaming one, where one lets the buffer size be unbounded (or sets it to some ridiculously high value, such as 10^9).

Even if you implement the two-phase execution model in the streaming interpreter (i.e., in each round, first identify all non-blocked processes, and then run them all), you still have the choice of whether to count the individual process invocations in the second phase as separate steps (SIMD) or one step (MIMD) for the cost model, possibly determined by a boolean flag. I'm just thinking that the two-phase model may be preferable because (1) it's easier to adapt to a practically realistic, non-polling scheduler, where you don't need to re-check all processes on every scheduling round, but have them automatically marked as runnable as soon as some other process modifies the buffer they are blocked on; and more importantly (2) it doesn't depend on some arbitrary scheduling order, so it may be easier to analyze, especially with regard to the emptiness test in WithCtrl. (Note that the "stealing policy", i.e., how to break a non-fatal deadlock, will still introduce some arbitrariness, though.)

One other thing: you are currently calculating the ratios between the high-level and low-level costs. One more number that would be useful to have is S_L / (W_H/bs + S_H). If we take the buffer size to represent the number of processors (so each low-level step can happen in constant time), this compares the actual running time of the compiled code on bs processors against the running time predicted by the high-level model and Brent's lemma. (In fact, maybe you should simply report this number instead of the current S_L / S_H, because that one is meaningless when bs is not near-infinite.)

And in any case, it may be useful to report the ratios with one digit after the decimal point, e.g., 4.0 instead of just 4; but more than one digit is probably not relevant, since the number seems to generally fall in the range 3-6 or so.
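
For concreteness, here is a minimal sketch (the helper names are mine, not from the actual code) of computing and reporting that number with one decimal digit:

import Text.Printf (printf)

-- S_L / (W_H/bs + S_H): actual low-level steps vs. the running time
-- predicted by the high-level model and Brent's lemma on bs processors.
brentRatio :: Int -> Int -> Int -> Int -> Double
brentRatio sL wH sH bs =
  fromIntegral sL / (fromIntegral wH / fromIntegral bs + fromIntegral sH)

reportRatio :: Int -> Int -> Int -> Int -> IO ()
reportRatio sL wH sH bs =
  printf "S_L / (W_H/bs + S_H) = %.1f\n" (brentRatio sL wH sH bs)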

Ginko-X commented 7 years ago

I'm just thinking that the two-phase model may be preferable because (1) it's easier to adapt to a practically realistic, non-polling scheduler, ... (2) it doesn't depend on some arbitrary scheduling order, so it may be easier to analyze, especially in regards to the emptiness test in WithCtrl.

I think I understand the basic idea, but, with my current knowledge, I can't quite picture the whole expected streaming interpreter (or the entire execution model of SNESL). For example, why is it easier to adapt to a practically realistic, non-polling scheduler, and what exactly do you mean by a practically realistic scheduler? Regarding a real implementation environment for SNESL, I have no idea how much work we can do with scheduling, or what kinds of problems/limitations/possibilities there are for the scheduler.

So could you explain this in more detail, or show me some clearer goals for both the short and the long run (maybe in this week's meeting)?

Btw, as I understand it, the scheduling strategy Frederik used, which he calls loop scheduling in his PhD thesis, looks broadly similar to the one I just implemented:

Loop scheduling executes the transducers in succession from the first to the last, starting over unless the network has reached completion. The transducers fire regardless of the fullness of the buffers. Consequently, a transducer may fire on almost empty input, which is sub-optimal in theory. In particular, we break the step part of the work/step cost model.
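
If I read that correctly, loop scheduling amounts to something like the following sketch, where sInstrInterp is the per-instruction interpreter from this repo and allEos is a hypothetical test for whether every stream has reached Eos:

loopSche :: [SInstr] -> Int -> SvcodeP ()
loopSche ss bufSize =
  do mapM_ (sInstrInterp bufSize) ss  -- fire every transducer once, in order
     done <- allEos                   -- hypothetical: has the network completed?
     if done then return () else loopSche ss bufSize  -- start over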

afilinski commented 7 years ago

Yes, let's talk about this tomorrow.

Ginko-X commented 7 years ago

As of the latest update (6e3ab620f262b9ac7ad3316f3665dc8fba5ad876), two-phase scheduling has been added to the streaming interpreter. Here are the main related tasks:

  1. SId levels. Before the runtime starts, one more piece of preparation work is needed: generating a level list, i.e., a classification of the DAG's nodes by level. It is produced by a breadth-first traversal of the DAG. Root nodes, such as Ctrl and EmptyCtrl, have the lowest level, 0; every other node's level is one more than its supplier's level. (If a node has more than one supplier, its level is one plus the maximum of its suppliers' levels; a sketch of this computation is given at the end of this point.)
type Levels = [[SId]]   -- level list data structure

For example, for the expression {1,2}, its DAG is as follows:

[image: DAG for the expression {1,2}]

It's clear that S1, S2, and S3 are independent, so they can run in the same step, and the final level list we get is [[0],[1,2,3],[4,5],[6,7]].

This Levels list is my solution to the last small bullet (see the further explanation in point 3 below):

maybe should schedule all non-blocked procs for execution in same step?
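
For reference, a minimal sketch of the level computation: instead of the breadth-first traversal, it derives the same levels recursively from a supplier map (the map contents and the assumption SId = Int are illustrative; the real edges come from the generated DAG):

import qualified Data.Map as M
import Data.List (groupBy, sortOn)

type SId    = Int
type Levels = [[SId]]

-- Map every SId to one plus the maximum level of its suppliers
-- (0 for root nodes), then group the SIds by level.
levelsOf :: M.Map SId [SId] -> Levels
levelsOf sups =
    map (map fst) . groupBy (\a b -> snd a == snd b) . sortOn snd $ M.toList lvls
  where
    lvls   = M.map lvl sups  -- lazy knot-tying; fine, since the graph is acyclic
    lvl [] = 0               -- root node, e.g. Ctrl or EmptyCtrl
    lvl ss = 1 + maximum [lvls M.! s | s <- ss]

-- With a supplier map consistent with the example above, e.g.
--   M.fromList [(0,[]),(1,[0]),(2,[0]),(3,[0]),(4,[1,2]),(5,[3]),(6,[4]),(7,[5])]
-- this yields [[0],[1,2,3],[4,5],[6,7]].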

  2. BList. To keep track of blocking information during execution (only read blocking, i.e., a Pin proc whose input buffer is still in the Filling state), I added the following BList type as a state component of the SvcodeP monad.
-- blocking list
-- (s1, Just s2): s1 is blocked reading from s2
-- (s1, Nothing): s1 is runnable
type BList = [(SId, Maybe SId)]
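
As a sketch, the helpers the scheduler below relies on might look as follows (getBList and putBList are hypothetical accessors for this state component):

-- Is an SId runnable according to the current BList?
isRunnable :: SId -> SvcodeP Bool
isRunnable s =
  do bl <- getBList
     case lookup s bl of
       Just (Just _) -> return False  -- blocked reading from some stream
       _             -> return True   -- runnable (or not yet recorded)

-- Record the latest blocking information for an SId.
setBlock :: (SId, Maybe SId) -> SvcodeP ()
setBlock (s, b) =
  do bl <- getBList
     putBList ((s, b) : filter ((/= s) . fst) bl)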
  3. Two-phase scheduling. In the implementation of two-phase scheduling, Procs are executed from the lowest level to the highest. The runnable Procs of each level are picked according to the current BList. Also, the cost of these Procs can be measured as 1 step if the MIMD model is specified. See the code for details:

    -- `SInstr`: all stream definitions; `Int`: buffer size; `Levels`: level list; `Bool`: MIMD/SIMD flag
    twoPhaSche :: [SInstr] -> Int -> Levels -> Bool -> SvcodeP ()
    twoPhaSche _ _ [] _ = return ()
    twoPhaSche ss bufSize (level:ls) mflag =
      do runnable <- filterM isRunnable level
         defs <- mapM (getInstr ss) runnable      -- pick out all the runnable instrs

         (_, s0) <- getCost
         bls <- mapM (sInstrInterp bufSize) defs  -- and run them
         (w1, s1) <- getCost

         mapM_ setBlock $ zip runnable bls        -- update the BList

         if mflag  -- MIMD flag: count the whole phase as at most one step
           then (if s1 > s0 then setCost (w1, s0+1) else setCost (w1, s1))
           else return ()

         -- check whether it is necessary to run the remaining levels
         if all (\b -> case b of
                         Just _ -> True
                         _      -> False)
                bls
           then return ()
           else twoPhaSche ss bufSize ls mflag    -- loop to run the next level

    The last if-then-else check is not strictly necessary: even if all the SIds of the current level are blocked, some SIds of the next level may still be runnable, so cutting execution off there may seem unfair. My reasoning is that, in this case, most of the remaining streams are probably not runnable either, and the overhead of invoking a kernel for each of them (only for it to do nothing) would be expensive.
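
    For context, a hypothetical driver (not taken from the actual code) would repeat the two-phase pass over all levels until the network completes, reusing the allEos test sketched earlier:

    runRounds :: [SInstr] -> Int -> Levels -> Bool -> SvcodeP ()
    runRounds ss bufSize lvls mflag =
      do twoPhaSche ss bufSize lvls mflag  -- one full pass over the levels
         done <- allEos                    -- hypothetical completion test
         if done then return () else runRounds ss bufSize lvls mflag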

  4. WithCtrl. An interesting observation is that the instruction WithCtrl ctrl code retSids no longer needs to be interpreted separately (i.e., there is no sInstrInterp _ (WithCtrl _ _ _) = ... case) in the two-phase scheduling streaming interpreter, for the following reasons:

    • Two-phase scheduling picks out runnable SIds to execute, rather than executing SInstrs (which include WithCtrl _ _ _) one by one.
    • As supplier information is already maintained in the state of each stream, we don't need to set ctrl back and forth during runtime. (So the ctrl component in the monad should now be useless as well.)
    • In the eager interpreter, we use WithCtrl to skip counting the cost of code (as well as executing code) when ctrl is empty, because we count each SDef sid sexp as 1 step. In the current cost model of the streaming interpreter, one step is only added when a buffer switches from state Filling [] to Draining _ and back to Filling [] (or Eos). So the SIds inside code will cost nothing (or just constant work and 0 steps) if they don't produce anything at all. Also, there is no need to skip these SIds, as their buffer states will become Eos immediately.
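
    To make the step rule concrete, here is a minimal sketch of the buffer states and the charging rule (the BufState constructors follow the names used above; the transition hook itself is hypothetical):

    data BufState a = Filling [a] | Draining [a] | Eos

    -- One step is charged when a buffer completes the cycle
    -- Filling [] -> Draining _ -> Filling [] (or Eos).
    chargeStep :: BufState a -> BufState a -> (Int, Int) -> (Int, Int)
    chargeStep (Draining _) (Filling []) (w, s) = (w, s + 1)
    chargeStep (Draining _) Eos          (w, s) = (w, s + 1)
    chargeStep _            _            cost   = cost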
  5. Cost model

    • work: the total number of Pin and Pout operations
    • step: as explained in point 4, the total number of switches from Filling [] to Draining _ and back to Filling [] (or Eos)

So the step part is different from the requirement:

when buffers are 1 long, should be same as work

but it would be the same as the eager interpreter's when buffers are infinitely long. Maybe we should discuss this tomorrow.