bgottula / noodles

Framework for DSP flowgraphs.
MIT License

Backpressure and Feedback #23

Open bgottula opened 9 years ago

bgottula commented 9 years ago

For feedback to work properly, Noodles will need the following features:

jgottula commented 9 years ago

An unbounded max queue size will be problematic: source blocks never run into the "empty input queue" stop condition by definition, so they will work continually until their output queue fills up, which would be forever if the max queue size is unbounded.

Say the max queue size is 100. All of the source blocks will work until they've output 100 samples. Then they will stop (because their output queue is full) and let the downstream blocks start doing work, now that their input queues contain samples.


I will work on adding a mechanism for specifying the max queue size, a mechanism for specifying the initial queue fill, and a mechanism for Block code to ask its OutputPorts how much space is available in their queues. Should be pretty easy.


I'm going to add a virtual function (haven't decided on the name just yet) to class Block that will return a number representing "how much work could this block do right now, given the number of samples in its input queue(s) and how close to full its output queue(s) are". Each block subclass will need to implement this function, because how this is calculated may vary depending on what each particular type of block does. Usually it will be pretty trivial: for a 1-input, 1-output, 1:1 input/output rate ratio block, it would just be:

min(the_input_queue.size(), the_output_queue.max_size() - the_output_queue.size())
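Roughly the shape I have in mind, with placeholder names everywhere (work_available isn't settled, and the raw std::queue members just stand in for our real port/queue classes):

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>

// Sketch only: "work_available" is a placeholder name, and std::queue
// stands in for Noodles' real port/queue classes.
class Block {
public:
    virtual ~Block() {}
    virtual std::size_t work_available() const = 0;  // work possible now
    virtual void work() = 0;                         // do some of it
};

// 1-input, 1-output, 1:1 rate ratio: limited by whichever is smaller,
// samples waiting on the input or free space remaining on the output.
class Gain : public Block {
public:
    explicit Gain(std::size_t max_out) : max_out_(max_out) {}

    std::size_t work_available() const {
        return std::min(in_.size(), max_out_ - out_.size());
    }

    // Caller is expected to check work_available() first.
    void work() {
        out_.push(2.0f * in_.front());
        in_.pop();
    }

private:
    std::queue<float> in_, out_;
    std::size_t max_out_;  // std::queue has no capacity cap, so track one
};
```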

The Graph code will use the output of this function to determine which blocks are able to do work at the present time. Obviously the most straightforward thing to do is to have Graph::run not call Block::work on blocks that say they can do 0 work right now. A (possibly) even better thing would be to have Graph::run always call Block::work on whichever block can currently do the most work. I was thinking that I might store the graph's Block * pointers in a priority queue, and have the "amount of work I can do right now" number be the priority.
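A rough sketch of that scheduling loop, assuming the Block interface sketched above (naive on purpose: the heap is rebuilt every pass, because any call to work changes queue depths and therefore every neighboring block's priority):

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Orders blocks so the one that can do the most work sits on top.
struct ByWork {
    bool operator()(Block* a, Block* b) const {
        return a->work_available() < b->work_available();  // max-heap order
    }
};

void run(std::vector<Block*>& blocks) {
    while (!blocks.empty()) {
        // Rebuild each pass: priorities go stale as soon as a block works.
        std::priority_queue<Block*, std::vector<Block*>, ByWork> ready(
            blocks.begin(), blocks.end());
        Block* best = ready.top();
        if (best->work_available() == 0) break;  // no block can progress
        best->work();
    }
}
```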

It's pretty much certain that implementing it this way would result in the work functions being called in a different order and number of times than currently (it's basically just round-robin right now). What I'm not certain of is whether that's a bad thing or not. I guess as long as we're always calling work functions when they have inputs available and have output queue space available, then we're always doing the right thing... right? Does this violate any important constraints?

Another thing I'm not completely sure about is whether the "amount of work I can do right now" values will be comparable between types of blocks. (And whether or not this actually matters.) Since different blocks have different input/output rate ratios, would an interpolator that returns 10 for this function be able to do the same amount of "work" as a decimator that returns 10 for this function (given that the decimator needs 4x as many input samples to produce the same number of output samples as the interpolator)? Perhaps "the amount of work I can do right now" should be defined as the number of samples that can be output right now. But then, how would that work for blocks with multiple output ports? I may be overthinking this.
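To make that concrete, here's what "work = output samples producible right now" would look like for those two cases (the 4:1 and 1:4 ratios are just for illustration, and I'm ignoring the wrinkle that samples come and go in rate-ratio-sized groups):

```cpp
#include <algorithm>
#include <cstddef>

// A 4:1 decimator needs four input samples per output sample...
std::size_t decimator_work(std::size_t in_size, std::size_t out_free) {
    return std::min(in_size / 4, out_free);
}

// ...while a 1:4 interpolator produces four output samples per input.
std::size_t interpolator_work(std::size_t in_size, std::size_t out_free) {
    return std::min(in_size * 4, out_free);
}
```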

Name suggestions for this value (and for the class Block member function that calculates it) are welcome.

bgottula commented 9 years ago

Wow, lots of great thoughts.

At one point the code for source blocks defined a max sample count per call to work, but bounding the queue size is also a valid approach.

Regarding the priority scheduling, that sounds neat! It may even be possible to run each block in its own thread, which might eliminate the need for a scheduler entirely. Each work function could just be a while(1) loop blocking on queues. Running on a big 12-core server could be pretty quick! If we go that route we should use portable threading code (Boost).
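Something like this is the shape I'm imagining. The bounded queue is hand-rolled here just to show the blocking behavior (std::thread for brevity; boost::thread would be the portable spelling of the same thing):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Sketch: a bounded queue that blocks producers when full and consumers
// when empty. The blocking-when-full half is the backpressure.
template <typename T>
class BlockingQueue {
public:
    explicit BlockingQueue(std::size_t cap) : cap_(cap) {}

    void push(const T& v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < cap_; });
        q_.push(v);
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });
        T v = q_.front();
        q_.pop();
        not_full_.notify_one();
        return v;
    }

private:
    std::queue<T> q_;
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

// Each block's work function collapses to a loop that blocks on its queues.
// Launch with e.g. std::thread t(gain_thread, std::ref(in), std::ref(out));
void gain_thread(BlockingQueue<float>& in, BlockingQueue<float>& out) {
    for (;;) out.push(2.0f * in.pop());
}
```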

The one implication I'm a bit worried about is that, for verification of a large system like an entire receiver, we will write a CSV file with a column for each node in the system. A new row is written for each simulated cycle, and the time alignment of samples is critical. It may be possible to accomplish this in noodles by making the max queue length 1 for all noodles, but that remains to be seen.


jgottula commented 9 years ago

Good point about threading. If each block's work function is its own thread, and they just sleep until their inputs-not-empty and outputs-not-full conditions are true, then that really does simplify things. By sleeping when not ready to do work, we're basically signalling to the OS's scheduler when we want threads to run. And we get extra performance, in the form of concurrency, from the OS basically for free. (Has making a program threaded ever actually made things simpler? We may have violated some fundamental law of software development.)

Perhaps we should make two branches, one for the thread implementation (maybe try that one first since it sounds like it could potentially be easier to write), and another for the scheduler implementation, and see how they compare.

The thread implementation will require that STL queue (and possibly other relevant things) be thread-safe, which I think is generally the case; I can't remember if it's implementation-specific or guaranteed by the standard. Failing that, just use mutexes everywhere; I think that's usually how real programmers make their software "multithreaded".

Making the program threaded does open a new can of worms, in the sense that it just generally makes debugging unpleasant and reduces the ease with which we can convince ourselves that our implementation is correct. On the other hand, assuming that we do this the Right Way (and we will do this the Right Way), we will have it so that basically the only thing in the entire program that will ever share memory or pass data between threads is the noodle code, which effectively constrains the potentially problematic parts to one section of the program.


By the way, I suspect that large parts, or maybe all, of the Boost thread library were merged into C++11 (by being part of C++ TR1 or otherwise). Have a look at http://www.cplusplus.com/reference/multithreading/. Boost might still have more features, though. We should look into that.


Also, I'm merging branch rewrite_everything into master now.

bgottula commented 9 years ago

Yeah, I was just looking into the thread safety of STL queue. Based on StackOverflow responses, it is implementation-dependent. This part of Boost Thread looks interesting, though it's marked as experimental: http://www.boost.org/doc/libs/1_56_0/doc/html/thread/sds.html#thread.sds.synchronized_queues

One major downside of threading besides what you mentioned is the same case I was describing for generating verification vectors for a whole system. In the current code, the vector generator (which is a friend of multiple classes) is able to execute the simulation sample-by-sample and write a row to a CSV file representing the state of multiple important nodes in the system at each instant. It's not clear how this would be possible in a multi-threaded implementation.

bgottula commented 9 years ago

I'm a little hesitant to adopt C++11 features because I'm using Visual C++ 2010 Ultimate at work and upgrading to 2013 would most likely mean dropping down to the free version which does not include profiling.

jgottula commented 9 years ago

The step-by-step verification case that you're talking about represents a relatively small portion of the times that you'll be running Noodles, and performance matters less for that case; do I have this correct?

To do step-by-step verification, here's an idea: each time any block calls outputs.put (and possibly also when any block calls inputs.get; I'm not sure if both would be needed), the whole program synchronizes, meaning every thread pauses. This could be implemented with a global mutex that gets held for the entirety of outputs.put and/or inputs.get. While everything is paused, the state (the samples contained within all noodles) gets recorded, and then concurrent execution resumes. Would this approach work? It retains the concurrency of the stuff internal to the blocks (which I'm assuming is considered to be "black box" state that doesn't need to be recorded), but serializes the noodle-related, i.e. external, events so that it's possible to create a coherent series of snapshots of where the samples are.
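Sketched out, the idea is something like this (all names here are made up: g_snapshot_mutex, record_all_noodles, and put_and_record are not existing Noodles code, they just show the shape):

```cpp
#include <mutex>
#include <queue>

// Every put contends on one global lock, so noodle events serialize and
// each CSV row sees a consistent picture of where all the samples are.
std::mutex g_snapshot_mutex;

void record_all_noodles() { /* write one CSV row of all queue contents */ }

template <typename T>
void put_and_record(std::queue<T>& noodle, const T& sample) {
    std::lock_guard<std::mutex> lk(g_snapshot_mutex);
    noodle.push(sample);   // the actual outputs.put work
    record_all_noodles();  // snapshot while no other noodle event can run
}
```

Note this serializes only the individual put/get calls, not the work a block does in between them.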

Thinking about this further, for verification, is it necessary that block work appears "instantaneous", that is, when an input sample enters a block, nothing unrelated can happen until the work is done and the output sample exits the block? In other words, is it important to avoid situations like this:

block A takes an input sample out of noodle A; state recorded
block B takes an input sample out of noodle B; state recorded
block A puts an output sample into noodle C; state recorded
block B puts an output sample into noodle D; state recorded

If so, it might be necessary to hold the global mutex from when inputs.get is called to when outputs.put is called, in which case we are serializing everything (the block work is no longer concurrent). This still wouldn't be a tragedy, because it means we can still have a fast parallel implementation that just happens to have a slow, methodical non-parallel verification mode.
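Concretely (same made-up names as before), the stricter mode would just widen the locked region:

```cpp
#include <mutex>
#include <queue>

extern std::mutex g_snapshot_mutex;  // the same hypothetical global lock

// "Instantaneous" mode sketch: the lock spans the whole work step, so no
// unrelated noodle event can interleave between the get and the put.
template <typename T, typename Fn>
void work_step(std::queue<T>& in, std::queue<T>& out, Fn dsp) {
    std::lock_guard<std::mutex> lk(g_snapshot_mutex);
    T x = in.front();   // inputs.get
    in.pop();
    out.push(dsp(x));   // outputs.put
}
```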

I guess you could also have a separate, non-threaded scheduler (the "slow edition") just for the step-by-step simulation. Would that be such a bad thing?


Have you happened to try compiling the current version in VS? I guess you'd have to redo the important bits of the Makefile in the project build configuration. I'm interested to know how standards-compliant we are right now. I think we're pretty good, aside from already using a couple of C++11 features here and there. I modified the Makefile to turn off GCC extensions (c7322dc), and there weren't any errors. I'm leaving that in from now on, just to keep things slightly stricter.

bgottula commented 9 years ago

Ideally I would only need samples at the noodles, but in practice some internal block state is needed to diagnose problems.

Those approaches sound promising. I can't tell for sure if they will work as intended. When the project gets sufficiently mature I will try to implement a real flowgraph at work and see if the same vectors come out. That might be a big refactoring project though.

I have not tried this in VS but that's a good idea.


jgottula commented 9 years ago

All of the features mentioned in the first post of this issue have been implemented at this point. I think we should leave the issue open, though, because there are plenty of good points in here related to multithreading and test vector generation.