B-Lang-org / bsc

Bluespec Compiler (BSC)

Inefficient lifting of implicit conditions on large nested if-then-else #551

Open edowson opened 1 year ago

edowson commented 1 year ago

Hi,

I have an example where bsc fails to generate code for a polymorphic module that works with a 6x6 vector of vectors.

In the attached sources, the summed-area-table-using-regvv-float64.zip example works.

summed-area-table-using-regvv-float64.zip

However, for a polymorphic version of the module, in summed-area-table-using-regvv-float64-polymorphic.zip, bsc hangs indefinitely, with its memory usage growing steadily past 34 GB and never stopping:

summed-area-table-using-regvv-float64-polymorphic.zip

I enabled the -v -show-stats flags, and bsc hangs at this point:

modules: [mkSAT_6_6_double]

*****
code generation for mkSAT_6_6_double starts
starting expanded
quark17 commented 1 year ago

Using the flag -show-elab-progress, I can see that BSC gets stuck computing the implicit condition for that action. The implicit condition is there because the input to the logic comes from f_in, which is a FIFO. This also interacts with the fact that your data type is a FloatingPoint value.

I find that if I change the matrix to use Bit#(64) instead of Double, then it compiles extremely fast.

The + operation for FloatingPoint is defined as a long if-then-else tree, handling many possible situations (infinities, different forms of not-a-number, etc.). This operation is being applied to values coming from state, which means that BSC can't know which branch it will take, and therefore the generated logic has to keep all branches. When you apply several operations in a row, an extremely large if-then-else tree forms. BSC is unfortunately generating a very large structure, and it is having trouble walking that structure to do implicit condition analysis.
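The mechanism can be seen in a tiny sketch (hypothetical module and rule names, not from the attached sources): in BSV, reading `first` on a FIFO carries the implicit condition "the FIFO is not empty", and BSC must lift that condition through every expression derived from it.

```bsv
import FIFO :: *;
import FloatingPoint :: *;

module mkImplicitCondDemo (Empty);
   FIFO #(Double) f <- mkFIFO;
   Reg #(Double)  r <- mkRegU;

   // f.first has the implicit condition that f is not empty. Every
   // expression built from it, including the whole FloatingPoint
   // if-then-else tree inside (+), inherits that condition, which BSC
   // must lift into the rule's readiness (CAN_FIRE) signal.
   rule from_fifo;
      r <= f.first + f.first;
      f.deq;
   endrule
endmodule
```

Reading from a plain Reg, by contrast, has no implicit condition, which is why a register-based variant elaborates more quickly.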

We can help BSC by telling it to put a synthesis boundary around the + operation, so that its internal logic is not inlined. That way, the logic never merges into one giant structure. You can do this as follows:

(* noinline *)
function Double dplus(Double x, Double y);
  return \+ (x, y);
endfunction

function VV #(nj, ni, Double) fn_pass (VV #(ni, nj, Double) vvX);
    let vv1   = transpose (vvX);
    //let binop = \+  ;
    let binop = dplus ;
    let vv2   = map (sscanl (binop, 0), vv1);
    return vv2;
endfunction

When I do this, the code compiles fast.

However ...

Even for a 6 by 6 matrix, you're generating a large amount of logic to execute in one clock cycle. Even if the design compiles, it is not going to make feasible hardware. You will want to rewrite this one Action as a pipeline of stages, each doing a bite-sized piece of the computation per clock cycle. If you rewrite the design like that, you also won't run into the implicit condition problem and won't need the noinline attribute (because you won't have generated large structures).

FYI, I also made a smaller example of your issue, here: Test.bsv. We can use this to investigate BSC's inefficiency. But in any case, I recommend using noinline for now and eventually rewriting it to be feasible hardware.

quark17 commented 1 year ago

Your "non-polymorphic" version compiles faster because the input is coming from a register, not a FIFO, and therefore doesn't have an implicit condition. However, it still takes a few seconds. Using noinline, it would probably compile even faster.

Oddly, if I change your "polymorphic" version to read from a register instead of a FIFO, it completes, but a little more slowly than your "non-polymorphic" version. I would guess that's because the "polymorphic" version also has other differences, such as embedding the logic inside an FSM. In any case, adding noinline makes it compile fast. (But this requires you to change how you do polymorphism, because the type of the noinline function needs to be fixed.)
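One way to keep the module polymorphic while still using noinline is to pass the fixed-type adder in as a module parameter, so that only the instantiation site commits to a concrete type. A hypothetical sketch (names are illustrative, not from the attached sources):

```bsv
import FloatingPoint :: *;

// noinline functions must be top-level and monomorphic
(* noinline *)
function Double dplus (Double x, Double y);
   return x + y;
endfunction

// The module stays polymorphic in t; the caller supplies the binop.
module mkSAT_poly #(function t binop (t x, t y)) (Empty)
   provisos (Bits #(t, tsz));
   // ... the summed-area-table logic, using binop in place of (+) ...
endmodule

// At a fixed-type instantiation site:
// let sat <- mkSAT_poly (dplus);
```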

edowson commented 1 year ago

That was pretty tricky to isolate.

You will want to rewrite this one Action as a pipeline of stages, each doing a bite-sized piece of the computation per clock cycle.

Could you please show me how I can do this?

I don't quite follow "which action" to rewrite and create a "pipeline of stages".

quark17 commented 1 year ago

If I understand correctly, this Action:

VV #(NI, NJ, T) vv = ... f_in.first ...;
let vv1 = fn_pass(vv);
let vv2 = fn_pass(vv1);
rg_sat <= vv2;

is instantiating 72 floating-point adders, and as many as 12 of them are applied in series. Depending on your target hardware, 72 fp adders might be a large amount of area (though possibly okay). More significantly, executing 12 fp adders in sequence will create a long logic path -- since your design is attempting to do that all in one clock cycle, that means you'll need a very low clock speed, which is probably not what you want.

If you want to run this at a faster clock speed, you'll need to re-think the architecture. What you choose is up to you, as the hardware designer. As an example, though, you could instantiate 6 fp adders in parallel, and loop over 12 clock cycles, to do the full 72 fp additions. Or, if area is not a concern, you could unroll the loop into 12 pipeline stages that each have 6 fp adders.
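The first option might be sketched like this (a rough fragment, not from the attached sources; `one_row_pass` is a hypothetical helper that applies 6 fp adders in parallel to one slice of the matrix):

```bsv
Reg #(VV #(6, 6, Double)) rg_work <- mkRegU;
Reg #(UInt #(4))          rg_step <- mkReg (0);

// One group of 6 parallel fp adders per clock cycle; after 12 cycles
// all 72 additions (both fn_pass-style sweeps) are complete.
rule step_pass (rg_step < 12);
   rg_work <= one_row_pass (rg_work, rg_step);  // hypothetical helper
   rg_step <= rg_step + 1;
endrule
```

Unrolling the same loop into 12 FIFO-connected stages gives the pipelined alternative, trading area for throughput.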