cucapra / dahlia

Time-sensitive affine types for predictable hardware generation
https://capra.cs.cornell.edu/dahlia
MIT License

Validating claims #227

Closed rachitnigam closed 4 years ago

rachitnigam commented 5 years ago

We're making several claims in the paper about what makes HLS hard and what kinds of errors pop up when working with HLS tools. From my perspective, to justify these claims we need to run non-benchmarking experiments, collect issues from the Xilinx help forums, document the error messages HLS provides, and look at the Verilog designs it generates.

Feel free to correct the claims, add more experiments, or propose a different empirical way of validating them.

Claims to validate:

  1. Unrolling/partitioning errors are important and hard to debug.

    • [ ] important: Easy to validate. The HLS manual claims that unrolling improves the performance of designs.
    • [ ] hard to debug: Demonstrate that Fuse provides a quantitative/qualitative benefit over the partitioning error messages generated by HLS. Document the error messages and see if we do better (see the sketch after this list).
  2. Designs generated by HLS are unpredictable.

    • [ ] unpredictable: The unpredictability of HLS designs comes from the scheduling stage. Not sure what the experiment should be here. Consider generating designs with various clocks and "optimization effort" settings and seeing how unpredictable they look.
    • [ ] Once we show these designs are unpredictable, we have a harder job of convincing the reader that Fuse can do any better.
  3. Views provide a quantitative improvement when compiling programs that describe complex iteration patterns.
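
To make the first claim concrete, here is a minimal sketch of the silent mismatch we have in mind, assuming Vivado HLS-style pragmas (the kernel name, sizes, and factors are made up for illustration):

```cpp
// Hypothetical kernel: the unroll factor (8) exceeds the cyclic
// partitioning factor (2), so the eight parallel reads per iteration
// contend for two banks. HLS tools will generally accept this without
// a hard error and schedule the extra accesses over multiple cycles,
// which is exactly the silent slowdown we want to surface as a static
// error.
void vadd(int a[64], int b[64], int c[64]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=2
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=2
#pragma HLS ARRAY_PARTITION variable=c cyclic factor=2
  for (int i = 0; i < 64; i++) {
#pragma HLS UNROLL factor=8
    c[i] = a[i] + b[i];
  }
}
```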

Some of these experiments feel like moving the goalposts when it comes to writing the paper (we already cover MachSuite! Why add more experiments?), but I think it's crucial for us (or at the very least, for me) to have confidence in the statements above and make sure they are backed by real experiments.

sampsyo commented 5 years ago

Thanks for bringing these up; quantitative justification for some of these claims would be really useful.

For 1 & 2, I think there's a connection here that suggests a particularly relevant experiment: showing that the hardware actually does get better when the unrolling matches the banking factors of the arrays that the loop accesses. That is, an experiment might look like this:

Then, the arguments for 1 & 2 go like this:

To summarize, a lot of our claim boils down to the ability to turn silent bad behavior into static errors. It's easy to show that we give static errors. But we should empirically demonstrate (yes, in a benchmark-free way) that the bad behavior we're avoiding actually exists.
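
A rough sketch of a kernel such a sweep could be built around is below; the function name, loop label, and the idea of applying directives from a Tcl script are assumptions about a Vivado HLS-style flow, not something we have written yet:

```cpp
// Hypothetical sweep kernel: the loop is labeled so a Tcl script can
// apply set_directive_unroll and set_directive_array_partition with a
// different (unroll factor, banking factor) pair per run, and the
// reported latency and resource usage is recorded for each point in
// the grid.
void scale(int a[256], int out[256]) {
scale_loop:
  for (int i = 0; i < 256; i++) {
    out[i] = a[i] * 3 + 1;
  }
}
```

Plotting latency over the (unroll, banking) grid would directly show whether mismatched factors silently degrade the design (claim 1) and how noisy the trend is (claim 2).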


For views, I feel somewhat less urgency about demonstrating something empirically. Specifically, I think the utility of views comes from expressiveness. That is, a language without them is so restrictive that it's not even possible to express the iteration patterns needed to implement important algorithms. The top-priority thing we need to demonstrate is that views are actually useful, i.e., that they combine to cover lots of iteration patterns you actually want to express. This is most naturally done through benchmarks, where we still have work to do to actually use views. I think the paper would be best if it included some qualitative discussion (i.e., examples) of how views were useful for expressing things that otherwise require difficult-to-analyze index math.
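
For a concrete (made-up) instance of the index math in question, here is a blocked traversal written directly against a flat array; the `b * B + i` expression is exactly what a typechecker or HLS scheduler has to analyze before it can tell that parallel accesses land in distinct banks:

```cpp
// Hypothetical blocked traversal: without a view, the block offset and
// the in-block index are fused into one arithmetic expression, so any
// analysis of parallel access safety has to reason about b * B + i.
const int B = 8;

void block_sum(int a[64], int sums[8]) {
  for (int b = 0; b < 8; b++) {
    int acc = 0;
    for (int i = 0; i < B; i++) {
      acc += a[b * B + i];  // the index math a view would abstract away
    }
    sums[b] = acc;
  }
}
```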

Of course, it would also be nice to demonstrate that our view primitives have efficient implementations. This would be nice to do with microbenchmarking, etc., but the reason I think it's less urgent is that we can give "existence proofs" of the index arithmetic that is necessary to implement them. The point is that, regardless of whether you use views or do the arithmetic yourself, you're going to have to use index arithmetic to get the effect you want. All views do is let the typechecker correctly analyze that index math.

I still think the experiments in https://github.com/cucapra/fuse-benchmarks/pull/93 will be nice to have, but what they amount to measuring is a comparison between the HLS tool's built-in index math vs. what it generates when that index math is explicit, for a particular implementation. Writing RTL ourselves would also be awesome, but it's less meaningful without the context of RTL for an entire application—it kind of lacks a baseline for comparison.


I've been meaning to make an issue about this, but this is as good a place as any. I think there are some other "microbenchmark" experiments we should run on our HLS tool:

rachitnigam commented 5 years ago

A different views experiment that might be useful for demonstrating their expressiveness is writing down all the different parallel patterns for blocks of size up to 3x3 and seeing how many of them we can parallelize easily.
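
As one illustrative point in that space (hypothetical HLS-style C++, not one of our benchmarks), reading a full 3x3 block in parallel only works when both dimensions are completely partitioned:

```cpp
// Hypothetical 3x3 pattern: reading all nine elements in one cycle
// requires complete partitioning of both dimensions; with coarser
// banking, the fully unrolled reads get serialized.
void window_sum(int block[3][3], int *out) {
#pragma HLS ARRAY_PARTITION variable=block complete dim=0
  int acc = 0;
  for (int r = 0; r < 3; r++) {
#pragma HLS UNROLL
    for (int c = 0; c < 3; c++) {
#pragma HLS UNROLL
      acc += block[r][c];
    }
  }
  *out = acc;
}
```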

rachitnigam commented 5 years ago

Another suggestion from @sampsyo was to figure out how complicated the indexing expressions need to be before the generated design becomes terrible.
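
One possible shape for that microbenchmark, with made-up kernels and index expressions, is a family of loops that differ only in how they index the array:

```cpp
// Hypothetical family of kernels with increasingly complex indexing.
// Synthesizing each one and comparing latency/II and resources would
// show roughly where the scheduler stops producing good hardware.
void idx_affine(int a[256], int out[128]) {
  for (int i = 0; i < 128; i++)
    out[i] = a[2 * i + 1];        // affine index
}

void idx_modular(int a[256], int out[128]) {
  for (int i = 0; i < 128; i++)
    out[i] = a[(i * 7) % 256];    // modular index
}

void idx_indirect(int a[256], int idx[128], int out[128]) {
  for (int i = 0; i < 128; i++)
    out[i] = a[idx[i]];           // data-dependent index
}
```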

rachitnigam commented 4 years ago

Made case studies for the paper.