cucapra / dahlia

Time-sensitive affine types for predictable hardware generation
https://capra.cs.cornell.edu/dahlia
MIT License

Validating claims #227

Closed rachitnigam closed 4 years ago

rachitnigam commented 5 years ago

We're making several claims in the paper about what makes HLS hard and what kinds of errors pop up when working with HLS tools. From my perspective, to justify these claims we need to run non-benchmarking experiments, collect issues from the Xilinx help forums, document the error messages HLS provides, and look at the Verilog designs it generates.

Feel free to correct the claims, add more experiments, or propose a different empirical way of validating them.

Claims to validate:

  1. Unrolling/partitioning errors are important and hard to debug.

    • [ ] important: Easy to validate. The HLS manual claims that unrolling improves the performance of designs.
    • [ ] hard to debug: Demonstrate that Fuse provides a quantitative/qualitative benefit over the partitioning error messages generated by HLS. Document the error messages and see if we do better (see the sketch after this list).
  2. Designs generated by HLS are unpredictable.

    • [ ] unpredictable: The unpredictability of HLS designs comes from the scheduling stage. Not sure what the experiment should be here. Consider generating designs with various clocks and "optimization effort" settings and seeing how unpredictable they look.
    • [ ] Once we show these designs are unpredictable, we have a harder job of convincing the reader that Fuse can do any better.
  3. Views provide a quantitative improvement when compiling programs that describe complex iteration patterns.
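
To make the first claim concrete, here is a minimal sketch of the silent mismatch we have in mind, assuming Vivado HLS-style pragmas (the kernel name, sizes, and factors are made up for illustration):

```cpp
// Hypothetical kernel: the unroll factor (8) exceeds the cyclic
// partitioning factor (2), so the eight parallel reads per iteration
// contend for two banks. HLS tools will generally accept this without
// a hard error and schedule the extra accesses over multiple cycles,
// which is exactly the silent slowdown we want to surface as a static
// error.
void vadd(int a[64], int b[64], int c[64]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=2
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=2
#pragma HLS ARRAY_PARTITION variable=c cyclic factor=2
  for (int i = 0; i < 64; i++) {
#pragma HLS UNROLL factor=8
    c[i] = a[i] + b[i];
  }
}
```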

Some of these experiments feel like moving the goalposts when it comes to writing the paper (we already cover MachSuite! Why add more experiments?), but I think it's crucial for us (or at the very least, for me) to have confidence in the statements above and make sure they are backed by real experiments.

sampsyo commented 5 years ago

Thanks for bringing these up; quantitative justification for some of these claims would be really useful.

For 1 & 2, I think there's a connection here that suggests a particularly relevant experiment: showing that the hardware actually does get better when the unrolling matches the banking factors of the arrays that the loop accesses. That is, an experiment might look like this:

Then, the arguments for 1 & 2 go like this:

To summarize, a lot of our claim boils down to the ability to turn silent bad behavior into static errors. It's easy to show that we give static errors. But we should empirically demonstrate (yes, in a benchmark-free way) that the bad behavior we're avoiding actually exists.
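
A rough sketch of a kernel such a sweep could be built around is below; the function name, loop label, and the idea of applying directives from a Tcl script are assumptions about a Vivado HLS-style flow, not something we have written yet:

```cpp
// Hypothetical sweep kernel: the loop is labeled so a Tcl script can
// apply set_directive_unroll and set_directive_array_partition with a
// different (unroll factor, banking factor) pair per run, and the
// reported latency and resource usage is recorded for each point in
// the grid.
void scale(int a[256], int out[256]) {
scale_loop:
  for (int i = 0; i < 256; i++) {
    out[i] = a[i] * 3 + 1;
  }
}
```

Plotting latency over the (unroll, banking) grid would directly show whether mismatched factors silently degrade the design (claim 1) and how noisy the trend is (claim 2).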


For views, I feel somewhat less urgency about demonstrating something empirically. Specifically, I think the utility of views comes from expressiveness. That is, a language without them is so restrictive that it's not even possible to express the iteration patterns needed to implement important algorithms. The top-priority thing we need to demonstrate is that views are actually useful, i.e., that they combine to cover lots of iteration patterns you actually want to express. This is most naturally done through benchmarks, where we still have work to do to actually use views. I think the paper would be best if it included some qualitative discussion (i.e., examples) of how views were useful for expressing things that otherwise require difficult-to-analyze index math.
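
For a concrete (made-up) instance of the index math in question, here is a blocked traversal written directly against a flat array; the `b * B + i` expression is exactly what a typechecker or HLS scheduler has to analyze before it can tell that parallel accesses land in distinct banks:

```cpp
// Hypothetical blocked traversal: without a view, the block offset and
// the in-block index are fused into one arithmetic expression, so any
// analysis of parallel access safety has to reason about b * B + i.
const int B = 8;

void block_sum(int a[64], int sums[8]) {
  for (int b = 0; b < 8; b++) {
    int acc = 0;
    for (int i = 0; i < B; i++) {
      acc += a[b * B + i];  // the index math a view would abstract away
    }
    sums[b] = acc;
  }
}
```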

Of course, it would also be nice to demonstrate that our view primitives have efficient implementations. This would be nice to do with microbenchmarking, etc., but the reason I think it's less urgent is that we can give "existence proofs" of the index arithmetic that is necessary to implement them. The point is that, regardless of whether you use views or do the arithmetic yourself, you're going to have to use index arithmetic to get the effect you want. All views do is let the typechecker correctly analyze that index math.

I still think the experiments in https://github.com/cucapra/fuse-benchmarks/pull/93 will be nice to have, but what they amount to measuring is a comparison between the HLS tool's built-in index math vs. what it generates when that index math is explicit, for a particular implementation. Writing RTL ourselves would also be awesome, but it's less meaningful without the context of RTL for an entire application—it kind of lacks a baseline for comparison.


I've been meaning to make an issue about this, but this is as good a place as any. I think there are some other "microbenchmark" experiments we should run on our HLS tool:

rachitnigam commented 5 years ago

A different views experiment that might be useful for demonstrating their expressiveness is writing down all the different parallel patterns for blocks of size up to 3x3 and seeing how many of them we can parallelize easily.
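
As one illustrative point in that space (hypothetical HLS-style C++, not one of our benchmarks), reading a full 3x3 block in parallel only works when both dimensions are completely partitioned:

```cpp
// Hypothetical 3x3 pattern: reading all nine elements in one cycle
// requires complete partitioning of both dimensions; with coarser
// banking, the fully unrolled reads get serialized.
void window_sum(int block[3][3], int *out) {
#pragma HLS ARRAY_PARTITION variable=block complete dim=0
  int acc = 0;
  for (int r = 0; r < 3; r++) {
#pragma HLS UNROLL
    for (int c = 0; c < 3; c++) {
#pragma HLS UNROLL
      acc += block[r][c];
    }
  }
  *out = acc;
}
```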

rachitnigam commented 5 years ago

Another suggestion from @sampsyo was to figure out how complicated the indexing expressions need to be before the generated design becomes terrible.
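
One possible shape for that microbenchmark, with made-up kernels and index expressions, is a family of loops that differ only in how they index the array:

```cpp
// Hypothetical family of kernels with increasingly complex indexing.
// Synthesizing each one and comparing latency/II and resources would
// show roughly where the scheduler stops producing good hardware.
void idx_affine(int a[256], int out[128]) {
  for (int i = 0; i < 128; i++)
    out[i] = a[2 * i + 1];        // affine index
}

void idx_modular(int a[256], int out[128]) {
  for (int i = 0; i < 128; i++)
    out[i] = a[(i * 7) % 256];    // modular index
}

void idx_indirect(int a[256], int idx[128], int out[128]) {
  for (int i = 0; i < 128; i++)
    out[i] = a[idx[i]];           // data-dependent index
}
```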

rachitnigam commented 4 years ago

Made case studies for the paper.