calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

Issues with Converting Onnx models to Calyx Code #1157

Closed calebmkim closed 1 year ago

calebmkim commented 1 year ago

Over the course of trying to convert ONNX models to Calyx code, I've run into a few issues. Some of them are rather minor and I've tried to just work around them. Nonetheless, I want to record them all here in case people have any thoughts.

I can also discuss these problems at the Calyx meeting tomorrow, or another smaller synchronous meeting too.

  1. Cycle count limit for big Calyx files. I want to verify that the squeezenet.futil file is correct by executing it through Verilog and comparing the result to executing squeezenet through TVM. One problem is that the squeezenet file takes a lot of cycles to run (I set the limit to 1 billion cycles and it exceeded this). This is interesting because I seem to remember the lenet.futil file taking only 10 minutes or so to run through Verilog, whereas it took over 24 hours for the 1 billion cycle count to be reached. Maybe we should focus on DNNs that we know will produce smaller .futil files, as these will probably make resource estimation more manageable?

  2. TVM fromtext(). As mentioned in #1154, TVM fromtext() doesn't parse the file correctly when an argument %x is a tuple. Another oddity is that I recently tried to add support for leaky_relu(). Running fromtext() on a test case, the TVM parser gives the error: Exception: Operator nn.leaky_relu is not registered. It's attributes are {'alpha': TupleGetItemNode(Constant(0), 1)}. What's even weirder is that if I get rid of the optional argument, going from nn.leaky_relu(%x, alpha=0.1f) to nn.leaky_relu(%x), it works. This is really weird and makes me think it may not be a problem with fromtext() and maybe I'm not writing the syntax correctly? But I am basically just copying the format given by the output relay files that we already generate in frontends/relay/onnx_to_calyx.py.

  3. Dahlia uses fixed point but TVM uses floats. If I want to transform nn.leaky_relu(%x, alpha=0.1f) into Dahlia code, Dahlia wants alpha to be a fixed-point number, and if I try 0.1 as fix<32,16>, it tells me that 0.1 can't be represented as fix<32,16>. So I wrote a quick Python function that approximates 0.1 with a number that can be written as fix<32,16> (a sketch of the kind of rounding I mean is below). Is there something I'm missing about Dahlia, or is my Python function a fine workaround?
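
For reference, here is a minimal sketch of the kind of rounding my workaround does (the function name and structure are illustrative, not the exact code):

    # Round a float to the nearest value representable in Dahlia's fix<W, I>
    # (W total bits, I integer bits, so W - I fractional bits).
    def approx_fixed(value: float, total_bits: int = 32, int_bits: int = 16) -> float:
        frac_bits = total_bits - int_bits
        scale = 1 << frac_bits
        return round(value * scale) / scale

    print(approx_fixed(0.1))  # 0.100006103515625, i.e. 6554 / 2^16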

It seems to me that 1) is more concerning than 2), which is more concerning than 3).

Also, lmk if this is not the correct place to record things like these.

rachitnigam commented 1 year ago

For (1): Hm, I'm not convinced that the squeezenet.futil file should take a billion cycles to run. One sanity test to do is running the program through the interpreter. You can do this by building the interpreter and doing fud e ... -s verilog.data <data file> --to interpreter-out. I recommend building the interpreter in release mode using cargo build --all --release and changing your fud configuration to use the release-build interpreter: fud c stages.interpreter.exec ./target/release/interp.

My hope is that it catches some obvious problem with the code that's causing an infinite loop. I would maintain this as the current hypothesis unless we have some strong reason to believe that it's not.

For (3): Yes, this is intentional. Floating-point hardware is extremely expensive (resource-wise), so we use fixed-point hardware. We do need to approximate things like 0.1 using fixed point. In fact, we have a whole thing that generates a fixed-point approximation of e so that we can implement softmax.

For (2): I'm not entirely sure about this, but maybe @EclecticGriffin would have some insight? Otherwise we can dig through some code.

EclecticGriffin commented 1 year ago

I suspect it's probably just an error somewhere in the parser of this TVM version, though I can do some digging later if needed. I'm not sure how challenging it would be to port things over to the current release, but it might be worthwhile in the long run.

rachitnigam commented 1 year ago

Yeah, I think so too. @calebmkim is working on getting multiple kernels working for his resource sharing evaluation, and it'd be nice to have TVM as a first-class frontend for future evaluations, so it's worth investing time into it!

calebmkim commented 1 year ago

About 1), I agree, I hadn't thought about an infinite loop and that seems more likely. For running it through the interpreter, is your idea that the interpreter would give some sort of helpful error message that would let us know where an infinite loop was happening?

rachitnigam commented 1 year ago

If a loop specifies a @bound and violates it, the interpreter will warn you. Otherwise, you can use the debugger to see how far execution gets by setting breakpoints in the program.

calebmkim commented 1 year ago

So I get the following error in the interpreter when I run squeezenet:

Aug 29 18:20:27.983 WARN Computation over/underflow, source: main.bias_add_1x16x55x55_.add561
Aug 29 18:20:30.857 WARN Computation over/underflow, source: main.relu_1x16x55x55_.add147
Aug 29 18:20:30.857 WARN Computation over/underflow, source: main.relu_1x16x55x55_.add147
Aug 29 18:39:15.681 WARN Computation over/underflow, source: main.bias_add_1x64x55x55_.add595
Aug 29 18:39:15.681 WARN Computation over/underflow, source: main.bias_add_1x64x55x55_.add595
Aug 29 18:39:27.104 WARN Computation over/underflow, source: main.relu_1x64x55x55_.add478
Aug 29 18:39:27.104 WARN Computation over/underflow, source: main.relu_1x64x55x55_.add478
Error: invalid memory access to memory main.conv2d_1x64x55x55_1_.x60_0_0_0. Given index (0, 15, 54, 55) but memory has dimension (1, 16, 55, 55)

Interestingly, it seems like the warnings and the error both come from situations where we have a while loop with @bound(1). And from the error messages it seems like (just a hypothesis) the loops are executing more than once.

Why I think this: add478 is a one-bit adder, meant to increment exactly once at the end of a @bound(1) while loop (specifically, it computes reg.in = reg.out + 1 at the end of each iteration of the while loop, and reg.in = 0 is set before the while loop starts). I'm guessing the overflow comes from going through the while loop a second time and therefore incrementing again, causing a 1 + 1 in a one-bit adder. From what I can tell, a similar thing is happening with the memory index error.
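
To make this concrete, here is a toy model of that one-bit adder (just an illustration, not the actual Calyx semantics):

    # A 1-bit unsigned adder: returns the wrapped sum and whether it overflowed.
    def add1bit(a: int, b: int):
        raw = a + b
        return raw & 1, raw > 1

    print(add1bit(0, 1))  # (1, False): the intended single increment
    print(add1bit(1, 1))  # (0, True): a second trip through the loop would overflow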

Btw, I took a look through the Calyx code and tried to find something incorrect... I didn't find anything (but I definitely could have looked closer).

rachitnigam commented 1 year ago

I think your hypothesis is correct! The likely problem is that the Dahlia-generated code for those loops is not generating counters that are big enough.

I don't know what you mean by the last comment; the Calyx code is definitely wrong, and that's why the interpreter warns/errors out.

calebmkim commented 1 year ago

Sorry, I meant that I couldn't find exactly where the code was going wrong. It is wrong, I just can't pinpoint the exact place.

rachitnigam commented 1 year ago

Got it! I'd start by looking at all the groups that use the adders in question. What you're likely to find is some group that compares the adder's current value against a value the adder can never reach. We've previously encountered a similar problem, so maybe reading that issue would be helpful.
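
As a toy illustration of that failure mode (this is just Python for the shape of the bug, not the actual Calyx):

    # If the guard compares a 1-bit counter against a value it can never hold,
    # the loop spins forever (capped here so the example terminates).
    counter, bound, steps = 0, 2, 0
    while counter != bound and steps < 8:
        counter = (counter + 1) & 1   # 1-bit register wraps: 0 -> 1 -> 0 -> ...
        steps += 1
    print(counter, steps)             # never reaches 2; a real circuit would never exit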

cgyurgyik commented 1 year ago

> Dahlia uses fixed point but TVM uses floats.

This is definitely going to play a role in differences between the floating-point and fixed-point outcomes. As Rachit mentioned, we've run into this before. See the exp generator, with source code in calyx-py/calyx/gen_exp.py.

> Why I think this: add478 is a one-bit adder, meant to increment exactly once at the end of a @bound(1) while loop (specifically, it computes reg.in = reg.out + 1 at the end of each iteration of the while loop, and reg.in = 0 is set before the while loop starts). I'm guessing the overflow comes from going through the while loop a second time and therefore incrementing again, causing a 1 + 1 in a one-bit adder. From what I can tell, a similar thing is happening with the memory index error.

One thing that may help to confirm this hypothesis is to add some assertions directly to the Verilog primitives in question, e.g.,

 assert (some_condition) else $error("my error message");

cgyurgyik commented 1 year ago

For (2), we have run into this before: https://github.com/cucapra/calyx/issues/401

calebmkim commented 1 year ago

Update on the squeezenet.futil file. I checked in the interpreter, and the while loop with @bound(1) actually does run only once. And the one-bit adder does what it's supposed to do during the one iteration of the while loop: it adds 0 + 1 = 1 into a register, which is then read in the while loop's comb group, and the loop terminates. Weirdly, it still gives an overflow/underflow warning, even though I verified in the interpreter that the numbers being fed into the adder did not result in overflow.

Update/Edit: So it turns out the generated Dahlia is actually incorrect for conv2d_1x64x55x55_1.

If you look at the following generated Dahlia code:

def conv2d_1x64x55x55_1(x6: fix<32, 16>[1][16][55][55], squeezenet0_conv3_weight: fix<32, 16>[64][16][3][3], x10: fix<32, 16>[1][64][55][55]) = {
  for (let __b: ubit<32> = 0..1) {
    for (let __c: ubit<32> = 0..64) {
      for (let __y: ubit<32> = 0..55) {
        for (let __x: ubit<32> = 0..55) {
          let __sum: fix<32, 16> = 0.0;

          for (let __k: ubit<32> = 0..16) {
            for (let __dy: ubit<32> = 0..3/*kernel_size[1]*/) {
              for (let __dx: ubit<32> = 0..3/*kernel_size[0]*/) {
                let __kernel_y: ubit<32> = (/*strides[0]*/1 * __y) + __dy;
                let __kernel_x: ubit<32> = (/*strides[1]*/1 * __x) + __dx;
              } combine {
                __sum += x6[__b][__k][__kernel_y][__kernel_x] *
                         squeezenet0_conv3_weight[__c][__k][__dy][__dx];
              }
            }
          }
          x10[__b][__c][__y][__x] := __sum;
        }
      }
    }
  }
}

You can see that __kernel_y reaches values up to (1 * __y) + __dy, which maxes out at 54 (__y's max value) + 2 (__dy's max value) = 56, and likewise for __kernel_x. However, x6's shape is 1x16x55x55, so its valid indices only go up to 54, meaning that __kernel_y and __kernel_x will be out of x6's range during certain iterations of the loop.
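
To double check the arithmetic (dimensions taken from the Dahlia code above):

    # Max index the convolution computes vs. the extent of x6's spatial dims.
    H = W = 55      # x6 is 1x16x55x55
    K = 3           # kernel_size
    stride = 1

    max_kernel_idx = stride * (H - 1) + (K - 1)
    print(max_kernel_idx, "vs. valid indices 0 ..", H - 1)   # 56 vs. 0..54
    # The loop reads past the end of x6 unless the input is padded first.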

Two things come to mind: 1) Could this (or things like this) have been the source of the 1 billion cycle problem? At first glance, it doesn't look like this code will necessarily generate an infinite loop, so I suspect there might be something else I haven't caught that could be responsible (once we fix this, we can run the interpreter again to see if it picks up on a different error).
2) I think the problem is that we don't apply padding to the input, even when the Relay IR says we should.

EclecticGriffin commented 1 year ago

Worth noting that the overflow warnings for adders are not always indicative of an error, since they are raised whenever a value is computed (in combinational time), which doesn't necessarily mean that the overflowed value is used anywhere.

sampsyo commented 1 year ago

Lack of padding seems like an extremely reasonable hypothesis here! Nice work tracking it down.

As a way to see if it's responsible for running a really long time, might I suggest isolating a single layer and trying to run it? That will require "faking" some inputs on which to run the layer, but just running it on all-zeroes would probably suffice, I hope.

calebmkim commented 1 year ago

Couple of comments/questions I have after playing with the interpreter:

  1. Does using a std_slice always result in a computation overflow/underflow warning in the interpreter, or does the interpreter know not to give a warning?

  2. I also have a comment on cycle count (which is becoming a concern, imo). For example, consider the following Dahlia code generated for conv2d:

    def conv2d_1x64x55x55_1(x6: fix<32, 16>[1][16][55][55], squeezenet0_conv3_weight: fix<32, 16>[64][16][3][3], x10: fix<32, 16>[1][64][55][55]) = {
      for (let __b: ubit<32> = 0..1) {
        for (let __c: ubit<32> = 0..64) {
          for (let __y: ubit<32> = 0..55) {
            for (let __x: ubit<32> = 0..55) {
              let __sum: fix<32, 16> = 0.0;
              for (let __k: ubit<32> = 0..16) {
                for (let __dy: ubit<32> = 0..3/*kernel_size[1]*/) {
                  for (let __dx: ubit<32> = 0..3/*kernel_size[0]*/) {
                    (* loop body *)
                  }
                }
              }
            }
          }
        }
      }
    }

    The number of times the loop body gets executed is already 64x55x55x16x3x3 ≈ 28 million, meaning that even if our code is correct, this component will take at least 28 million x (number of cycles of the loop body) cycles to execute. The reason conv2d takes so many cycles is inherent to the formula: we have to iterate over 3 different variables (k, dy, and dx) just to calculate the output value for a single memory index.

So I tried to run just this single conv2d operation in Calyx. On the bright side, it gives the correct output (I explained why I think so in #1154). On the other hand, it takes a lot of cycles: when I try it with a 1 billion cycle limit it times out, but when I try it with a 2^31-1 cycle limit (if I try to set it higher it gives me an error) it gives me an output I believe is correct (so it's not an infinite loop problem). What's even weirder is that when I set the cycle limit to 2^31-1, the json output says it only took about ~500 million cycles (536,870,910 to be exact).
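
For reference, a quick sanity check on these numbers, using the loop bounds from the Dahlia code above:

    # Innermost-loop executions for conv2d_1x64x55x55_1.
    body_execs = 64 * 55 * 55 * 16 * 3 * 3
    print(body_execs)                     # 27,878,400

    # Observed cycle count for this single layer.
    observed = 536_870_910
    print(observed / body_execs)          # ~19 cycles per loop-body execution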

All of this is fine in terms of timing, since 1 billion cycles times out in about 11 minutes. The only problem is that if I try to run the full squeezenet model through Calyx, I am certain it will take more than 2^31-1 cycles. I've thought of two options:

1) Test smaller chunks of the squeezenet generated model (i.e., compare the TVM execution to the fud Calyx file execution).

2) Only model neural networks that are small enough that they won't take a really long time to execute.

Maybe this is something we can discuss at the Calyx meeting this Tuesday.

rachitnigam commented 1 year ago

Awesome job on sleuthing around the squeezenet problems. I do agree that it is troubling to see 500 million cycles for just one layer. A couple of experiments to do:

  1. How many cycles does Lenet take to run? This gives us a lower bound to think about.
  2. What happens if we disable some of the passes you worked on during the summer? We know some of them increase the cycle count a bunch.
  3. Can we use the same analytical reasoning for each layer to get an upper bound for the number of cycles the squeezenet should take? Ideally, infer-static-timing would give us this number but it doesn’t.
  4. During the Calyx meeting, we can think of multiple ways to attack this problem. We can think about whether implementing tdst (#940) and making remove-comb-groups optional (#911) will improve the number of cycles.
  5. We can also look into what people do to make large Verilator designs scale up, for example by generating better Verilog so that Verilator can execute the program faster.

calebmkim commented 1 year ago

  1. Lenet takes about 78 million cycles to run. I also tried out how long a single conv2d in lenet takes, and it was about 10 million cycles (as opposed to 500 million for squeezenet). I think this is in large part due to the fact that lenet uses much smaller memories than squeezenet. (1x20x24x24 vs. 1x64x55x55).
  2. Disabling group2seq and group2invoke unfortunately doesn't really do much to decrease the cycle count, since I think most of the cycles come from the simple fact that there are so many iterations of each nested while loop.
  3. Squeezenet.futil has about 100 invoke statements (as opposed to about 18 for lenet). From what I can guess, I think the most expensive operations (in terms of cycle count) are conv2d and avg/max_pool2d, since these operations need multiple iterations of a nested while loop to calculate a single value for a memory location's output. I will test what I think will be a cheaper operation (perhaps bias_add) and update this comment when I get the cycle numbers. Update: I just ran bias_add on a 1x64x55x55 sized memory input (exactly the same sized memory input as the 500 million cycle conv2d) and it only took 1.9 million cycles. So conv2d is roughly 250 times slower -- I think this is because 1) conv2d iterates through multiple while loops just to get a single value for the output memory and 2) conv2d has 2 multiplication operations in its body (plus uses more registers) whereas bias_add does not.
  4. (this goes for 5 too) Sounds good. I think we should talk synchronously about our options here.
rachitnigam commented 1 year ago

Ah, got it. Cycle count is one part of the equation. One "easy" way to reduce cycle count is by unrolling loops in Dahlia (not really easy because you need to partition the memories too) but that increases the complexity of the circuit you need to simulate every cycle.

I think the situation is slightly worse than the sum of its parts because Verilator, unlike Cider, has to simulate the entire circuit, even though most of the invokes are not computing anything interesting at all. We need to figure out some way to deal with that problem as well.

One experiment that we should talk about is unrolling the conv2d example above and seeing if that helps with speeding up the wall clock time at all. If it does, we have at least one tool to help us move forward.

rachitnigam commented 1 year ago

This paper might also help us figure out how to generate faster-to-simulate Verilog!

rachitnigam commented 1 year ago

@calebmkim does moving to the new version of TVM + your previous fixes help resolve this issue?

calebmkim commented 1 year ago

So one thing I forgot to do when testing the updated TVM version was going from ONNX to Calyx. I think most of it should be fine; there is just a bit of code in the relay_visitor.py file that needs to change. I opened PR #1197, which I think will close this issue.