fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

Performance variation with input configs #31

Closed zhenbinwu closed 6 years ago

zhenbinwu commented 6 years ago

Based on the current 1-layer NN, I scanned the reuse factor and precision. The performance is plotted in https://www.dropbox.com/s/jmn4b4dwt5vjtuk/Zhenbin_HLSScan.pdf?dl=0

nhanvtran commented 6 years ago

@ejk43 we were looking at the plots on slide 3 (left) and noticed that for reuse > 1, sometimes higher precision actually means fewer DSPs! It turns out there's also a little rise in either FFs or LUTs, but it's not big. We were wondering if you had some insight here.

ejk43 commented 6 years ago

Wow! Fascinating results. Must have missed this earlier. I would expect integer widths of 18, 24, and 36 to be sweet spots, but I would not expect a 36-bit width to use fewer resources than 32 bits, for example.

A few thoughts here...

  1. The DSP48 width on the Xilinx parts is 18 x 24 -- meaning that a multiply of 18 bits x 24 bits would require ONE DSP, while a multiply of 19 bits x 24 bits would require TWO DSPs, and so on.
  2. Also, as currently implemented, the HLS ALLOCATION multiplier limit assumes one DSP is used per multiply -- but this is not the case for bit widths above 18, where each multiply requires two DSPs.

@nhanvtran On further consideration, item 2 might be a breaking issue for higher precisions, say 24 x 24 or 32 x 32, especially when there's a small percentage of weights that can be "optimized out" by the HLS compiler. Even for reuse_factor=1, a bit-width precision of 32 x 32 might require II > 1 because the logic requires too many DSPs. Personally, I'd recommend keeping bit widths under 18 in general :) BUT, for higher bit widths, we should probably recognize that we need to double the multiplier_limit. Does this make sense?? That might be throwing off your IIs for larger networks if you're simulating with higher precisions...
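To put item 2 in code, here is a back-of-the-envelope sketch of the scaled limit (not the actual hls4ml implementation; the flat 18-bit cutoff and the helper name are my simplifications):

```cpp
// Hypothetical helper, not from the repo: scale the ALLOCATION limit when one
// logical multiply maps to more than one DSP48 because of operand width.
int multiplier_limit(int n_in, int n_out, int reuse_factor, int bit_width) {
    // theoretical number of parallel multipliers at this reuse factor
    int n_mult = (n_in * n_out + reuse_factor - 1) / reuse_factor; // ceil
    // simplification: operands up to 18 bits fit in one DSP48;
    // wider operands are split across additional DSPs
    int dsp_per_mult = (bit_width + 17) / 18;
    return n_mult * dsp_per_mult;
}
```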

@zhenbinwu cool work! It might also be interesting to show a performance metric like "operations per second" to demonstrate the efficacy of the HLS result... Check out this paper here: https://arxiv.org/pdf/1708.02579v1.pdf which quotes max performance of the Zync 7045 as 128 G-op/s, and then shows specs for various algorithms (this is also really interesting in general to see what a good tradeoff of latency vs resources looks like in a middle-spec FPGA). For a ballpark "op/s" number on the HLS output, you'd want to plot M*200e6/II, where M = number of DSPs (slide 3) and II is the interval (slide 5) -- which could show how efficiently different parameter sets end up implemented.
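In code form, the ballpark would be something like this (a sketch; the function name is mine, and 200 MHz is the clock used in these scans):

```cpp
// Ballpark throughput from the HLS report numbers: M DSPs, each firing once
// per II clock cycles at f_clk, following M * f_clk / II.
double gop_per_s(int n_dsp, double f_clk_hz, int ii) {
    return n_dsp * f_clk_hz / ii / 1e9; // Gop/s
}
// e.g. gop_per_s(3600, 200e6, 1) == 720.0 for a fully used 3600-DSP part
```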

nhanvtran commented 6 years ago

@ejk43, I was thinking about this too, but are we sure that setting a limit on mul really limits the usage to exactly that many DSPs? Or does it limit that many "multiplications"? The reason I ask is that even for 1 layer with resource reuse of 1, if mul = "DSP" we couldn't get a plot like this (same as @zhenbinwu's stuff, just simplified).

Then, looking at the slides further, for resource reuse = 1, the II is also always 1.

Untitled.pdf

ejk43 commented 6 years ago

True, true that's a good point. I suppose the mul limit is probably not doing what I expected after all :) need to review the manual on that one...

Do you happen to know what network size was tested here? I don't think I could find that in the slides. That would be a good data point to have anyway so we can understand how many DSPs are instantiated vs the "theoretical" number of multiplies required by a certain network architecture.

nhanvtran commented 6 years ago

@zhenbinwu ran these scans with the 1-layer model (so "well-controlled").

Well, it's roughly doing what we want, in that this translates to the II. So that's why I assumed that it was limiting the "multiplications" -- where multiplications != DSPs. But yeah, we need to look at the manual deeper and see what the other options are too...

ejk43 commented 6 years ago

Got it -- I found this blurb in the "allocation" description (https://www.xilinx.com/html_docs/xilinx2017_4/sdsoc_doc/zof1504034359187.html):

core: Specifies that the ALLOCATION applies to the cores, which are the specific hardware components used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM). The actual core to use is specified in the instances= option. In the case of cores, you can specify which the tool should use, or you can define a limit for the specified core.

So, I think if we're NOT using the core modifier, then the allocation directive applies to the multiplications, not the number of hardware DSPs -- which I think is the behavior we want after all.
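For reference, a minimal sketch of the two flavors (Vivado HLS pragma syntax; the function, the limit value, and the commented core name are just illustrative):

```cpp
// Sketch: same dot product, two ways of constraining multiplier allocation.
void dot16(const short x[16], const short w[16], int &res) {
    // no "core" modifier: the limit applies to mul *operations* scheduled by HLS
    #pragma HLS ALLOCATION instances=mul limit=4 operation
    int acc = 0;
    for (int i = 0; i < 16; i++) {
        acc += x[i] * w[i]; // at most 4 multiplications may run in parallel
    }
    res = acc;
    // with the "core" modifier the limit would instead cap a specific
    // hardware core, e.g. (illustrative core name):
    // #pragma HLS ALLOCATION instances=DSP48 limit=4 core
}
```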

nhanvtran commented 6 years ago

Great! yep luckily this is what we want

jngadiub commented 6 years ago

Adding some more studies to this thread. I scanned performance and resources for the 3-layer model:

https://cernbox.cern.ch/index.php/s/vNs4fBvksUvx37s

I only have the scan for reuse factor = 1 for the moment. I first wanted to be sure that things do not crash for other reasons before scanning the others.

A few preliminary comments:

A few ideas/questions for improving the scan:

jngadiub commented 6 years ago

Updated the scan of the 3-layer model for reuse factors from 1 to 7:

https://cernbox.cern.ch/index.php/s/jj6JMvHvkR7Fy6B

The performance does not seem to depend on the reuse factor. And the resource usage versus reuse factor and precision seems to be compatible with the scan for the 1-layer model from @zhenbinwu, although the overall scale is larger as the model is more complex.

The crash for the <40,4> precision was due to space issues. I optimized the scan to avoid this problem. I am preparing a PR for the analysis-tools.

I am currently working on a scan of the integer bits.

nhanvtran commented 6 years ago

@jngadiub great! very interesting.

A few comments:

pierinim commented 6 years ago

Concerning the last point, do you guys understand why? The same is true for <32,4> vs <36,4> when one looks at reuse factor > 1. Since I am pretty new to all this, a poor man's explanation (if any) would be helpful.

benjaminkreis commented 6 years ago

I just dug into the DSP usage a tiny bit and might have the beginning of an explanation.

Some things to know:

1. If the weights weren't fixed, HLS would probably use many instances of one multiplier that can multiply any two numbers at the precision we tell it. However, in our case the weights are fixed, so when we multiply w*d, HLS can optimize, using fewer bits for w or the output depending on the value of w. When we use a reuse factor of N>1, each multiplier can still be optimized better than fully generic, for N different weights, if each multiplier is truly used N times.
2. For each reuse factor, we limit the number of multipliers to the theoretical maximum without playing tricks from knowing the weight values. E.g. we would count a weight of 0.5 as needing a multiplier, but HLS knows it can do this with a bit shift (see the toy sketch below). So HLS tends to use fewer than the max we tell it, but the number of tricks it can play decreases as we increase the reuse factor, because each multiplier needs to work for more weights.
3. One multiplier can use multiple DSPs. It needs more as you increase the number of bits of the numbers being multiplied.
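As a toy illustration of the kind of trick in point 2 (my own example, not code from the repo):

```cpp
#include <ap_fixed.h>

// A fixed weight of 0.5 needs no DSP at all: in fixed point, multiplying
// by 0.5 is just a right shift, which synthesizes to wiring/LUTs.
ap_fixed<16,4> times_half(ap_fixed<16,4> x) {
    return x >> 1; // same value as x * 0.5
}
```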

Observations (using the 1-layer example, reuse=2):

- <24,4> uses a mix of 1-DSP and 2-DSP multipliers
- <28,4> uses only 2-DSP multipliers
- <32,4> uses a mix of 2-DSP and 4-DSP multipliers
- <36,4> uses mostly 4-DSP multipliers, and only a few 3-DSP multipliers

So I'm thinking that perhaps in the cases of <24,4> and <32,4>, HLS is deciding to use the smaller multipliers when it can (1- and 2-DSP multipliers, respectively), finds it can't reuse them, but that's okay because it's still under the max we set. In the cases of <28,4> and <36,4>, perhaps it already has to use bigger ones, and they are big enough to be reused for other weights? I'm not really sure about this yet, and it depends a lot on the HLS algorithm, but at least take the observation above -- that the DSP count drops at the boundaries of DSPs per multiplication -- as a clue.

I think there are some options in HLS for prioritizing different resources. Should look into that.

jngadiub commented 6 years ago

@nhanvtran answers to your comments:

1) The inputs are standardized.

Here is the distribution of weights/biases

weights

and here the distribution of the inputs

inputs

As you said, most of the inputs are in the [-2,2] range. The weights and biases also seem to be contained in a small [-1,1] range. However, I would guess that at some point you might leave the 3-bit range by summing and multiplying even such small numbers. One would have to look at the output of each layer to be sure about it; I am on it. And I think that a scan of the integer bits would also help to clarify this point. In the meantime, you can see the distribution of the softmax outputs for two fixed-point precisions and for the predicted output from keras.

precision1d

2) I am also surprised. I would expect something flat above <12,4>, as for the first and second plots from the bottom left in slide 3. I am currently using a test sample of about 200K events shared among 5 possible classes, so I would say that the size of the data sample is reasonable.

As a first step to understand if it's a rounding issue, I naively made this plot of the difference between the <12,4> and <16,4> outputs divided by the <12,4> output, as a function of the prediction from keras:

relative_difference

The relative difference seems to accumulate around 2 for the whole predicted range. I plot the events giving that behaviour here

debug_relative_difference

It looks like for a significant number of events <12,4> gives a positive value while <16,4> gives the same number but negative. The fact that the <16,4> values are negative while the <12,4> values are positive as expected (and hence closer to the expected output) makes the AUC better for <12,4> wrt <16,4>. I'll look into the HLS implementation to figure out why/when this sign flip actually happens.

zhenbinwu commented 6 years ago

@ejk43 To follow up from the last discussion, I plotted the Gop/s using DSP*2E8/II as you suggested for the 1-layer NN as below: g-ops

Based on the definition of computational efficiency from https://arxiv.org/pdf/1708.02579v1.pdf (the ratio of measured performance to peak performance), I further plotted the efficiency as below: efficiency

jngadiub commented 6 years ago

More debugging of the difference between the outputs with <12,4> and <16,4> precisions.

It seems that with high enough precision (such as <16,4>) there are more chances for the sum at the numerator of the softmax to become larger than the allowed [-8,8] range given by the fixed 4 bits for the integer part. Hence the sign flip.
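Here is a minimal standalone reproduction of that wraparound (toy numbers of my own choosing):

```cpp
#include <ap_fixed.h>
#include <iostream>

int main() {
    // ap_fixed<16,4> covers [-8, 8) and wraps on overflow by default (AP_WRAP)
    ap_fixed<16,4> sum = 0;
    for (int i = 0; i < 5; i++)
        sum += ap_fixed<16,4>(1.9);        // true sum would be ~9.5
    std::cout << sum.to_double() << "\n";  // prints ~-6.5: wrapped, sign flipped
    return 0;
}
```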

To overcome this problem one could define the exp LUT, and hence the numerator and denominator of the softmax, as floating point and then cast the division to the preferred fixed-point precision. However, this affects the latency, which increases from ~60 clock cycles to ~80 for the <16,4> precision, for instance. So I am not sure we want to go in this direction. The other resources do not change much. In case you want to have a look, I added the modified softmax here:

https://github.com/jngadiub/HLS4ML/blob/master/nnet_utils/nnet_activation.h#L189-L239

I would rerun the scan with this change so that we can compare.

benjaminkreis commented 6 years ago

Interesting, @jngadiub! Softmax is just horrible for FPGAs! The exponential creates big numbers, and the division takes forever. Does anyone know of other cost functions + activations that can be used in multiclass classification where the output can still be interpreted as a probability?

I like the float idea for softmax. It's already super slow, so what's an extra 20 clocks... One question, though -- how many integer bits would you need to avoid the problem with fixed point? I wonder if we'd save some latency by keeping exp_res_sum as fixed point but changing just this one variable to <32,16> or something.
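Something like the following sketch, where only the accumulator type changes (the type names are illustrative, not the actual ones in nnet_activation.h):

```cpp
#include <ap_fixed.h>

typedef ap_fixed<16,4>  table_t; // exp LUT output precision (illustrative)
typedef ap_fixed<32,16> sum_t;   // widened accumulator: 16 integer bits

// Accumulate the exponentials in a wider type so the sum cannot wrap,
// leaving the rest of the softmax at the default precision.
template<int N_OUT>
sum_t exp_res_sum(const table_t exp_res[N_OUT]) {
    sum_t sum = 0;
    for (int i = 0; i < N_OUT; i++)
        sum += exp_res[i];
    return sum;
}
```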

Total sidenote: the number of entries in the LUT is configurable with N_TABLE, though I think we've only set it to 1024 so far. It would be interesting to check how the performance depends on this choice, because we can lose precision here.

nhanvtran commented 6 years ago

@jngadiub @benjaminkreis thanks for digging in here! Softmax really is the worst. That may be one of the lessons we are learning here and could be something to point out simply in the paper.

I tend to agree that at this point, what's an extra 20 clocks...the lesson is that you lose 0.25 microseconds if you want to use softmax...

@jngadiub By the way, for testing the performance loss from <X,4> to <X,8>, I agree that plotting all the internal values should do the trick. It might be another lesson learned for setting the default precision. I wonder if it's a bottleneck at internal layers or just another result of softmax.

nhanvtran commented 6 years ago

@zhenbinwu cool plots! it would be nice to keep plots like this in the scripts -- particularly when showing to groups outside of physics.

this is certainly another way to conveniently wrap up all the results into one "number" instead of DSPs,FFs,LUTs separately.

@zhenbinwu how do you define peak performance?

pierinim commented 6 years ago

About the softmax point raised by @benjaminkreis: I don't think there is a real alternative. On the other hand, one could try to play a trick and train a regression to go from the plain output to the softmax output under a specific hypothesis. One would cut on the output of the regression and avoid the softmax. Pros: we skip the softmax issue. Cons: we need N of these small regression networks for the N classes we consider, each starting from the N neurons of the classification before the softmax. Not sure if resource-wise we would gain, but we could give it a try. Also, the probabilistic meaning will be preserved "on average", but the outputs of the N regressions might not necessarily sum up to one. Depending on the tagger performance (and online-offline performance differences) this additional step might cost an additional resolution term on the turn-on.

zhenbinwu commented 6 years ago

@nhanvtran Sure, the script is updated on the analysis-tools

For peak performance, I consider the max number of DSPs on this FPGA (3600), with an interval of 1. Thus the peak performance is 3600 ops * 2E8 Hz / 1 = 720 Gop/s.
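In code form (a sketch using the numbers above; the function name is mine):

```cpp
// Computational efficiency per the definition above: achieved over peak,
// where peak assumes all 3600 DSPs fire every 200 MHz cycle (II = 1).
double efficiency(int n_dsp_used, int ii) {
    const double peak_gops = 3600 * 200e6 / 1e9;            // 720 Gop/s
    double achieved_gops   = n_dsp_used * 200e6 / ii / 1e9; // from HLS report
    return achieved_gops / peak_gops;
}
```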

nhanvtran commented 6 years ago

Ok, yeah, this may be a more natural way, then, to report resource usage for people outside of physics. In the end it's a way to quote DSP usage agnostically for any FPGA, which is good.

jngadiub commented 6 years ago

I repeated the scan for reuse factor = 1 and compared performance and resources for the floating-point versus fixed-point softmax, where the fixed-point precision is the same as for the rest:

https://cernbox.cern.ch/index.php/s/BqzaMJWRgNbdZ0P

As expected, and as we already discussed, the performance is partially recovered while the latency becomes longer, although I don't quite get why at some point this trend is inverted. In any case, there should be some kind of trade-off between the precision of the softmax and the precision of the rest. I could play around a bit thanks to the latest commit from @jmduarte.

@nhanvtran At this point I think that the residual loss in performance is due to the fact that we are limiting the softmax between -8 and 8 and the integer part in the scan is at the same time fixed to 4 bits (so [-8,8] numbers in input to the softmax). One should probably scan the integer bits together with the softmax "amplitude".

nhanvtran commented 6 years ago

thanks @jngadiub

so one thing i don't understand is the interplay between the default precision and the softmax.

For <18,8> you have no loss in performance while for <x,4> you have a loss in performance which is partially recovered by floating point softmax. What if you scanned in <x,8> or <x,6>?

nhanvtran commented 6 years ago

@jngadiub @zhenbinwu

I just added a PR #45 that should fix the II variations that you see in the resource reuse scans. Assuming it passes testing, let's try this for the latest scan?

zhenbinwu commented 6 years ago

The DSP usage seems to depend on the version of Vivado HLS.

The previous plots were generated with Vivado HLS v2016.4, as below: dsp48e_v2016

Once I switched to Vivado HLS v2017.2, I got the following: dsp48e_v2017

Another study, of FPGA dependence, shows consistency: I got similar plots for xc7vx690tffg1927-2 and xcku115-flvf1924-2-i. Of course, this is a simple 1-layer NN; the difference might only show up for a more complex model.

From Xilinx, the latest version of Vivado HLS is v2017.4.1. We should consider switching to this latest version.

nhanvtran commented 6 years ago

whoa @zhenbinwu -- very awesome! this is a relief that we won't have to explain this -- maybe it was an interesting optimization in just 2016.4

are you running on rulinux04? let me try to get that installed

nhanvtran commented 6 years ago

by the way i want to mention how beautiful this plot is! it's kind of what we were thinking at the very beginning, that you would have these plateaus at different precisions. before we had more of a monotonic rise, but these are real plateaus. it would be interesting to make a log(y) version of this plot...to see what's happening below <12,4>

violatingcp commented 6 years ago

Do you have the plot where you multiply by reuse factor? It would be cool to see the scaled overhead.


benjaminkreis commented 6 years ago

This is interesting even for reuse=1. In 2016.4, we saw that the number of DSPs per multiplier varied for a given precision choice. I believe this is because HLS would reduce the number of DSPs per multiplier to what it could, given the weight values at the precision choice. So, for example, with <28,4> we got a mix of 2-DSP and 4-DSP multipliers.

Now it looks like HLS isn't doing this anymore. I see the number of bits of the weight ports still vary (maybe by less??), but the number of DSPs per multiplier always stays the same for a given precision choice (i.e. regardless of the weight value). So we always see:

- 2 DSPs per multiplier for <24,4>
- 3 DSPs per multiplier for <26,4>
- 4 DSPs per multiplier for <28,4>

Then for reuse>1, it makes sense that you don't have spikes, because you don't have the mix of DSPs per multiplier anymore. (To be sure, we'd have to look at the number of bits of the weight ports, more VHDL, etc. to confirm it's reusing every multiplier rather than sometimes creating a new one, but probably not worth digging deeper.)

I wonder if this version is better at aligning zero weights, improving the compression.

benjaminkreis commented 6 years ago

P.S. This leads to higher DSP usage in some cases. For example, for reuse=1 and <24,4>, it goes from 641 DSPs to 704 DSPs. For <28,4>, it almost doubles, going from 734 to 1408 DSPs.

In the first case, it's the mix of 1 and 2 DSP multipliers changing to all 2 DSP multipliers. In the second case, it looks like HLS decided to do <28,4> with all 4 DSP multipliers instead of mostly 2. Interesting!

zhenbinwu commented 6 years ago

@nhanvtran Yes, I think the 2017 plot is easier to understand.

@violatingcp @benjaminkreis The DSP*Reuse plots are interesting. dsp reuse_v2016 dsp reuse_v2017

I had been expecting the DSP*Reuse product to look like the 2017 version, but the 2016 plot is quite different.

ejk43 commented 6 years ago

Wow, fascinating results! I would never have predicted this behavior. I would have to believe this is a deliberate change... Maybe it adds better stability overall, even if it doesn't squeeze out all possible optimizations at certain bit-widths?

Guess I ought to use 2017+ :)

benjaminkreis commented 6 years ago

@jngadiub and I were just chatting a bit more about the precision scans. Right now we are scanning so high in precision that the precision of the python write function matters. For example, the first weight of the 1-hidden-layer model is written to w1.h as 0.234676495194, but if I crank up the precision of the write (as in this branch) we get 0.23467649519443511962890625000000.

We see nominal performance above ~<12,4>. So that would be an argument for just not scanning the performance so high in precision.

For DSP usage, it's not at all clear to me how this fits in. The truncation in the write could probably lead to a bigger or smaller number of bits in binary; the translation is not so straightforward, e.g. 0.1 in binary is 0.0001100110011... (0011 repeated forever).
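A quick toy check of that (my own snippet, not from the scan scripts):

```cpp
#include <ap_fixed.h>
#include <cstdio>

int main() {
    // 0.1 has no finite binary expansion, so each width stores a different
    // truncation of 0.000110011...
    printf("<12,4>: %.12f\n", ap_fixed<12,4>(0.1).to_double());
    printf("<16,4>: %.12f\n", ap_fixed<16,4>(0.1).to_double());
    printf("<32,4>: %.12f\n", ap_fixed<32,4>(0.1).to_double());
    return 0;
}
```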

nhanvtran commented 6 years ago

good point -- should we just stop at 32 bits? is that what you're implying?

greater than 32 bits isn't necessarily even reasonable from the offline training point of view

benjaminkreis commented 6 years ago

not sure exactly where to stop. i guess 32 bits (or 64?) is used in keras, but that's floating point, not fixed point.

pierinim commented 6 years ago

What was used for training? Most likely a single-precision GPU. So I don't think that going to 64 makes sense.

benjaminkreis commented 6 years ago

Hi @ejk43, following up on the op/s formula you mentioned above (" For a ballpark "op/s" number on the HLS output, you'd want to plot M*200e6/II, where M = number of DSPs (slide 3) and II is the interval (slide 5)") --

If I imagine 2 inputs being multiplied in parallel for a precision that uses 1 DSP per multiplication, ops/s = 2 DSPs * 200e6 / 1 = 400e6. And if we did them serially, it's 1 DSP * 200e6 / 2 = 100e6. That's a factor of 4 difference (or in general Reuse^2).

What's the intuition for what this is quantifying? If you asked me the number of multiplications finishing per second, I would say parallel = 2*200e6/1 and serial = 2*200e6/2, a factor of 2 difference (in general a factor of Reuse). Is it related to that?

Do you have a recommendation for how to report this to others? Can we neglect fixed point addition from what counts as an operation, or should we add a caveat about this?

ejk43 commented 6 years ago

Well gee, the original equation is definitely wrong and not what I had in mind... My thought actually should have been to evaluate M\*clk\*II, which demonstrates the number of raw DSP48 operations per second. The interesting takeaway here would be that the FPGA has a physical maximum number of DSP operations per second (# total DSPs * clk rate), and each implementation you're testing achieves some percentage of the theoretical max. This would roughly represent the device utilization. Though on further thought, even this equation does not quite represent what I'm looking for, because it's plausible that there's an II of 3, for example, but the DSPs might only be used during 2 clocks. So I'm not quite sure where that equation was going. Anyway, my goal would be to demonstrate patterns in whatever utilization HLS decides to allocate for various parameters, and how to achieve the highest utilization... But perhaps this is not very useful overall (or only interesting to me, haha). Hope my original flawed equation didn't cause too much heartburn!

A potentially more useful extension would be to represent the number of multiplies per second exactly as you have described, also taking into account the fixed-point bit widths (for example, a bit width of 24 would require 2x the DSPs but keep the same total multiplies per second)... I could be convinced otherwise, but I think the bit width is an important variable to keep in the analysis in order to demonstrate the precision vs throughput comparison (e.g. "you get X% precision improvement using 32 bits, at a cost of Y% throughput loss in total operations per second").

On a bigger picture, what sort of goals do you have in mind here overall?

zhenbinwu commented 6 years ago

We tried to understand the unit Gops, i.e. the definition of an operation. From the Intel white paper, an operation is defined as either a multiplication or an addition with floating-point precision. The white paper also points out that the peak floating-point computing capability heavily depends on the choice of FPGA family and the implementation. In our case, since the NN doesn't use the full DSP/LC of the FPGA, we don't get a great Gops number compared with other articles.

Another figure of merit we found interesting is the Gops/Watt. It has been shown as one important advantage of FPGA over GPU here.

screen shot 2018-04-05 at 10 00 54 pm

Xilinx has a paper on deep learning, which shows 180 Gops/Watt for the KU115. screen shot 2018-04-05 at 10 01 44 pm

We are still trying to figure out the best way to calculate Gops/Watt. Could we assume the number of DSPs is the number of floating-point additions and multiplications? How do we correctly factor the latency and initiation interval into the formula? Thanks!

nhanvtran commented 6 years ago

@ejk43 -- I think the bigger overall goal here is to put our results into metrics that others might find interesting for the paper. If the current plots are already sufficient for the CS/engineering field, that's also acceptable :).

Otherwise, if we want to have a little more useful metric -- we better make sure we are clear on the definition.

nhanvtran commented 6 years ago

this was a really nice thread, but i think it's time to retire it.