fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

RNN/LSTM in HLS Library #13

Closed ejk43 closed 3 years ago

ejk43 commented 6 years ago

Fill out the plausible types of network layers:

What else am I missing?

nhanvtran commented 6 years ago

This isn't necessarily a layer type, but one thing that we definitely want to have is a regression example. This is important for fitting/determining physics parameters, and there are already many use cases.

ejk43 commented 6 years ago

What do you mean by regression example?

I'm happy to start working on these new layer implementations. My typical process here is:

  1. Start with a tensorflow example, which I'd create and train
  2. Implement the functionality of the TF function in C++ code. Confirm equivalence using floating point types
  3. Edit typing and compiler directives to achieve a decent resource target in HLS

If you have concrete examples ready (for example, something like this article here: https://www.tensorflow.org/tutorials/recurrent), that could save some time on Step 1 and help us be clear about what exactly we'd like to implement. Usually I just try to implement the tensorflow function (matmul is a single fully-connected layer; I targeted tf.nn.conv1d for the 1-dimensional convolution; it looks like LSTM would be what we need for a recurrent NN). I'm not too familiar with Keras yet, but I'd be happy with examples there too. Thoughts?

By the way, the RNN may be tricky in HLS -- the compiler tends to choke on feedback like that. But we'll see... we should be able to make it work.

nhanvtran commented 6 years ago

Sorry, still learning :)... @jmduarte is traveling but usually sets us (me) straight on these things. So nothing new is needed for the regression (just the linear activation @benjaminkreis put in) -- we just want to add an example weight file?

In any case, on the examples, we plan to have a repo for setting up the training too, with common data files, etc. -- I put this under the team "Project" tab. @jmduarte is working on the Keras training, so that at least could be common (hopefully in the next few days), and including a TensorFlow implementation is in the Issues as well.

We'll try to get you data files ASAP, but starting with anything is totally fine. Starting to think about an RNN would be great! We still haven't ported over your CNN example either...

ejk43 commented 6 years ago

I'm learning too! I still don't know what exactly you mean by the regression...? Does this refer to a logistic regression?

Keras implementation would definitely be fine as a reference, if/when that's available.

We still haven't ported over your CNN example either...

Convolution was not fully fleshed out, anyway! I only implemented a small subset of possible 1D convolutions. This needs lots of work :)

nhanvtran commented 6 years ago

Yes, basically we were discussing something like this https://deeplearning4j.org/logistic-regression

Typically we have been working with examples where we want to classify or tag something as one thing or another. But another great use case would be to give some raw inputs and get out the momentum of a muon or the energy of a cluster or something like that.

So I was thinking of having an example like this -- though from the technical point of view, it looks like we have all the ingredients already; it's more in how you do the training.

On CNN: Ok, thanks for the clarification...both CNN and RNN/LSTM are on the list. People (with your help @ejk43) are definitely interested in helping with this

ejk43 commented 6 years ago

Yes, basically we were discussing something like this https://deeplearning4j.org/logistic-regression

Cool! On review -- isn't this just a single matrix multiply? I feel like we could do this logistic regression now by setting biases to 0 and not using an activation function. Is that correct, or am I missing something?
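For concreteness, the forward pass in question is just a matrix-vector product, optionally followed by a bias and a sigmoid. A minimal floating-point sketch (sizes and names are hypothetical, not hls4ml code); dropping the sigmoid and zeroing the bias reduces it to exactly the single matrix multiply mentioned above:

    #include <cmath>

    static const int N_IN  = 16;  // hypothetical input width
    static const int N_OUT = 4;   // hypothetical output width

    // Logistic regression forward pass: y = sigmoid(W*x + b)
    void logistic_regression(const float x[N_IN], const float W[N_OUT][N_IN],
                             const float b[N_OUT], float y[N_OUT]) {
        for (int i = 0; i < N_OUT; i++) {
            float acc = b[i];
            for (int j = 0; j < N_IN; j++)
                acc += W[i][j] * x[j];
            y[i] = 1.0f / (1.0f + std::exp(-acc));  // sigmoid activation
        }
    }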

ejk43 commented 6 years ago

On the topic of LSTMs:

I'm trying to get some very basic LSTM examples running, along with a functional implementation (equations, pseudocode, etc). Any good resource suggestions? I found a few online introductions to get up to speed: a nice 20-line Keras example with conv1d, maxpool, LSTM, and dense layers (https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/); some intuitive explanations here: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ ; and an arXiv paper with a condensed version of the LSTM equations, among other types of RNNs: https://arxiv.org/pdf/1506.02078.pdf

I'm thinking I'll start with the (relatively) low-hanging fruit of normal RNNs -- just a single tanh in the hidden layer! In the big picture, though, I'm still pretty concerned about the feedback cycle for the fully-parallel implementation. HLS is likely to put up a good fight trying to hit II=1, and depending on the amount of logic in the RNN/LSTM, timing and loop closure can make low IIs very difficult to achieve. In the extreme case, the RNN hidden-layer computation from timestep k might NOT be available at timestep k+1, but rather at timestep k+4 or so, due to pipeline delays through the logic. This is in fact a very hairy scenario to tackle with HLS, because it likes to synthesize exactly what the C++ code describes; so we may need to explicitly program this latency delay. I'll think about it, but in my recent experience (my coworker and I basically fought problems like this for a large part of November... HLS FML), it's a challenge.
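To make the feedback concern concrete: the vanilla RNN update is h[t] = tanh(Wx*x[t] + Wh*h[t-1] + b), and the h[t-1] term is exactly the loop-carried dependency that fights a low II. A floating-point sketch of one timestep (sizes and names hypothetical):

    #include <cmath>

    static const int N_IN  = 8;  // hypothetical input width
    static const int N_HID = 4;  // hypothetical hidden width

    // One timestep of a vanilla RNN: h_new = tanh(Wx*x + Wh*h_prev + b).
    // The read of h_prev (produced by the previous timestep) is the
    // feedback path that couples the latency to the initiation interval.
    void rnn_step(const float x[N_IN], const float h_prev[N_HID],
                  const float Wx[N_HID][N_IN], const float Wh[N_HID][N_HID],
                  const float b[N_HID], float h_new[N_HID]) {
        for (int i = 0; i < N_HID; i++) {
            float acc = b[i];
            for (int j = 0; j < N_IN; j++)  acc += Wx[i][j] * x[j];
            for (int j = 0; j < N_HID; j++) acc += Wh[i][j] * h_prev[j];
            h_new[i] = std::tanh(acc);
        }
    }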

violatingcp commented 6 years ago

Hi EJ

Thanks for the pointers. I am trying to go through some simple examples as well, and I haven't found anything better. I would really like to find a low-level RNN where they write the code from scratch; the LSTM in Keras is a bit cryptic. I am also a bit worried about the feedback: we might inevitably have to take a massive latency hit. I wonder if we can rework the computation to make it more Dense-layer-like. I know there are some equivalences out there.

Here is a paper I found on latency reduction in GPUs : https://arxiv.org/pdf/1706.00878.pdf

I haven't gone through it in detail, but I think this is a direction to get the latency down.

Phil


violatingcp commented 6 years ago

Hi EJ,

Here is a simple RNN example: https://github.com/pangolulu/rnn-from-scratch

I will have a look at it in the next few days, but this seems like a place to start.

Phil


ejk43 commented 6 years ago

Since we discussed RNNs and LSTMs last month, I've had some more experience with feedback in HLS... Overall, it's a tremendous pain. There are two big pain points:

  1. The feedback path is NOT handled well by HLS synthesis at all. The network latency becomes coupled with the initiation interval (because all computations need to finish before the next computation can begin).

    • It's very difficult to break this feedback path. I found some minor success by explicitly creating registers inside the HLS model to "latch" internal values and delay them until the next iteration (see the sketch after this list). If we were to apply this architecture to an LSTM, the equivalent behavior would be that the output of layer(n) does NOT feed layer(n+1); it would be applied to layer(n+delay).
  2. Often, an HLS model with feedback synthesizes correctly, and may even simulate correctly in an HDL testbench, but when implemented on an FPGA, the output results are incorrect.

    • This is pretty inexplicable behavior. I keep trying to think that it's simply user error, but my coworker and I ran into the same problems with two completely different HLS models.
    • Our solution was to keep everything else the same, but connect the feedback path outside the HLS model. This made the "delay" in point 1 equal to the latency, but HLS then synthesized well, met the target II, and ran correctly on hardware. We just had a larger feedback delay, which needs to be accounted for in the digital control loops.
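A minimal sketch of the "latch" trick from point 1, with the layer math reduced to a single accumulate (DELAY and the arithmetic are placeholders): each call consumes a feedback value that is DELAY calls old, so the loop-carried dependency no longer has to resolve within one initiation interval:

    static const int DELAY = 4;  // hypothetical pipeline depth to hide

    // Stand-in for one recurrent layer; the real version would do the
    // full matrix math. Feedback is read from the tail of an explicit
    // delay line, i.e. layer(n+DELAY) sees the output of layer(n).
    float delayed_layer(float in) {
        static float delay_line[DELAY] = {0};
        float feedback = delay_line[DELAY - 1];  // value from DELAY calls ago
        for (int i = DELAY - 1; i > 0; i--)
            delay_line[i] = delay_line[i - 1];   // shift the delay line
        float out = in + feedback;               // placeholder for layer math
        delay_line[0] = out;
        return out;
    }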

So perhaps two plausible approaches:

  1. If we want the lowest possible II, there's not really any option except to break the feedback path. Which means the LSTM network training would need to accommodate this increased computational delay between layers. Does anyone know of any progress in this area at all? I can't imagine it's supported in your off-the-shelf TensorFlow or Keras training models.
  2. Perhaps, if LSTMs are run only on the output of some downsampled network layer, the LSTM input may be able to seriously relax the II requirement? It's not an "optimal" solution, but it's a realistic way to get a first-cut HLS model working.

violatingcp commented 6 years ago

Hi EJ,

Thanks for your feedback. I completely agree that RNNs/LSTMs are a bit of a pain. That being said, it's probably worth giving it a stab. One more thing: this has been done on an FPGA, although very recently and by what appears to be a team of experts (including Song Han): https://arxiv.org/pdf/1612.00694.pdf

If you read the paper, they just shove the whole thing into the FPGA and do whatever speedup they can by parallelizing the matrix multiplication. Incidentally, GPUs will not help much with the parallelization either.

On Wed, Jan 3, 2018 at 9:22 AM, EJ Kreinar notifications@github.com wrote:

The feedback path is NOT handled well by HLS synthesis at all. The network latency becomes coupled with the initiation interval (because all computations need to finish before the next computation can begin).

I have been thinking along the same lines. Some component of the computation can still be done in parallel: as I understand it, the layer-to-layer propagation splits into applying the new inputs and merging in the passed values, so at least those parts can be parallelized. However, we will have to wait for the previous layer to complete before we move on, unless we do as you say and use a delayed architecture. Perhaps we make this an option, or we should at least study a way to unroll these separate components.

Often, an HLS model with feedback synthesizes correctly, and may even simulate correctly in an HDL testbench, but when implemented on an FPGA, the output results are incorrect.

This sounds like a bug in HLS more than anything, perhaps a deep one. Do you have a feeling for whether this will be fixed in the long run, or should we just always resort to this sort of behavior? Also, maybe I don't understand exactly what you say, but is this just a feature of pipelining in for loops? I mean, what if we just keep throwing the updated values at new resources?

If we want the lowest possible II, there's not really any option except to break the feedback path. Which means the LSTM network training would need to accommodate this increased computational delay between layers. Does anyone know of any progress in this area at all?

I am pretty sure that we could add this into TensorFlow. I don't see why delayed inputs would be much cause for concern. In fact, I think I know how to do this (kind of painfully).

Perhaps, if LSTMs are run only on the output of some downsampled network layer, the LSTM input may be able to seriously relax the II requirement?

I think for now this is an option. It's not clear where we would want to use an LSTM. Also, my feeling is that if we present a real limitation of HLS (or the algorithm in general), it might spark more interest.


ejk43 commented 6 years ago

I think for now this is an option. It's not clear where we would want to use an LSTM. Also, my feeling is that if we present a real limitation of HLS (or the algorithm in general), it might spark more interest.

Agreed-- I think we should just give RNNs/CNNs a shot now and see where it starts.

@violatingcp Were you able to make any progress on the "RNN from scratch" investigation you linked earlier? I haven't had a chance to look at this at all yet.

This sounds like a bug in HLS more than anything, perhaps a deep one. Do you have a feeling for whether this will be fixed in the long run, or should we just always resort to this sort of behavior?

It's worth a shot to try feedback in HLS, but as far as I can tell it may be a long-running problem: it shows up from 2015.4 through 2017.3 (I haven't tested earlier than 2015.4 or later than 2017.3). We'll want to make sure to 1) run the self-checking HDL simulation in the HLS workflow and 2) create unit tests for fabric implementations. Item 1 is run through the HLS interface or command line after synthesis -- I've found it can highlight important problems with the synthesized output that do not always show up in simulation or synthesis.

Also, maybe I don't understand exactly what you say, but is this just a feature of pipelining in for loops? I mean, what if we just keep throwing the updated values at new resources?

Right, sure, if we wanted to create multiple instantiations of the LSTM, that would likely work fine with the HLS tools. If I understand what you're suggesting, the thought would be to simply duplicate the LSTM unit N times in fabric, so the output of LSTM 1 goes to LSTM 2, rather than back into LSTM 1. That's a nice brute-force approach to trade off resources vs throughput.

@nhanvtran On the topic of other layer types, I had a first cut at maxpool in my original repo. Are you interested in pulling maxpool into this repo? It's probably no longer compatible out of the box since you all fixed conv1d :), so it probably needs some updating. It might be some low-hanging fruit (knock on wood).

benjaminkreis commented 6 years ago

I'm trying to cook up a simple example of the data flow @violatingcp showed in his slides today.

Does this simple project capture it? https://github.com/benjaminkreis/TestHLS

Specifically: https://github.com/benjaminkreis/TestHLS/blob/master/src/simple_algo_flow.cpp#L29-L31

This has a non-negligible latency but an II of 1.

ejk43 commented 6 years ago

I feel like this is close, but I'm a little concerned about the fact that A1_out and B1_out go directly into the next node.

I'm not sure exactly why you see the II=1 here... maybe the multiplications are small? But, in essence, this code is telling HLS that the "node" operations MUST happen serially, i.e., node1 must run first, then node2 must run using the output of node1, and so on.

I've had some success with a model like this...

    static data_t A1_out = 0;  // data_t: whatever type the node uses
    static data_t B1_out = 0;
    static data_t C1_out = 0;
    node(C1, B1_out, C1_out);  // consumes B1_out from the PREVIOUS call
    node(B1, A1_out, B1_out);  // consumes A1_out from the PREVIOUS call
    node(A1, dummy, A1_out);   // head of the chain: no real feedback input

On the first call, A1_out/B1_out/C1_out are initialized to 0. On subsequent calls, the values remain from the previous iteration. This explicitly tells the HLS compiler that there is no dependency between calls to the operation, and in theory the node operations may be instantiated in parallel (in practice, there may be more hoops to jump through). I've successfully re-ordered operations this way to remove dependencies for multiplies/adds, but I can't remember if I've successfully parallelized entire functions this way. This may be the point where HLS starts to break down.

This static-variable paradigm is basically what I had in mind for a first pass at the RNN/LSTM; it lets you save state between calls to the function. It's a pretty common practice (see pp. 136-138, for example: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug902-vivado-high-level-synthesis.pdf)
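In that spirit, a hypothetical first-pass RNN cell might wrap the rnn_step function sketched earlier in this thread with a static hidden state, the same pattern as the A1_out/B1_out snippet above (names and sizes remain hypothetical):

    // Hypothetical wrapper: h_state persists across calls, so call k
    // computes from the hidden state written by call k-1.
    void rnn_cell(const float x[N_IN],
                  const float Wx[N_HID][N_IN], const float Wh[N_HID][N_HID],
                  const float b[N_HID], float h_out[N_HID]) {
        static float h_state[N_HID] = {0};  // zeroed on the first call only
        float h_next[N_HID];
        rnn_step(x, h_state, Wx, Wh, b, h_next);
        for (int i = 0; i < N_HID; i++) {
            h_state[i] = h_next[i];  // latch hidden state for the next call
            h_out[i]   = h_next[i];
        }
    }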

benjaminkreis commented 6 years ago

Thanks for the quick feedback! Indeed in my little test project, I intend for the nodes to be run serially. Basically, I'm trying to do @violatingcp's Slide 14 from today, where the latency is proportional to the number of inputs (so you've given up on fully parallel, but II=1 is preserved).

I'm still trying to wrap my head around your feedback approach, but I think I'm starting to get the general idea. Are you saying that HLS thinks it can run the three node calls in parallel because they look independent, but they are actually linked through the static variables? That's a pretty clever idea! Then I suppose you can't ignore what is happening inside the "node" function, because the details in there will determine whether or not a static variable is set by the other function in time. Is that right?

benjaminkreis commented 6 years ago

Regarding the II being 1, are there multiplications that can't be done with II=1? The IPs I've seen always had II=1 as an option, but maybe they were always small.

I believe that if the node function can have an II=1, then the whole "simple_algo_flow_hw" function can too. To do this, I assume HLS is putting B1 through a FIFO until A1_out is ready and C1 through a FIFO until B1_out is ready (*). I think this should be possible in general until the number/size of inputs and node latency get so big that you run out of resources to do the buffering.

(*) similarly on the output side, A1_out and B1_out have to be buffered until C1_out is ready
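One way to make that buffering explicit, instead of relying on the scheduler to infer it, would be a DATAFLOW region with hls::stream FIFOs carrying the node-to-node values. A hedged sketch (not the TestHLS code; names are made up):

    #include "hls_stream.h"

    typedef float data_t;  // float for simplicity; fixed-point in practice

    // One node of the chain: combine a fresh input with the value from
    // the previous node, then pass the result downstream.
    void node_fifo(data_t in, hls::stream<data_t> &prev, hls::stream<data_t> &next) {
        next.write(in * prev.read());  // stand-in for the real node math
    }

    void sink(hls::stream<data_t> &s, data_t &out) { out = s.read(); }

    // The FIFOs between stages are exactly the buffering described above:
    // B1 effectively waits until A1's result is available.
    void chain(data_t A1, data_t B1, data_t C1, data_t &out) {
    #pragma HLS DATAFLOW
        hls::stream<data_t> s0, s1, s2, s3;
        s0.write(data_t(1));  // seed for the head of the chain
        node_fifo(A1, s0, s1);
        node_fifo(B1, s1, s2);
        node_fifo(C1, s2, s3);
        sink(s3, out);
    }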

ejk43 commented 6 years ago

I believe that if the node function can have an II=1, then the whole "simple_algo_flow_hw" function can too. To do this, I assume HLS is putting B1 through a FIFO until A1_out is ready and C1 through a FIFO until B1_out is ready (*). I think this should be possible in general until the number/size of inputs and node latency get so big that you run out of resources to do the buffering.

Oh yeah-- I just looked at this again and I think your interpretation could be right. Very interesting idea! I'm curious if this would work with a more involved example-- the sort of architecture you set up here usually leads me towards a critical path that breaks the II=1 constraint, but it definitely makes sense that you could essentially adopt the II of the node function.

It seems like a similar idea of FIFO buffering would be what you're basically looking for regarding the "repeated" RNN -- maybe calling the node function inside a for-loop instead?

Regarding the II being 1, are there multiplications that can't be done with II=1? The IPs I've seen always had II=1 as an option, but maybe they were always small

You can definitely hit II=1 for multiplies. But if you put in a matrix multiply instead of a single multiply, there will be some additions that need more clocks and break the II of the node function. I wonder if HLS would be smart enough to duplicate more logic to re-achieve II=1? I'm guessing not, but I would be happy to be surprised here!
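For what it's worth, the standard way to force that duplication by hand is to fully unroll the accumulation inside a pipelined function, so HLS builds an adder tree instead of a serial chain. A sketch with hypothetical sizes and a fixed-point type (float accumulators are much harder to schedule at II=1):

    #include "ap_fixed.h"

    typedef ap_fixed<18, 8> data_t;  // hypothetical precision
    static const int N = 16;         // hypothetical vector length

    // Fully unrolled dot product: UNROLL duplicates the multipliers, and
    // complete partitioning of the inputs removes memory-port limits,
    // which together make the II=1 target achievable at a DSP cost.
    data_t dot(const data_t w[N], const data_t x[N]) {
    #pragma HLS PIPELINE II=1
    #pragma HLS ARRAY_PARTITION variable=w complete dim=1
    #pragma HLS ARRAY_PARTITION variable=x complete dim=1
        data_t acc = 0;
        for (int j = 0; j < N; j++) {
    #pragma HLS UNROLL
            acc += w[j] * x[j];
        }
        return acc;
    }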

violatingcp commented 6 years ago

Hi guys,

I started to play around with Ben's example to understand latency. I changed the very simple multiplication to vector dot products, and I passed information in a few different ways. I have a few observations:

  1. The latency scales linearly with the number of calls to node.
  2. The DSP usage scales linearly with the number of calls to node.
  3. The pipeline interval and the latency are worse if you use a 2-parameter array vs a 1-parameter array: https://github.com/violatingcp/TestHLS/blob/master/src/simple_algo_flow.cpp#L19 vs https://github.com/violatingcp/TestHLS/blob/master/src/simple_algo_flow.cpp#L56
  4. The vector multiplication is unrolled inside the functions (as EJ pointed out).
  5. If I pass a 2D array, I am limited to a minimum pipeline interval of 2, whereas with a 1D array I can pipeline with an interval of 1.
  6. The v1 and v2 functions behave exactly the same (DSP- and latency-wise): https://github.com/violatingcp/TestHLS/blob/master/src/simple_algo_flow.cpp#L95 vs https://github.com/violatingcp/TestHLS/blob/master/src/simple_algo_flow.cpp#L56

I think everything makes sense, and these are good things to know. I will start getting to the real RNN implementation now; the lowest-latency one should be straightforward. The one thing we might be able to do better is item 2 (the linear DSP scaling). Can we feed the same DSP two different numbers on different clocks? Is there a limitation to this?
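On feeding one DSP two numbers on different clocks: that is standard resource sharing, and in Vivado HLS you can cap the number of multiplier instances so the scheduler time-multiplexes them, trading II/latency for DSPs. A sketch reusing the data_t/N declarations from the dot-product sketch above (the limit value is hypothetical):

    // Cap the design at two multiplier instances; HLS then schedules the
    // N products onto them over several clocks instead of in parallel.
    data_t shared_dot(const data_t w[N], const data_t x[N]) {
    #pragma HLS ALLOCATION instances=mul limit=2 operation
        data_t acc = 0;
        for (int j = 0; j < N; j++)
            acc += w[j] * x[j];
        return acc;
    }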

Thanks, Phil


ejk43 commented 6 years ago

Phil, wow! Great info. I like that the for loop implementation in v2 is equivalent to v1.

Remember to use ARRAY_PARTITION with dim=0 if you want to "fully" partition an array across all dimensions. You might get better results for your item 5.
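For reference, the pragma in question (variable names hypothetical): dim=0 partitions across ALL dimensions, so a 2D weight array becomes individual registers rather than a block RAM whose two ports cap the pipeline interval at 2:

    static const int N_OUT = 4, N_IN = 8;  // hypothetical sizes

    void mat_vec(const float x[N_IN], const float w[N_OUT][N_IN], float y[N_OUT]) {
    #pragma HLS ARRAY_PARTITION variable=w complete dim=0
    #pragma HLS PIPELINE II=1
        for (int i = 0; i < N_OUT; i++) {
            float acc = 0;
            for (int j = 0; j < N_IN; j++)
                acc += w[i][j] * x[j];
            y[i] = acc;
        }
    }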

The fact that you're seeing linear resource usage with number of calls to the node function tells me that this approach is working as you and Ben intended. Nice! I like this a lot.

Towards the RNN effort, I put together a primitive RNN function this weekend (just a test of the floating-point algorithm implementation, using a static variable for state memory): https://github.com/hls-fpga-machine-learning/HLS4ML/pull/43

It's designed for a different use-case, where there's only one RNN cell and the memory is saved internally-- I don't believe this would operate like your "node" function yet, since I'm not exposing the state variable out to the calling function. Just wanted to point this out as a potential starting point or inspiration.

nhanvtran commented 6 years ago

renamed this issue to be more descriptive

jmduarte commented 3 years ago

Addressed by #329