Closed bo3z closed 5 months ago
Pre-commit for this requires a bit more work, as some of the lines are too long and there are also complaints about constructor initialisers etc. I will get to it in the next few days. Otherwise, this is ready for review. However, it turns out this doesn't trigger the new PyTests for the Optimization API? Is there a script that needs to be modified to include the newly added files?
This is largely standalone, so I see few reasons to hold back on merging it. I see the following things left to do:
Why can't I merge main into this? I think it should solve the pytest problems.
I've merged main into this branch; however, I don't think it will solve the PyTest problems. There seem to be some issues with the versions of the Python packages being installed. All the tests pass locally, but I haven't been able to recreate the CI/CD flow locally to debug the cause further.
The failing tests have been resolved and this can now be merged. The issue was due to Keras Surgeon - it uses an ancient version of PyTest. As such, I have ignored the Keras Surgeon test and removed it as a hard dependency, but left very clear instructions for anyone wanting to use it on how to install it from GitHub. In any case, the current (patched) Keras Surgeon is part of the FastML organisation and, if it turns out there is interest in using it, it can later be fixed to resolve the dependency issues.
Alongside #809, both branches should be up-to-date with master. For compatibility and testing sake, there is a branch combining the two PRs into a single branch: https://github.com/fastmachinelearning/hls4ml/tree/hardware-aware-pruning
As a side note, the paper describing the pruning algorithm is on arXiv: https://arxiv.org/abs/2308.05170. However, this is the pre-print version. I will include a link to the IEEE proceedings from FPT (held next week), once available. We can then add the citation to the README.
I want to run the PyTests after the latest force-push but am having trouble triggering them.
I am back from vacation. Are there any reasons not to merge this PR?
This pull request introduces the first part of the hls4ml Optimization API - an automated workflow for hardware-aware model compression. By formulating pruning and weight sharing as a linear optimisation problem, the workflow iteratively selects redundant weights, considering the overall impact on hardware. The tool currently supports Keras and QKeras models, as well as various hardware objectives on GPUs (FLOPs) and FPGAs, with a Vivado hls4ml backend (DSP, BRAM, FF). However, the tool is both hardware- and framework-agnostic - most of the concepts readily generalise to other frameworks (e.g. PyTorch) and other hardware (e.g. Quartus backend). This allows end users to write custom objectives (e.g. Quartus latency optimisation), following a similar template.
Furthermore, this tool aims to bridge the gap between hls4ml and other libraries for model compression, such as TensorFlow Model Optimization and QKeras. The tool is directly integrated with QKeras and an updated version of Keras Surgeon to aid model compression. Finally, it provides out-of-the-box support for structured pruning (filters, neurons), as well as gradient-based ranking methods.
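One common gradient-based ranking criterion is first-order Taylor importance, which scores each weight by |w · ∂L/∂w|. The sketch below is an illustrative re-implementation of this general idea, not the PR's exact ranking code (the function name and toy values are assumptions):

```python
import numpy as np

def taylor_importance(weights, grads):
    """First-order Taylor importance |w * dL/dw| per weight.

    Illustrative sketch of one gradient-based ranking criterion;
    weights with the smallest scores are the best pruning candidates.
    """
    return np.abs(weights * grads)

# Toy example: 4 weights and their loss gradients.
w = np.array([0.5, -0.01, 0.2, -0.8])
g = np.array([0.1, 2.0, 0.05, 0.01])
scores = taylor_importance(w, g)
prune_order = np.argsort(scores)  # least important weights first
```

Note that a large weight with a near-zero gradient (like the last one here) can still rank as a good pruning candidate, which is what distinguishes gradient-based ranking from pure magnitude pruning.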
The exact implementation and motivations are further explained in the attached presentation. Initial results are shown for both classification and regression, with various objectives including sparsity, GPU FLOP reduction, and Vivado DSPs and FFs. Since this is a large PR, it is recommended to review the commits one by one, as each commit is self-contained and can be checked out by itself. They are briefly explained below.
Supporting document and presentation
Available at: https://indico.cern.ch/event/1278049/
Type of change
Description
Contributions:
Tests
New tests are added using the PyTest framework. These tests are stored under `hls4ml/test/pytest/optimization`. Each test covers a single addition to the framework and is better explained by the individual commits.

Implementation Details
Results
Comparison with TensorFlow Model Optimization
The proposed method is evaluated on a range of tasks, including jet classification, SVHN classification from the Fast CNNs paper, and a LeNet-like model for Fashion MNIST classification. First, the developed library is compared with TFMOT in terms of unstructured sparsity, across five trials. As seen, the two perform similarly, with hls4ml being significantly better on LeNet.
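For reference, the unstructured-sparsity setting used in this comparison amounts to zeroing the smallest-magnitude weights until a target fraction of zeros is reached. A minimal, self-contained sketch of that operation (illustrative only, not the library's or TFMOT's code):

```python
import numpy as np

def prune_to_sparsity(weights, target_sparsity):
    """Zero out the smallest-magnitude weights until `target_sparsity`
    (fraction of zeros) is reached. Illustrative re-implementation of
    unstructured magnitude pruning."""
    flat = np.abs(weights).ravel()
    k = int(np.floor(target_sparsity * flat.size))
    if k == 0:
        return weights.copy()
    threshold = np.sort(flat)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

# Toy 2x2 weight matrix pruned to 50% sparsity.
w = np.array([[0.9, -0.1], [0.05, -0.7]])
pruned = prune_to_sparsity(w, 0.5)
```

The two smallest-magnitude weights (0.05 and -0.1) are zeroed, leaving the matrix half sparse.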
DSP-level pruning
Secondly, the method is evaluated across a range of reuse factors with the strategy set to Resource. These results are after full Vivado synthesis. Latency is reported from CoSim, not the HLS estimate, in clock cycles (min and max). Where a model has been pruned, it was accelerated using "Unrolled Dense" (#806). The baseline models are accelerated using the current version of master, 0.7 - 0.7.1. The decrease in latency is likely because unrolled dense uses the `pipeline` pragma, while standard Resource uses `dataflow`. However, this is fine, as pruning also reduces the number of LUTs & FFs. BM stands for baseline model, quantised to 16 bits (either <16, 6> or <16, 8>, depending on the accuracy); BP-DSP stands for a model optimised for DSP utilisation, again quantised to 16 bits; and BP-MO stands for multi-objective optimisation, targeting both BRAM and DSP utilisation.

First, DSP-level pruning is tested. The idea is to verify the effects of "pattern pruning" - pruning all the weights processed by the same DSP as the RF varies. This is shown for jet tagging and SVHN, in both cases achieving a significant reduction in DSP utilisation. Furthermore, due to the way hls4ml transposes and stores weights in BRAM, BRAM utilisation is also likely to be reduced (in the same way that unstructured pruning might happen to remove some whole structures).
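The idea behind pattern pruning can be sketched in a few lines: group the weights by the multiplier that processes them, rank the groups, and zero whole groups. The DSP mapping below (weight i goes to DSP i mod n_dsp, with n_dsp = n_weights / RF) is an illustrative assumption; the actual hls4ml weight-to-DSP assignment depends on the backend.

```python
import numpy as np

def pattern_prune(weights, reuse_factor, n_patterns_to_prune):
    """Prune whole 'patterns': groups of weights processed by the
    same multiplier (DSP). Illustrative sketch with an assumed
    weight-to-DSP mapping."""
    flat = weights.ravel().astype(float)  # copy; original untouched
    n_dsp = flat.size // reuse_factor
    groups = flat.reshape(reuse_factor, n_dsp)  # column j -> DSP j
    # Rank DSPs by the L1 norm of the weights they process.
    scores = np.abs(groups).sum(axis=0)
    to_prune = np.argsort(scores)[:n_patterns_to_prune]
    groups[:, to_prune] = 0.0
    return groups.reshape(weights.shape)

# 8 weights, RF = 2 -> 4 DSPs; prune the least important DSP.
w = np.arange(1.0, 9.0)
pruned = pattern_prune(w, reuse_factor=2, n_patterns_to_prune=1)
```

Zeroing a whole column frees an entire multiplier, which is why this structure maps directly to DSP savings rather than just raw sparsity.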
Multi-objective pruning
Next, multi-objective pruning is verified. By pruning all the weights stored in the same BRAM block (precision was set to 18 bits, due to the 36-bit width of BRAM), one can remove one block of RAM and two DSPs for every pruned structure. Results are shown on jet tagging, since streaming CNNs overuse BRAM. In the next table, it is shown how this method also applies to LeNet, significantly reducing DSP utilisation and slightly reducing BRAM.

![multi_objective](https://github.com/fastmachinelearning/hls4ml/assets/59868635/1edc6d63-8ec5-4e1d-aceb-a1d8ab2828a1)
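Selecting which structures to prune under a budget is a classic 0/1 knapsack problem. The sketch below is an illustrative formulation (the PR's exact solver and objective may differ): each item is a prunable structure, its value is the estimated hardware saving (e.g. 1 BRAM + 2 DSPs folded into one score), its weight is an integer accuracy-impact score, and the capacity is the total impact budget.

```python
def knapsack(values, weights, capacity):
    """Classic 0/1 knapsack by dynamic programming; returns the best
    total value and the indices of the chosen items."""
    n = len(values)
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i-1
            if weights[i - 1] <= c:      # or take it, if it fits
                cand = best[i - 1][c - weights[i - 1]] + values[i - 1]
                best[i][c] = max(best[i][c], cand)
    # Backtrack to recover which structures to prune.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return best[n][capacity], sorted(chosen)

# 4 structures: hardware savings, accuracy-impact scores, budget 5.
saving, picked = knapsack([6, 10, 12, 7], [1, 2, 3, 2], 5)
```

A greedy magnitude ranking would not necessarily find this optimum; the DP considers the combined resource saving of every feasible subset, which is what makes the multi-objective formulation attractive.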
Heterogeneous multi-objective pruning for fast inference of LeNet
Consider accelerating a LeNet - in its simple form, it is too large to be accelerated fully unrolled, as the dense layers have ~48k and ~10k weights. Therefore, the design is pruned and accelerated heterogeneously: the Conv2D layers use the Latency strategy with an RF of 1; the Dense layers use the Resource strategy, the first with an RF of 25 and the second with an RF of 12; and the output layer uses the Latency strategy with RF = 1. The design is accelerated with <18, 8> precision. The effects of multi-objective pruning are shown in the table below. The algorithm chooses to prune some individual weights (a single DSP in the Conv2D layers) and some groups of weights (a single BRAM block and 2 DSPs in the Dense layers), depending on the solution of the Knapsack problem.
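The heterogeneous setup above can be expressed as a per-layer hls4ml configuration (the granularity produced by `hls4ml.utils.config_from_keras_model(..., granularity='name')`). A sketch, where the layer names (`conv1`, `dense1`, ...) are hypothetical placeholders for the real Keras layer names:

```python
# Illustrative per-layer hls4ml config for the heterogeneous LeNet:
# Latency/RF=1 convolutions, Resource dense layers with RF 25 and 12,
# all at ap_fixed<18,8> precision. Layer names are hypothetical.
config = {
    'Model': {'Precision': 'ap_fixed<18,8>', 'ReuseFactor': 1},
    'LayerName': {
        'conv1':  {'Strategy': 'Latency',  'ReuseFactor': 1},
        'conv2':  {'Strategy': 'Latency',  'ReuseFactor': 1},
        'dense1': {'Strategy': 'Resource', 'ReuseFactor': 25},
        'dense2': {'Strategy': 'Resource', 'ReuseFactor': 12},
        'output': {'Strategy': 'Latency',  'ReuseFactor': 1},
    },
}
```

Mixing strategies per layer is what lets the small convolutions run fully unrolled while the large dense layers are folded onto fewer multipliers.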
Finally, it is shown how multi-objective pruning can be used to accelerate a general-purpose CNN for fast image classification on a medium-range accelerator card, the ZCU102. The latency is reported in clock cycles; the increase is likely due to the write-out from the accelerator card.

![Screenshot 2023-06-16 at 14 39 14](https://github.com/fastmachinelearning/hls4ml/assets/59868635/263d5601-6584-4ad3-b428-adf4e96a3f99)
Known limitations
This is the first part of the optimization API, introducing the software and ML-side of things. The second part will focus on hardware-specific implementations and improvements, including:
Checklist
- I have run `pre-commit` on the files I edited or added.