Open bwintermann opened 3 months ago
I actually just ran into an issue when executing the whole build flow, getting an OSError on imports from other modules. Currently looking into it.
Fixed the bug. It was apparently caused by the multithreaded IPGen step: every thread reloaded the module and tried to access the same library. Loading is now protected by a singleton function to make sure the library is only loaded once. Running this on my ResNet50 model has reduced the code generation time from ~16min to ~1min 45s. 99% of the remaining time is spent on a call that uses INT8, which still goes through the Python implementation.
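For anyone curious what "protected by a singleton function" can look like, here is a minimal sketch and not the actual code from this PR. The names `get_library` and `loader` are hypothetical and only there for illustration; the idea is simply a lock around the first `ctypes.CDLL` call so that concurrent threads share one handle.

```python
import ctypes
import threading

# Module-level cache for the loaded library handle (hypothetical names).
_library = None
_library_lock = threading.Lock()


def get_library(path, loader=ctypes.CDLL):
    """Return a shared library handle, loading it at most once.

    `loader` defaults to ctypes.CDLL; it is a parameter only so the
    sketch can be exercised without a real shared object on disk.
    """
    global _library
    if _library is None:
        # Only one thread may perform the actual load.
        with _library_lock:
            # Re-check inside the lock: another thread may have
            # finished loading while we were waiting.
            if _library is None:
                _library = loader(path)
    return _library
```

Every worker thread would then call `get_library(...)` instead of loading the shared object itself, so the load happens once no matter how many threads run the step.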
When building ResNet50 I stumbled upon the issue that, due to very large weight tensors, the HLS Codegen step, which internally uses the `array2hexstring` function, was taking a very long time to execute. For tensors of roughly 2 million entries it took around 30s. When doing it for many layers, this step alone would take ~15min per build, making development in later steps difficult due to low iteration speeds.
To speed this process up, I first focused on the BINARY datatype case and rewrote the function in C, integrating it via Python's `ctypes`. I also added tests that compare the results for randomized input tensors against the original Python implementation.
I tested two shapes for the input tensors, one with 64 and one with 2048 as the innermost dimension, both with roughly 2 million elements overall. For both I executed the function 5 times. For the 64 one I got an overall runtime of 237.41s (47.482s per sample) in Python and 2.856s (0.571s per sample) for the C function, for an estimated speedup of ~83x. For the 2048 one I got an overall runtime of 232.201s (46.44s per sample) in Python and 0.115s (0.023s per sample) in C, yielding an estimated speedup of ~2019x, presumably due to lower function call overhead.
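The randomized comparison against the original implementation can be sketched roughly as follows. This is only an illustration under assumptions: both packing functions are hypothetical stand-ins (the second one plays the role of the fast implementation, it is not the C code), and the bit order (first element is the most significant bit) and zero-padding are assumed rather than taken from `array2hexstring` itself.

```python
import random


def pack_binary_reference(bits, pad_to_nbits):
    """Slow reference: build a bit string, left-pad with zeros, convert to hex."""
    bitstring = "".join("1" if b else "0" for b in bits)
    bitstring = bitstring.rjust(pad_to_nbits, "0")
    ndigits = (len(bitstring) + 3) // 4
    return "0x" + format(int(bitstring, 2), "0{}x".format(ndigits))


def pack_binary_fast(bits, pad_to_nbits):
    """Stand-in for the optimized version: pack with shifts, no bit string."""
    value = 0
    for b in bits:
        value = (value << 1) | (1 if b else 0)
    ndigits = (max(len(bits), pad_to_nbits) + 3) // 4
    return "0x" + format(value, "0{}x".format(ndigits))


def check_equivalence(trials=200, max_len=256, seed=0):
    """Compare both implementations on randomized binary inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, max_len)
        bits = [rng.randint(0, 1) for _ in range(n)]
        pad = rng.randint(1, max_len)
        if pack_binary_reference(bits, pad) != pack_binary_fast(bits, pad):
            return False
    return True
```

Fixing the seed keeps a failing case reproducible, which is useful when the two implementations disagree only on rare inputs such as lengths that are not a multiple of 4.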
In the future I would like to expand this to all `DataType`s and try to speed up the C implementation a bit more as well, but for now I don't think that further speedup is strictly necessary.