maltanar opened 4 years ago
For my test case using NUM_DEFAULT_WORKERS = 30 and `StreamingFCLayer_Batch` with "mem_mode" = "decoupled", these are the transformations that take the most time and that I think are well suited for cache support:

- `HLSSynthIP` (1:40 hours)
- `CompileCppSim` (~3 min)
- `PrepareIP` (23 min; this is still sequential and could easily be parallelized as done for `PrepareCppSim`)
- `PrepareRTLSim` (8 min)
- `PrepareCppSim` (~3 min)

I would start with `PrepareIP` and `HLSSynthIP` (which needs cache support in `PrepareIP`).
To speed up the compilation process for large models or large layers, it would make sense to have a caching mechanism for long-running transformations. The cached outputs would be stored persistently and reused, when appropriate, whenever a transform is called. The cache generation/reuse should be optional.
The idea would be to generate a hash for all relevant input data for a transform for a particular node (including node attributes, parameter tensors, quantization annotations...) and use that hash as the cache key for the output products in a folder in a persistent location. Later on, if the same transformation is executed on a node with the same hash, the cached outputs can be reused by copying from the persistent cache folder into a new folder.
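The flow above could be sketched roughly as follows. This is only an illustration, not the actual FINN API: `make_cache_key`, `run_with_cache`, and their argument names are hypothetical, and a real implementation would also need to fold in quantization annotations and guard against partially-written cache entries.

```python
import hashlib
import json
import shutil
from pathlib import Path

def make_cache_key(op_type, attributes, param_tensors):
    """Hash all inputs that influence a transform's output products.

    Hypothetical sketch: attributes is a dict of node attributes,
    param_tensors maps tensor names to their raw bytes.
    """
    h = hashlib.sha256()
    h.update(op_type.encode())
    # Serialize attributes with sorted keys so the hash is
    # independent of dict ordering.
    h.update(json.dumps(attributes, sort_keys=True).encode())
    for name in sorted(param_tensors):
        h.update(name.encode())
        h.update(param_tensors[name])  # raw parameter tensor bytes
    return h.hexdigest()

def run_with_cache(cache_root, key, output_dir, generate_fn):
    """Reuse cached output products if present, else generate and store them."""
    cached = Path(cache_root) / key
    if cached.exists():
        # Cache hit: copy products from the persistent cache folder
        # into a fresh working folder.
        shutil.copytree(cached, output_dir)
    else:
        # Cache miss: run the long-running transform, then persist
        # its outputs under the hash key for later reuse.
        generate_fn(output_dir)
        shutil.copytree(output_dir, cached)
```

Note that the copy-on-hit step keeps the persistent cache read-only from the transform's point of view, so later steps can freely modify their working folder.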
For `NodeLocalTransform` the hashing is relatively straightforward, as only the node itself is passed to the transformation. For others it's hard to generalize, so it's best to consider case by case which transforms yield the most execution-time benefit for the use cases we have.

@quetric @Tobi-Alonso do you have any suggestions for which transforms to start with to get the most benefit, or any other comments?