My local script differs from this one, but I'd rather not mix non-functional changes into this PR.
Overall, C4F16 is ~33% faster on my hardware after these changes.
fp16 is self-explanatory: I don't think we need more precision than that, and it gives a nice perf boost.
Reordering the shared buffer also improves perf.
Data interleaving is somewhat risky: I haven't tested it with real weights, but it should be fine. I don't see much gain from this change, though, so let's consider it cosmetic.
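For anyone who wants a rough feel for how much precision fp16 gives up, here is a minimal Python sketch. It is not part of this PR and does not touch the shader code. The helper names and the sample range of [-1, 1] are my assumptions for illustration only; real weights and activations may fall in a different range.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and return it as a float."""
    # The "e" format code packs a binary16 (half-precision) float.
    return struct.unpack("<e", struct.pack("<e", x))[0]

def max_fp16_error(values):
    """Largest absolute error introduced by an fp16 round trip over the given values."""
    return max(abs(v - to_fp16(v)) for v in values)

# Hypothetical sample range: evenly spaced values in [-1, 1].
samples = [i / 1000.0 for i in range(-1000, 1001)]
err = max_fp16_error(samples)
print(f"max abs round-trip error on [-1, 1]: {err:.6f}")
```

Within [-1, 1] the round-trip error is bounded by 2^-12, since fp16 has a 10-bit mantissa. Whether that is small enough here still has to be confirmed visually.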
Please test and check that there are no visual regressions :)
Fixes #4