Some GLSL bikeshedding

I have a different script locally, but it's probably better not to mix in non-functional changes.

Overall, C4F16 is ~33% faster on my hardware after these changes.

fp16 is self-explanatory: I don't think we need more precision than that, and it gives a nice perf boost.
The shared buffer reorder also increases perf.
The data interleave is somewhat risky: I haven't tested it with real weights, but it should be fine. I don't see much gain from this change though, so let's consider it cosmetic.
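
For a rough feel of the precision fp16 gives up, here is a minimal Python sketch. It is not part of this PR and does not touch the shaders. The sample values are made up and only stand in for typical weights or activations; the helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" struct format is binary16; packing then unpacking performs the rounding.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_relative_error(values):
    """Largest relative error introduced by storing each non-zero value as fp16."""
    return max(abs(to_fp16(v) - v) / abs(v) for v in values if v != 0.0)


# Hypothetical sample values in [0.001, 1.0]; real weights are NOT used here.
samples = [i / 1000.0 for i in range(1, 1001)]
err = max_relative_error(samples)
print(f"max relative error after fp16 round-trip: {err:.2e}")
```

For values in fp16's normal range the relative rounding error stays at or below 2^-11, i.e. about three significant decimal digits. Whether that is enough in practice still has to be checked visually on real output.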

Please test whether there are any visual regressions :)

Fixes #4

Artoriuz / ArtCNN