EMS-TU-Ilmenau / fastmat

A library to build up lazily evaluated expressions of linear transforms for efficient scientific computing.
https://fastmat.readthedocs.io
Apache License 2.0

Hard coded transform batch size in norm calculation #86

Closed SebastianSemper closed 3 years ago

SebastianSemper commented 4 years ago

The size of the batches in the general norm calculation is currently set to a fixed value, which might not be desirable if the vectors before or after the transform are large. This might result in memory allocation errors.

# number of elements we consider at once during normalization
cdef intsize numStrideSize = 256
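To illustrate why the fixed stride matters, here is a hedged Python sketch of a batched column-norm computation (names like `column_norms_batched` are hypothetical, not fastmat's actual implementation): unit vectors are pushed through the transform `stride` at a time, so the working memory per batch scales with both the vector length and the stride.

```python
import numpy as np

def column_norms_batched(apply_transform, num_cols, stride=256,
                         dtype=np.float64):
    """Hypothetical sketch: compute the column norms of a linear
    transform by applying it to batches of identity columns,
    `stride` columns at a time."""
    norms = np.empty(num_cols, dtype=np.float64)
    for start in range(0, num_cols, stride):
        stop = min(start + stride, num_cols)
        # batch of unit vectors; input memory scales with
        # num_cols * stride, output memory with num_rows * stride
        batch = np.zeros((num_cols, stop - start), dtype=dtype)
        batch[np.arange(start, stop), np.arange(stop - start)] = 1
        out = apply_transform(batch)
        norms[start:stop] = np.linalg.norm(out, axis=0)
    return norms
```

With a fixed `stride=256`, a transform with very long input or output vectors allocates large temporaries per batch, which is exactly the failure mode described above.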
ChristophWWagner commented 4 years ago

I suggest the following to solve the problem:

We could take a peek at the memory size of one vector and determine a better-suited chunk size from that figure. This would align the chunks with the cache sizes commonly found in modern processor architectures, resulting in more consistent performance.

A defensive approach would be to simply assume that 1M of cache memory is available on virtually all machines around nowadays. Then we determine the stride size from that:

numStrideSize = max(1, 1048576 // np.empty((self.numCols, ), dtype=self.dtype).nbytes)

Since np.empty does not initialize the data section of that vector, the overhead should be negligible, making it preferable over hard-to-read, low-level architecture bean-counting in this context.

For better adaptability, the magic number 1048576 should go into the flags object, such that this can be controlled by the user, perhaps even initialized from reading out the actual cache sizes.

Opinions? Objections?