Possible Documentation Gap in Scan over Multiple Axes

fjarri / reikna

Pure Python GPGPU library

http://reikna.publicfields.net/

MIT License

164 stars 16 forks source link

Possible Documentation Gap in Scan over Multiple Axes #45

Closed robertmaxton42 closed 6 years ago

robertmaxton42 commented 6 years ago

(Sorry to keep bugging you when work is busy; just let me know when you have a moment)

Scan behaves not-intuitively relative to its documentation, in that it does not specify what is meant by scanning over multiple axes at once. I was expecting axis=[0,1] to be equivalent to a Scan over axis 0, followed by a Scan over axis 1, with the benefit of saving an intermediate, internal call to Transpose; but upon a review of the code this is not the case, as a Scan instead transposes so that the listed axes are adjacent and innermost, flattens them, and scans over the new, merged "axis", which rather surprised me when I saw it in action. It would be helpful if this was mentioned explicitly in the documentation.

Thanks for all the work!

fjarri commented 6 years ago

The thing is, if the axis size is large enough, transposition makes a big difference in performance, and currently you cannot avoid it. But if the axis is only several elements long, it may be possible for non-sequential memory access to be faster than transposition + sequential memory access. Perhaps I should add a no_transpose option or something like that, but that's a different issue.

(Sorry to keep bugging you when work is busy; just let me know when you have a moment)

I will actually be travelling until the middle of September, so my responses will be infrequent, but don't let that stop you, file an issue if you have a problem, I'll get to them eventually :)

robertmaxton42 commented 6 years ago

The thing is, if the axis size is large enough, transposition makes a big difference in performance, and currently you cannot avoid it.

Yeah, I found that out the hard way recently :P. But no, even under that circumstance sequential Scans have an extra call to transpose - because rather than going from (a, b, c) -> (a, c, b) -> (a, b, c) and then (a, b, c) -> (c, b, a) -> (a, b,c), you can go directly from (a, c, b) -> (b, c, a) and then out from there.

... Well, assuming that all permutations of axes take equal time, which I just realized isn't necessarily the case. Whoops.