eddelbuettel / rcppcnpy

Rcpp bindings for NumPy files
GNU General Public License v2.0
26 stars 16 forks source link

"Feature" of transposed ndarrays #28

Open JaySLee opened 4 years ago

JaySLee commented 4 years ago

Dear Dirk,

Firstly, thanks so much for this package!

So, I think I found an undocumented feature for npyLoad(), in RcppCNPy version 0.2.10 (latest version?). If I npyLoad a transposed numpy.ndarray, then the default dotranspose=TRUE results in an improper R array and dotranspose=FALSE is required. Alternatively, one can do a .copy() before numpy.save()-ing the array. I couldn't find anything in your documentation that addresses this.

Best, Jay

eddelbuettel commented 4 years ago

Hi Jay,

That aspect has mostly been a headscratcher for me over the years. If you can think of a programmatic way to detect the need for either setting, or explicit transposes, can you maybe think of a PR?

Otherwise, can you suggested a line or two to describe this for documentation?

I don't work with NumPy data so it is simply not something I hit often.

JaySLee commented 4 years ago

Hi Dirk,

After some investigation, I found that numpy arrays have a flag as to whether it is C-contiguous (row major) or Fortran contiguous (column major): arr.flags, arr.flags.c_contiguous, arr.flags.f_contiguous . The f_contiguous flag becomes the 'bool fortran-order' in the cnpy structure (cnpy.h). As you pointed out in your vignette, numpy arrays by default are row major, which is C-contiguous, but you stated Fortran for Python and C for R, when it should be the other way around (but the description in the vignette seems correct).

When a numpy array is transposed (without being copied afterwards), the flags switch such that c_contiguous = False and f_contiguous = True. So, one solution would be to modify cnpyMod.cpp so that it checks arr.fortran-order (in npyLoad()) and transposes accordingly (i.e. if (arr.fortran-order == 0) ...). This could mean getting rid of the dotranspose argument. However, maybe you'd want to leave that arg in for users to better control the loading. So then, one could add an autotranspose arg (boolean) to npyLoad() that overrides whatever dotranspose is set to.

Alternatively, for the documentation, we could add that numpy arrays that are transposed in Python (via arr.T or arr.transpose()), but not copied or treated with the numpy.fastCopyAndTranspose(arr), will require npyLoad(dotranspose = False). I am unsure at the moment if Python's arr.reshape() has the same effect.

I can suggest a pull request depending on which one you want to go with.

Best, Jay

eddelbuettel commented 4 years ago

Thanks for the analysis!

First paragraph: I do not understand

As you pointed out in your vignette, numpy arrays by default are row major, which is C-contiguous, but you stated Fortran for Python and C for R, when it should be the other way around (but the description in the vignette seems correct)..

Can you maybe expand to two or three sentence and state what is written, what is correct or incorrect, and what you suggest should be written and why?

We only have bool fortran_order. Are you suggesting a consistency check between the dotranspose argument and the fortran_order value? That may make sense. Right now, I do not recall why I don't use fortran_order. Maybe I found it to be inconsistently set... git blame shows it was added 16 months later. Maybe it simply never was used correctly, and you can help as you can create test data.

Second paragraph: Are c_contiguous and f_contiguous NumPy values, so we do not have access to them?

Third paragraph: I am unsure what you are suggesting here.

I think we are not quite a stage where we know what a pull request should do but I do really appreciate your digging into this.