Permit different array strides for fillsinks

wkearn commented 5 months ago

fillsinks currently assumes that arrays are stored in column-major order: the dimension with size nrows should vary the fastest. This follows MATLAB, Fortran and Julia practice but the default NumPy array storage is column-major, and we need to copy NumPy arrays to a new storage order to ensure that the morphological reconstruction is done correctly. Column-major order is not a strict requirement of the morphological reconstruction algorithm, which only needs the strides of each dimension to compute the neighborhood of a given pixel. If we explicitly pass in the strides, we should be able to operate on row- or column-major data and on subsets or views of data without requiring copies.

wschwanghart commented 5 months ago

What exactly do you mean by strides? The number of rows and columns?

wschwanghart commented 5 months ago

It seems that this column major issue will likely have quite some implications, thinking of the edgelists in FLOWobj. Is there a way that a c function knows the order of the array supplied by the caller function?

wkearn commented 5 months ago

The stride of an array in a particular dimension is the offset in memory required to increment that dimension by one. It is how you convert from the subscripts to the linear indices for a given array.

For a column-major array, used by MATLAB, Fortran or Julia among others, you would index an array of size (nrows,ncols) at (row,col) like

array[row + nrows * col];

The strides corresponding to each dimension are (1,nrows) because you increment the linear index by 1 to go to the new row and by nrows to go to the next column.

For a row-major array, which is used by NumPy and by C's multidimensional arrays, you would access the same cell with

array[ncols * row + col];

so the strides are (ncols,1). The advantage of passing the strides explicitly is that you can write something like

array[istrides[0] * row + strides[1] * col];

and it works regardless of whether you are using a column-major or a row-major array. MATLAB or Julia would supply strides = [1,nrows] and Python could supply strides = [ncols,1].

In this example, the strides happen to be the same as the number of rows or columns in each dimension, but in the other order. They don't have to be, though. You could imagine operating on a subset of an (nrows,ncols) array: the number of elements that we would scan over in each dimension is less than nrows and ncols, but the strides are still either (ncols,1) or (1,nrows). This might come in handy if, as you mentioned to me earlier today @wschwanghart, you wanted to analyze a single watershed in a DEM: you could extract just the bounding box of the watershed and run the analysis without needing to copy the subset to a new array.

This does have some implications for the design of the edge lists because the linear index used to identify a pixel in a DEM depends on whether the array is laid out in column- or row-major order. A C function does not implicitly know the order of whether an array is column- or row-major: all it sees is a linear array of bytes. One could pass a flag to the function that alerts it to the storage order (NumPy actually enables this for a few of its functions), but, as described above, the strides are really the important thing to know. The subscript index (row,col) is the same for each corresponding pixel in your array, regardless of its storage order. The linear index is always strides[0] * row + strides[1] * col. If we keep the strides around, then the linear index uniquely determines the corresponding pixel, and we should be able to keep using the linear indices as node identifiers.

The problem arises if, for example, you wanted to use an edge list computed from an array with a column-major order to index into an array with the same size but laid out in row-major order. I am not sure how this would happen intentionally. Perhaps if you used TopoToolbox to generate your flow network and another program to generate a derivative product that you want to accumulate along the flow network, and the other program insisted on writing out the data in row-major order. The safest thing to do in this case is to always index into an array with the (row,col) and the strides of that array, converting from the stored linear index if necessary. One could also store the subscript indices in the edge list at the cost of twice as much memory.

The alternative to all of this is to choose one storage order and always convert everything. This simplifies our lives, but has a performance cost when working in environments that except the alternative storage order because you have to copy everything in and out of the preferred storage order.

Note also, that this discussion is entirely about the organization of the DEM in memory and is not related to the relationship between the axes of the array and a projected coordinate system.

TopoToolbox / libtopotoolbox

Permit different array strides for fillsinks #23