Thanks for reporting this!
Yes, you are right that the spec is unclear about this, and also that argmin/argmax are not separable across the axes. The original intention was, as you guess, that the result should be the global (linearized) element offset, so 6 in the above case.
However, there may be some practical problems with that definition. The original idea was that the result should be the linearized offset within the subspace spanned by `axes`. For a single axis, that results in the index along the axis, and for all the axes of the tensor, it results in the global offset. While the definition is mathematically okay, in practice it could be complicated to use the result, as the indices cannot be directly used to access the original tensor; some flattening (reshaping) would be required to do so. However, in that case, the whole argmin/argmax index search could be done on the flattened tensor to start with. So it may be just cleaner and simpler to restrict argmin/argmax to a single axis and resort to explicit flattening in cases where multiple axes need to be handled.
I have just checked other formats/frameworks. ONNX and TensorFlow only allow a single axis, and PyTorch supports a single axis and the global case (all axes) only. The current NNEF sample implementation also only supports a single axis. So I think restricting to a single axis would also be consistent with those.
Furthermore, I also realized that the spec does not define what happens in case of ties / multiple min/max values. In this case, the index of the first min/max value should be returned.
Also, note that there is another similar operation, `argmax_pool`, that does apply to multiple spatial dimensions. In that case, the spec does specify that the index is the global one within the pooling window (basically the subspace spanned by the pooling axes; the spatial dimensions). However, for that operation there exist other operations, `sample` and `desample`, that are meant to make use of those indices. But these kinds of operations (where gradients were propagated through `max_pool` operations via the actual max indices) went out of fashion and were never really used, I guess. Also, for the `argmax_pool` operation, the spec does specify that in case of ties, the first index is reported.
So I'd propose to:

- restrict argmin/argmax to a single reduction axis, and
- specify that in case of ties, the index of the first min/max value is returned.

What do you think?
I think restricting the operation to one reduction axis is a possible fix, but it might be less useful than the alternative you described when you said: “the original idea was that the result should be the linearized offset within the subspace spanned by axes.”

I’m very intrigued you said this, because even though I didn’t mention it in my first post, this is the use case that brought the issue to my attention in the first place. If you’ll forgive me a short tangent, I’ll elaborate a little more:
At our company, we are using NNEF to specify the graph for our training software. Since we are training, and not inferencing, the graph needs to calculate a loss function. To make this easier and more efficient, we’ve created a vendor extension that introduces compound operations for various loss functions. For example, one of the loss functions is `sigmoid_cross_entropy`:
```
fragment sigmoid_cross_entropy( x: tensor< scalar >, condition: tensor< logical > ) -> ( y: tensor< scalar > )
{
    n = neg( x );
    a = select( condition, n, x );
    y = softplus( a );
}
```
The idea is that `x` is the linear output of the last layer of the neural network, `condition` holds the training label for the binary class, and `sigmoid_cross_entropy` computes the cross-entropy loss of `sigmoid( x )` directly as a function of `x`. However, this compound function does it in a way that is more efficient and has more numerical stability than doing it the naïve way.
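For the record, the identity the fragment relies on can be written out as follows (my shorthand, with t = 1 where `condition` is true and t = 0 otherwise):

```math
\begin{aligned}
-\log \sigma(x) &= \log\!\left(1 + e^{-x}\right) = \operatorname{softplus}(-x),\\
-\log\!\left(1 - \sigma(x)\right) &= \log\!\left(1 + e^{x}\right) = \operatorname{softplus}(x),\\
L &= -t\log\sigma(x) - (1-t)\log\!\left(1-\sigma(x)\right)
   = \begin{cases}\operatorname{softplus}(-x) & t = 1\\ \operatorname{softplus}(x) & t = 0\end{cases}
\end{aligned}
```

which is exactly `softplus( select( condition, neg( x ), x ) )`.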
Now, we are also doing something similar to define a compound function `softmax_cross_entropy`:

```
fragment softmax_cross_entropy( x: tensor< scalar >, one_hot: tensor< integer >, axes: integer[ ] = [ 1 ] ) -> ( y: tensor< scalar > )
{
    m = max_reduce( x, axes = axes );
    z = sub( x, m );      # z is the same shape as x
    e = exp( z );
    s = sum_reduce( e, axes = axes );
    a = # magic happens here as function of z and one_hot!
    b = log( s );
    y = sub( b, a );
}
```
The idea here is that `x` is again the linear output of the last layer of the neural network, `one_hot` holds the training label of the index to the “hot” categorical class as a linearized offset within the subspace spanned by axes, `axes` specifies the reduction axes, and `softmax_cross_entropy` computes the cross-entropy loss of `softmax( x )` directly as a function of `x`. Again, it does this in a manner where the compound function is more efficient and has more numerical stability than doing it the naïve way.
Since the number of categories can sometimes be very large, it becomes inefficient to encode a one-hot vector explicitly. This is why `one_hot` is defined to be an integer tensor which has reduced shape. The integral value associated with each element of `one_hot` represents the index of the “hot” element as “the linearized offset within the subspace spanned by axes.” Then, in order to calculate `a`, we need to select these “hot” elements from `z` into `a`, which also has reduced shape.
So even though we are not explicitly using `argmin_reduce` or `argmax_reduce` in this example, it is “as if” `one_hot` was the result of one of these operations in a manner like what you describe. While working on this problem, and thinking about how to calculate `a`, I was pondering `argmin_reduce` and `argmax_reduce` semantics for similar reasons, and this is what led me to the question of these posts.
So back to: “the original idea was that the result should be the linearized offset within the subspace spanned by axes.”
Calculating the linearized offset with respect to the input tensor is trivial and happens anyhow when evaluating the operation. So this much is very easy. The question that remains: how easy or difficult is it to convert the linearized offset with respect to the input tensor into a linearized offset within the subspace spanned by axes?
Referring to this blog post about indexing multidimensional arrays: we already know the extents of both the input and the reduced tensors, and we also know the linearized offset with respect to the input tensor. I was starting to investigate the mathematics of conversion: how hard or easy would it be to compute the linearized offset within the subspace spanned by axes from all this information.
I haven’t finished. Or maybe someone else already knows the answer. If a reasonable formula exists, this may be the best or preferred solution.
Another way to approach this issue might be to have argmin_reduce and argmax_reduce operations always return the linearized offset with respect to the input tensor and then have separate standard operations that: (a) map the linearized offset with respect to the input tensor into the linearized subspace spanned by the axes; and, (b) select the actual values determined by (a) from the input tensor into a reduced output tensor (as in the place where the "magic" happens in my `softmax_cross_entropy` example to calculate `a` from `z` and `one_hot`).
Or perhaps argmin_reduce and argmax_reduce compute (a) directly, as you suggested was the original intent of these operations, and then there is a second standard operation to perform (b).
P.S. I can vouch that the softmax_cross_entropy is still a mainstay in modern neural network design and won't be going away anytime soon. So I think it's a good real-world example use-case and not academic. The only thing is it is for training, and I know NNEF was originally conceived for inferencing. But in our experience, NNEF is perfect for training as well because the whole graph is the same except at the output layer we compute loss instead of activation.
It's good to hear that you are using NNEF for training, I completely agree that it is suitable for it if the training system is able to deduce the backward graph. In fact these tools can also do that; it is possible to build the forward graph from NNEF into a PyTorch or TensorFlow NN module, and then let PyTorch/TF do the training. But as you note, NNEF is also suitable for describing the operations required for the back-propagation phase. In fact, the softmax example you show here is something I did play around with back when NNEF ops were originally designed, and I remember this use case, although I did not pursue it this far.
Going back to the actual argmin/argmax ops and how it would be useful to specify their output, I wonder if they could rely on commonly existing operations to fetch data based on indices: `gather` and `gather_nd`. NNEF currently only specifies the `gather` operation, which works with a single axis, but I don't quite see how it could be used to fetch the data designated by the indices returned by argmax/argmin; maybe it is possible with the right parameterization. Have you looked into those operations to see if they could be used for your purposes?

I tend to think that an operation to map indices between spanned subspaces is possible if the corresponding shapes are also known (basically what flattening and unflattening does under the hood); however, it does not really constitute a good candidate for an NN operation, as it operates purely on indices and not data.
For the `softmax_cross_entropy` example, NNEF’s `gather` operation doesn’t have the correct semantics, unfortunately. What’s needed is a new operation that I’ll call `one_hot_gather`, which has the following signature:

```
fragment one_hot_gather( input: tensor< scalar >, one_hot: tensor< integer > ) -> ( output: tensor< scalar > );
```
The argument validity of this operation requires, for any semantically valid value of `axes`, that the shape and semantics of `one_hot` are “as if” `one_hot` is the result of `argmin_reduce( input, axes )` or `argmax_reduce( input, axes )`. We are also assuming in this case that `argmin_reduce` and `argmax_reduce` return the linearized offset within the subspace spanned by `axes`. The shape of `output` is the same as `one_hot`, and for every element of `one_hot`, the operation selects the corresponding value from `input` into `output`.
I’ve noted that `axes` is not a required argument for `one_hot_gather`, because this information is already implicitly captured in the differences between the shapes of `input` and `one_hot`. For example, the extents of `input` and `one_hot` must be equal except for the dimensions in `one_hot` that are reduced to 1 (such dimensions were the elements of `axes`).
For completeness, I would also note that if:

```
a = min_reduce( input, axes );
b = max_reduce( input, axes );
u = argmin_reduce( input, axes );
v = argmax_reduce( input, axes );
x = one_hot_gather( input, u );
y = one_hot_gather( input, v );
```

then tensors `a` and `x` should be equal, and tensors `b` and `y` should be equal.
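To make the intended semantics concrete, here is a rough NumPy sketch of the hypothetical `one_hot_gather` described above (illustrative only, not a proposed implementation; for simplicity it infers the reduced axes from the singleton dimensions of `one_hot`):

```python
import numpy as np

def one_hot_gather(input, one_hot):
    """Sketch: for every reduced position, pick the element of `input` whose
    linearized offset within the subspace spanned by the reduced axes is given
    by `one_hot`.  The reduced axes are inferred from the singleton dimensions
    of `one_hot` (ambiguous if `input` itself has singleton dimensions)."""
    axes = tuple(d for d in range(input.ndim)
                 if one_hot.shape[d] == 1 and input.shape[d] != 1)
    sub_shape = tuple(input.shape[d] for d in axes)       # subspace spanned by axes
    output = np.empty(one_hot.shape, dtype=input.dtype)
    for idx in np.ndindex(one_hot.shape):                 # loop over reduced positions
        sub_idx = np.unravel_index(int(one_hot[idx]), sub_shape)
        full_idx = list(idx)
        for a, s in zip(axes, sub_idx):
            full_idx[a] = s
        output[idx] = input[tuple(full_idx)]
    return output

# Consistency check from above: gathering with argmin offsets reproduces min_reduce.
x = np.random.rand(2, 3, 4)
u = x.reshape(2, -1).argmin(axis=1).reshape(2, 1, 1)      # argmin over axes [1, 2], subspace offsets
assert np.allclose(one_hot_gather(x, u).squeeze(), x.min(axis=(1, 2)))
```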
Now in reality for the `softmax_cross_entropy` example, the `one_hot` tensor argument is not actually the result of `argmin_reduce` or `argmax_reduce`. Rather, it is a training label provided explicitly in the graph from an external tensor. However, it satisfies the “as if” semantics mentioned above, so it can be used to calculate `a` from `z` and `one_hot`:

```
a = one_hot_gather( z, one_hot );   # replaces “magic” in softmax_cross_entropy
```
Does it make sense?
Yes, it makes sense, I understand what you are trying to do, and also agree that the `gather` operation is not suited for that.

However, if `one_hot` is like the result of `argmin_reduce(z, axes)`, then `a` could be (re-)computed as `a = min_reduce(z, axes)`, no? I understand that it looks like excess computation, and I think typically what can be done to avoid that is to also feed `a` back to the backward pass from the forward pass (where it has already been computed). Could that work for you?
The reason it doesn't work in the `softmax_cross_entropy` example is that `one_hot` is a training label provided from an external tensor and not from actual computation. So `one_hot` represents the ground truth and cannot be recomputed from data, i.e., `min_reduce( z, axes )` and `one_hot_gather( z, one_hot )` are not generally equal or equivalent.
To understand this in more detail, consider the following. Let's assume x is a simple row vector. If y = softmax( x ), and if t is the training label, which is a true one-hot vector where t is the same shape as x and y and every element of t is zero except for the k-th element, which is one, then the cross-entropy loss L equals -sum( t * log( y ) ). Since all elements of t are zero except element k, which is one, L reduces to -log( y[k] ), which is equal to -log( e[k] / s ) or log( s ) - log( e[k] ); and, if a = z[k], then L = log( s ) - a. So this demonstrates why a cannot be computed from any data and must be obtained from `one_hot_gather`.
The derivative of L is just y - t.
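Written out compactly (same algebra as above, nothing new):

```math
\begin{aligned}
z &= x - \max(x), \qquad e = \exp(z), \qquad s = \textstyle\sum_i e_i, \qquad y = \operatorname{softmax}(x) = e / s,\\
L &= -\sum_i t_i \log y_i = -\log y_k = -\log\frac{e_k}{s} = \log s - \log e_k = \log s - z_k = \log s - a,\\
\frac{\partial L}{\partial x} &= y - t.
\end{aligned}
```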
Oh, indeed, I forgot that the indices are coming from the desired targets and not the forward pass.
I guess I thought it would be expressible somehow because long ago I did play around with the backward pass of the `softmax` operation and I remember that I managed to formulate it using standard operations, but then I started from a target vector represented as almost all zeros and a 1 in the target position (the `t` in your notation, right?), but that may be the less efficient way of calculating it.
Yes, the t in my notation is the target vector that you mention (a.k.a. the training label).
The other thing I'd point out is that the backward pass of `softmax_cross_entropy` only requires the forward evaluation of the `softmax` operation, since the derivative of `softmax_cross_entropy` is y - t, where y = softmax( x ). So you may have been looking at the backward pass of `softmax` directly, but that's not required here.
The formulation for `softmax_cross_entropy` using t directly is -sum( t * log( y ) ). So maybe you were looking at that. But this is less efficient because if the volume of t is large, it requires a similarly large number of multiplications with the elements of log( y ) followed by a sum reduction, when in reality the formula only requires an array access via `one_hot_gather`. Calculating log( y ) directly can also result in unwanted numerical artifacts. This is why using the algebraic substitutions in my previous post is desirable, because it leads to very well-behaved numerical computation.
Right, I understand. I'd say efficiency may be less of an issue here because the derivative of the loss function is probably a tiny fraction of the total compute load. But I guess numerical stability is more important here.
So if `argmin_reduce` and `argmax_reduce` return the linearized offset within the subspace spanned by axes, do you know what the formula or algorithm would look like? Same question for a hypothetical `one_hot_gather` operation. I'm wondering how easy or difficult these computations are.
So if I understand correctly, you'd like to transform from indices in the subspace to global (flat) indices, right?
I guess in any case, to get one specific item from the tensor, you need indices in all dimensions; so besides the indices in the subspace, you also need to suppose that you have all the other indices in all other dimensions (for example you are iterating through all other dimensions). In that case it is simply the usual formula for implementing multi-dimensional arrays based on flat memory. Given the shape of the tensor (sizes in all dimension), it is possible to recursively calculate the offset in a flat array by starting from zero offset, iterating through the dimensions and multiplying the offset calculated so far by the size of the next dimension and adding the index in the next dimension. I guess you are familiar with that, just think about how you'd calculate the offset in a matrix.
It is also possible to go in the other direction, given the flat offset you can calculate all the n-dimensional indices by modulo and remainder operations with the appropriate sizes. Finally, when going from an n-dimensional index space to another m-dimensional index space, you can think about that as first going from n-dimensional to flat, and then from flat to m-dimensional.
Does that answer your question?
Can you provide an example of calculating all the indices by modulo and remainder operations? I believe I understand the rest of what you're saying but am not clear about this part.
For example, for a matrix of shape `(m,n)`, let's suppose you have a plain buffer offset `x` of an item. Then to calculate the index `(i,j)` corresponding to that item in the matrix, you have:
j = x % n
i = (x / n) % m == (x / n)
where the division is among integers and rounds downwards.
In the more general case of `d` dimensions, suppose you have shape `(s_1, s_2, ... , s_d)`, and you want to calculate indices `(i_1, i_2, ... , i_d)` from flat buffer offset `x`. Then you can do that going backwards from `s_d`, iteratively taking modulo and dividing the offset:
i_{d} = x % s_{d}
i_{d-1} = (x / s_{d}) % s_{d-1}
i_{d-2} = ((x / s_{d}) / s_{d-1}) % s_{d-2}
...
Hope I did not mess up the dimensions/indexing..
Does that make it more clear?
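In code, the two directions look roughly like this (just a quick sketch in plain Python, assuming row-major layout; not something from the spec):

```python
def offset_from_indices(indices, shape):
    """(i_1, ..., i_d), (s_1, ..., s_d) -> flat row-major offset."""
    offset = 0
    for i, s in zip(indices, shape):
        offset = offset * s + i
    return offset

def indices_from_offset(offset, shape):
    """Flat row-major offset -> (i_1, ..., i_d), going backwards from s_d."""
    indices = []
    for s in reversed(shape):
        indices.append(offset % s)
        offset //= s
    return tuple(reversed(indices))

# e.g. for a (3, 4) matrix:
#   offset_from_indices((1, 2), (3, 4)) == 6
#   indices_from_offset(6, (3, 4))      == (1, 2)
```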
It appears this formula only works when s_1, s_2, ..., s_d all have the same value.
Assuming row-major order, the global offset of a multidimensional array is:
offset = ( ( ( i_1 ) * s_2 + i_2 ) * ... ) * s_d + i_d
So consider the shape [ 4, 4, 4, 4 ] and indices [ 2, 1, 2, 3 ]. In this case, offset equals 155. If we run the formula to calculate the indices from offset and the shape, it produces the correct outcome. Now change the shape to [ 7, 3, 5, 4 ] and use the same indices. In this case, offset equals 151 and if we run the formula to calculate the indices, it produces a result of [ 1, 1, 0, 3 ] instead of [ 2, 1, 2, 3].
It works for me:
151 % 4 = 3, 151 / 4 = 37
37 % 5 = 2, 37 / 5 = 7
7 % 3 = 1, 7 / 3 = 2
2 % 7 = 2
so you get the indices in reverse order, as the result of the modulo operations with the shapes (in reverse order). In essence, it is a reversal of the formula that you have above for calculating `offset`.
Thank you, Viktor. I found my bug.
I'm going to update our NNEF compiler with these formulas and try training some models. I'll let you know how it goes, but I think at this point I expect that everything should work.
Hi Viktor,
We’ve added this functionality to our NNEF compiler and have used it to train models. Everything works as expected.
After the development process, we learned a few things that gave us additional thoughts about standardizing the interface. The first thing we noticed is that once the results of an argument reduction are produced by `argmin_reduce` or `argmax_reduce`, these results may be used in a couple of different ways.
For example, consider the following:
```
input = ...  # assume input is a scalar tensor with shape [ 7, 3, 5, 2 ]
offsets = argmax_reduce( input, axes = [ 0, 2, 3 ] );  # offsets has shape [ 1, 3, 1, 1 ]
# the subspan of input relative to offsets is [ 7, 1, 5, 2 ]
```
For the sake of clarity, let’s define “global” offsets as integer values that index into the shape of `input`, i.e., [ 7, 3, 5, 2 ], and let’s define “local” offsets as integer values that index into the shape of the subspan [ 7, 1, 5, 2 ]. Let’s also define the shape of `offsets` as an example of a valid “reduction” of the shape of `input`, i.e., each extent of `offsets` equals the corresponding extent of `input` or 1.
Let’s also assume there is a new standard operator called `gather_elements` with the following interface:

```
fragment gather_elements< ? = scalar >( input: tensor< ? >, offsets: tensor< integer > ) -> ( output: tensor< ? > )
```
The semantics of `gather_elements` require:

- The rank of `offsets` must be less than or equal to the rank of `input`.
- Each extent of `offsets` must: (a) equal the corresponding extent of `input`, or (b) equal 1, i.e., the shape of `offsets` must be a valid “reduction” of `input`.
- Each integer value of `offsets` must be a local offset into the subspan implied by the shape of `input` and `offsets`. For example, if `input` has shape [ 7, 3, 5, 2 ] and `offsets` has shape [ 1, 3 ], the implied subspan has shape [ 7, 1, 5, 2 ] and the integer values of `offsets` must index into the shape of this subspan.
- The shape of `output` is the same as `offsets`.

If the standard defines that `argmax_reduce` returns local offsets, then we can gather the actual values from `input` like this:

```
values = gather_elements( input, offsets );
```
However, notice that this requires `argmax_reduce` to convert global offsets of `input` into local offsets, and it then requires `gather_elements` to convert the local offsets back into global offsets of `input`!
This is a lot of unnecessary computation to perform. Instead, let’s say that `argmax_reduce` returns global offsets and that `gather_elements` also requires the integer values of `offsets` to be global offsets into the shape of `input`. Then, the operation:

```
values = gather_elements( input, offsets );
```

produces exactly the same results as before, except now all the unnecessary computation is eliminated.
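To illustrate, with global offsets a reference implementation of `gather_elements` reduces to essentially a flat take (a minimal NumPy sketch, illustrative only, not our actual implementation):

```python
import numpy as np

def gather_elements_global(input, offsets):
    """Sketch: the integer values of `offsets` are *global* (row-major) offsets
    into `input`; the output has the shape of `offsets`."""
    return input.reshape(-1)[offsets]

# Example with the shapes used above: argmax over axes (0, 2, 3) of a [7, 3, 5, 2]
# tensor, with the result expressed as global offsets.
x = np.random.rand(7, 3, 5, 2)
moved = np.moveaxis(x, 1, 0).reshape(3, -1)      # kept axis first, reduced axes flattened
local = moved.argmax(axis=1)                     # local offsets within the (7, 5, 2) subspan
i0, i2, i3 = np.unravel_index(local, (7, 5, 2))
offsets = np.ravel_multi_index((i0, np.arange(3), i2, i3), x.shape).reshape(1, 3, 1, 1)

values = gather_elements_global(x, offsets)      # shape (1, 3, 1, 1), no conversion needed
assert np.allclose(values.squeeze(), x.max(axis=(0, 2, 3)))
```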
This example has prompted us to consider that it may not be wise to standardize the semantics of argument reduction instructions to return local offsets. Instead, would it not be better to standardize argument reduction instructions that return global offsets?
Note that other standard operations could still be introduced, if need be, that convert global offsets to local offsets or vice-versa. Since the argument reduction operations return integer tensors, such conversions would be bona-fide tensor operations. In our training use-case, for example, target values are local offsets which are introduced into the graph via an external operation:
```
targets = external< integer >( shape = [ 1, 3 ] );
```
At run time, we need to convert these local offsets to global offsets. So we define a new tensor operation:
```
fragment shape_from_subspan( offsets: tensor< integer >, shape: tensor<> ) -> ( output: tensor< integer > )
```
This operation requires that: (a) the extents of `shape` define the desired “global” index space, (b) the shape of `offsets` must be a valid “reduction” of the global extents defined by `shape`, and (c) the integer values in `offsets` must be local offsets into the subspan implied by `offsets` and `shape`. For example, if `offsets` has shape [ 1, 3 ] and `shape` has extents [ 7, 3, 5, 2 ], then the implied subspan is [ 7, 1, 5, 2 ], and the integer values in `offsets` must be local offsets into this subspan. The operation produces `output`, which has the same shape as `offsets`, but all integer values in `output` have been converted from local offsets to global offsets.
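A NumPy sketch of what such a conversion does (again just illustrative; the function name follows the hypothetical fragment above, and the shape is passed as a plain tuple of extents rather than a tensor):

```python
import numpy as np

def shape_from_subspan(offsets, shape):
    """Sketch: convert *local* offsets (linearized within the subspan implied by
    `offsets` and `shape`) into *global* row-major offsets into `shape`."""
    shape = tuple(shape)
    # pad the reduced shape with trailing singletons, as in the [1, 3] vs. [7, 3, 5, 2] example
    red = tuple(offsets.shape) + (1,) * (len(shape) - offsets.ndim)
    axes = tuple(d for d in range(len(shape)) if red[d] == 1 and shape[d] != 1)
    sub = tuple(shape[d] for d in axes)                        # implied subspan, e.g. (7, 5, 2)
    out = np.empty(offsets.shape, dtype=np.int64)
    for idx in np.ndindex(offsets.shape):
        sub_idx = np.unravel_index(int(offsets[idx]), sub)     # indices along the reduced axes
        full = list(idx) + [0] * (len(shape) - offsets.ndim)
        for a, s in zip(axes, sub_idx):
            full[a] = s
        out[idx] = np.ravel_multi_index(tuple(full), shape)    # global offset
    return out

# e.g. targets with shape (1, 3) holding local offsets into the (7, 5, 2) subspan:
targets = np.array([[5, 0, 69]])
global_targets = shape_from_subspan(targets, (7, 3, 5, 2))     # shape (1, 3), global offsets
```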
Wrapping it all together, we then define our `softmax_cross_entropy` operation as:

```
fragment softmax_cross_entropy( x: tensor< scalar >, target: tensor< integer > ) -> ( y: tensor< scalar > )
{
    axes = reduction_axes( x, target );    # axes implied by shape of x and target
    m = max_reduce( x, axes = axes );
    z = sub( x, m );
    e = exp( z );
    sum = sum_reduce( e, axes = axes );
    global_target = shape_from_subspan( target, z );
    gathered_z = gather_elements( z, global_target );
    y = log( sum ) - gathered_z;
}
```
Although we have endeavored to define `softmax_cross_entropy` as a compound operation, it still requires a hypothetical `reduction_axes` operation which deduces the `axes` attribute from the shapes of `x` and `target`. Since NNEF doesn’t allow operations to return attribute results, this is not valid NNEF. Anyhow, we present `softmax_cross_entropy` as a compound operation with this hypothetical non-standard blemish mainly for pedagogical purposes.
To summarize, we implemented `argmin_reduce` and `argmax_reduce` operations that return global offsets instead of local offsets. We also implemented the version of `gather_elements` that accepts global offsets instead of local offsets. We suggest the NNEF committee consider this approach over the local offsets, since, at least without further use cases to analyze, it seems like the global offset approach is more advantageous.
For training, we implemented `softmax_cross_entropy` using similar concepts. If NNEF would like to support training, we suggest adding `sigmoid_cross_entropy` and `softmax_cross_entropy` to the standard. Our suggestion is that `softmax_cross_entropy` should be defined as a primitive operation. Although we’ve shown that `softmax_cross_entropy` can almost be defined as a compound operation, it still requires the `reduction_axes` operation, which has non-standard semantics. Another reason to define `softmax_cross_entropy` as a primitive operation is that its derivative is much simpler to compute as:

```
dX = softmax( x, axes ) - one_hot( target );
```
where `one_hot` is another hypothetical operation that gives the explicit one-hot vector representation of `target`. Again, although it would be feasible to define a `one_hot` operation so that `softmax_cross_entropy` could have a definition as a compound operation, it is much more efficient in practice to compute this part of the derivative directly inside a primitive operation with an explicit intrinsic formula supported by the compiler.
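To illustrate the kind of intrinsic we have in mind (a NumPy sketch, not our production code; `target` is assumed to hold class indices with size 1 along the class axis, keepdims style):

```python
import numpy as np

def softmax_cross_entropy_grad(x, target, axis=1):
    """Sketch: dL/dx = softmax(x) - one_hot(target), computed without ever
    materializing the full one-hot tensor as a separate graph operation."""
    z = x - x.max(axis=axis, keepdims=True)                 # stabilized softmax
    e = np.exp(z)
    grad = e / e.sum(axis=axis, keepdims=True)              # y = softmax(x)
    hot = np.take_along_axis(grad, target, axis=axis)       # y at the target class
    np.put_along_axis(grad, target, hot - 1.0, axis=axis)   # y - t at those positions
    return grad

# usage: batch of 4, 10 classes, target indices kept along axis 1
x = np.random.rand(4, 10)
t = np.array([[3], [7], [0], [9]])
g = softmax_cross_entropy_grad(x, t, axis=1)                # equals softmax(x) - one_hot(t)
```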
Thanks for all the detailed feedback! I agree that probably converting between local and global offsets just complicates things, though different implementations may prefer different representations, so it's not trivial to tell. Also, another possibility is to use n (= length of axes) dimensional indices instead of a single offset (essentially a tuple). Operations like `gather_nd` and `scatter_nd` use that approach for handling multiple axes, and it seems like that approach avoids the local-global conversion problems, at the cost of a somewhat more complex representation (the n-dimensional index is represented as n values along a new dimension of the index data). So there are various approaches to the problem to consider.
Also, I wanted to note that currently we are working on version 2.0 of NNEF, which will feature a much richer syntax to describe operations in detail, including shape computation and the actual output computation with math formulae. For example, as you correctly mention, `reduction_axes` above would be invalid syntax. In version 2.0, it will be possible to describe that as an expression on attributes/shapes, which will be legal syntax (and separate from tensor computation). With all the machinery of version 2.0, describing custom operations will be possible as needed, so it will be possible to define exactly the variant that you need. Also, it will be possible to define various primitive operations required for the training pass, though we haven't done that exercise yet. We will soon publish a first draft of version 2.0 for feedback and community development, and I'll keep you posted about it!
That's exciting news about NNEF 2.0. We will be following that, for sure.
We thought about n-dimensional indices instead of offsets too. But I think this does not avoid the need to consider global vs. local coordinates; i.e., the issue is more fundamental and independent of the actual representation (indices vs. offsets).
For example, it seems you must always know if you are indexing a subspan [7,1,5,2] or the original shape [7,3,5,2], regardless of representation. Therefore, you may always be able to find a hypothetical use case that requires the opposite of whatever the "standardized" choice may be defined as, and the representation (indices vs. offsets) may affect implementation details but it does not affect the proper semantics.
I have a background in computer graphics, and this reminds me a lot of local and global coordinate frames. Some use cases in computer graphics require computations in global coordinates, while others require computations in local coordinates. Coordinate transformation is so fundamental, it is accepted as part of every computer graphics language standard.
I would suggest exploring what are the conceivable use cases for NNEF and try to identify which choice is best for the majority of the "most common" use cases, knowing that one choice will probably not satisfy all use cases and conversions will likely always be required for the other use cases. My suspicion is, that may be the best that can be expected.
Not sure why n-dimensional indices have the same problem of global vs local. In that case, each component of the index is interpreted within a single dimension. Of course, you still need to know which dimensions (axes) they pertain to, but that's much simpler to propagate through the graph than to map indices between local and global dimensions.
As for exploring use cases and choosing the best one, unfortunately the situation with standardization is not so ideal. We have to support conversion of models that are out there in various formats and align with what other frameworks and engines use. If the majority uses an inconvenient representation, we still need to support that. And if there's a better one but no frameworks support it, it becomes a bit pointless to support it in practice. This was an unpleasant lesson to learn for me, as I was also trying to 'clean things up', but in practice it did not always work out. But this is why I think a good approach could be to provide comprehensive support for custom operations, because then what is the standardized form becomes less relevant, as writing the custom one that fits your needs could become easy. This is why version 2.0 will put much focus on defining (custom) operators.
> Not sure why n-dimensional indices have the same problem of global vs local. In that case, each component of the index is interpreted within a single dimension. Of course, you still need to know which dimensions (axes) they pertain to, but that's much simpler to propagate through the graph than to map indices between local and global dimensions.
I don't think this is accurate. To understand why, just think very carefully about this simple example:
```
input = ...  # assume input is a scalar tensor with shape [ 7, 3, 5, 2 ]
indices = argmax_reduce( input, axes = [ 0, 2, 3 ] );  # indices has shape [ 1, 3, 1, 1 ] + extra for a 4-dimensional index answer
# the subspan implied by input and axes is [ 7, 1, 5, 2 ]
values = gather_elements( input, indices );
```
To formally define the `argmax_reduce` and `gather_elements` operations, you still need to specify if the n-dimensional index values in `indices` are global or local. If the n-dimensional indices are global, they must index into the shape [ 7, 3, 5, 2 ] of `input`. If the n-dimensional indices are local, then they must index into the shape of the subspan [ 7, 1, 5, 2 ]. It is not logically possible that both choices can be accommodated.
In either case, it is possible to make `gather_elements` work correctly. However, as in the case of using offsets instead of n-dimensional indices, if the choice is that the n-dimensional indices are local, then the `gather_elements` operation must still convert the n-dimensional indices from local to global in order to gather the correct values from `input` (there is no logical way to avoid conversion in this case).
In other words, changing the representation (indices vs. offsets) does not eliminate the need to specify the semantics. If `argmax_reduce` returns local n-dimensional indices, `gather_elements` must still convert those local indices into global indices in order to return a proper result, and if `argmax_reduce` returns global n-dimensional indices, `gather_elements` does not need to perform any conversions. Either way, it is the exact same situation as if you used offsets instead of n-dimensional indices (i.e., you have changed implementation details of the representation, but you have not eliminated the need to specify local and global semantics).
Another way of seeing the same argument is that any tensor offset is a function of the n-dimensional indices and the extents of the shape that those indices index into. So mathematically, offsets and n-dimensional indices are logically equivalent and can be freely substituted for each other, because they are different representations of the same position in a tensor. Therefore, if a use case requires converting offsets between local and global, then it also requires converting n-dimensional indices between local and global, and vice-versa.
I would also point out that using n-dimensional indices makes the rank of `indices` twice as big in this example. Since the NNEF tensor file format only allows 8 dimensions, we are already at the maximum.
> To formally define the argmax_reduce and gather_elements operations, you still need to specify if the n-dimensional index values in indices are global or local. If the n-dimensional indices are global, they must index into the shape [ 7, 3, 5, 2 ] of input. If the n-dimensional indices are local, then they must index into the shape of the subspan [ 7, 1, 5, 2 ]. It is not logically possible that both choices can be accommodated.

> I would also point out that using n-dimensional indices makes the rank of indices twice as big in this example. Since the NNEF tensor file format only allows 8 dimensions, we are already at the maximum.
I'm not sure we are on the same page with this representation; I don't understand why you say a global/local distinction would still exist in case of n-dimensional indices. In the example, in my mind:
```
indices = argmax_reduce( input, axes = [ 0, 2, 3 ] );  # indices has shape [ 1, 3, 1, 1, 3 ] or [ 3, 3 ]
                                                       # depending on whether we squeeze out the reduced dimensions
```

where the last 3 in the shape is because the reduction is over 3 axes. In general, the last dimension is the length of the index tuple, i.e., the number of axes the reduction is over. So it always adds only 1 extra dimension to the indices.
The advantage of this representation is that all index components (of the tuple) are interpreted along a single axis, where the local/global distinction is unnecessary / meaningless (they are the same), hence from a specification point of view it is clear, and void of implementation details (an offset can be problematic for an implementation that does not use continuous memory layout). From the point of view of implementation, of course, to gather the data from the input, the offsets to the data elements (or something equivalent) must be computed from the indices (all you need to know is which axes they pertain to), but I think that's simpler than converting between local and global offsets (and is a ubiquitous computation in n-dimensional computing).
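Roughly, in NumPy terms, that representation and the gather step could look like this (just a sketch to illustrate the layout, not proposed spec text):

```python
import numpy as np

x = np.random.rand(7, 3, 5, 2)
axes = (0, 2, 3)

# argmax over the subspace spanned by `axes`, returning an index *tuple* per kept
# position: result shape (1, 3, 1, 1, 3) -- the trailing 3 is len(axes)
moved = np.moveaxis(x, 1, 0).reshape(3, -1)               # kept axis first, rest flattened
flat = moved.argmax(axis=1)                               # flat position within (7, 5, 2)
tuples = np.stack(np.unravel_index(flat, (7, 5, 2)), -1)  # shape (3, 3): one (i0, i2, i3) per j
indices = tuples.reshape(1, 3, 1, 1, 3)

# gather: each component of the tuple is an index along a single axis of `x`
vals = np.array([x[i0, j, i2, i3] for j, (i0, i2, i3) in enumerate(tuples)])
assert np.allclose(vals, x.max(axis=axes))
```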
Hi Viktor,
It still isn't true. Let me try to clear it up with a more explicit explanation.
For any given n-dimensional tensor, if I = { i_1, i_2, ..., i_n } are the indices of a tensor element and E = { e_1, e_2, ..., e_n } are the extents of the tensor, then the offset of the tensor element is governed by the equation:

offset = ( ( ( i_1 ) * e_2 + i_2 ) * ... ) * e_n + i_n
Fundamentally, this equation relates `offset`, `I`, and `E`. In your example, I see that you are suggesting `argmax_reduce` should return `I` instead of `offset`. However, you have also failed to specify `E`. Therefore, your formal specification of `argmax_reduce` is incomplete, because we cannot even know how to compute offsets from the indices of `I` without knowing `E`!
When considering a formal specification for the `argmax_reduce` operation, and as the early exchanges on this thread show, there are at least two reasonable choices for `E`. Either `E` is the shape of `input`, or `E` is the shape of the subspan implied by `input` and `axes`. The former defines `I` as global indices, and the latter defines `I` as local indices.
Therefore, it is not possible to avoid specifying the global or local semantics of `argmax_reduce`, even by returning `I` instead of `offset`, because a definition for `E` must be given in order to provide a complete formal specification of the operation.
My impression is that you may have made an assumption in your mind that the indices of `I` are global. I see that you also raise issues like the tensor data may not be tightly packed, and so you argue that indices are superior to offsets. I think that's debatable, but I'm also saying that's a completely separate and unrelated issue from what I'm talking about. Even though I have some thoughts on that topic, too, for the sake of trying to untangle what may otherwise be a conflation of unrelated topics in this thread, I'll leave it for now.
Agree about the relation between `offset`, `E` and `I`. I guess I'm starting to understand the differences in our thinking.
What I had in mind is that what `argmax_reduce` would return is not a complete `I`, but only those components designated by `axes` (unless the reduction is along all the axes); the remaining dimensions are encoded in the shape of the indices. So the indices are only meaningful if we also know what axes they pertain to (is this what you mean by local?). Hence, from the returned indices alone, it is not possible to compute an offset (unless it is complete). When the indices are passed back to `gather_elements`, it also receives the input, so it knows `E`, and it should also receive the `axes`. Maybe this was the different assumption in our minds? So that way, `gather_elements` can produce an output that has the same shape as the input along the dimensions that are not in axes, and 1 along the axes, where it fetches a single element designated by the given partial index. To actually fetch the elements, it loops through the remaining axes; that is how it will have all the index components to make the offset computation. Does that make sense?
So yes, in this respect the indices are local in the formulation I had in mind, so they are only meaningful if we also know the axes and pass them to the operations that will use the indices. It is a possible definition, but maybe not the best generic one. But my point is that there could be various possible formulations of this problem, maybe the most efficient is different for different problems, but custom operations would allow you to define what you actually need.
> What I had in mind is that what argmax_reduce would return is not a complete I, but only those components designated by axes (unless the reduction is along all the axes); the remaining dimensions are encoded in the shape of the indices. So the indices are only meaningful if we also know what axes they pertain to (is this what you mean by local?).
Maybe. I'm not sure we're completely on the same page, because when I hear you say "complete" and "incomplete," I don't entirely understand what the definitions are, or why you can't compute offsets of "incomplete" indices.
All I'm saying is that for `argmax_reduce`, we need to define `E`!
In my nomenclature, "local" is when `E` is the shape of the subspan implied by `input` and `axes`.
For example:
```
input = ...  # assume input is a scalar tensor with shape [ 7, 3, 5, 2 ]
output = argmax_reduce( input, axes = [ 0, 2, 3 ] );
```
Now, if `E` is "local", then the shape of `E` is [ 7, 1, 5, 2 ], because `axes` specifies that dimensions 0, 2, and 3 are reduced, and dimension 1 is not.

But, if `E` is "global," then the shape of `E` is the same as the shape of `input`, i.e., [ 7, 3, 5, 2 ].

Either way, we always know how to compute between offsets and indices, because nothing about `E` is left unspecified.
Note: I do not try to squeeze out the singleton dimensions when `E` is "local," because the singleton dimensions of `E` implicitly encode what the original value of `axes` was in this case, and that turns out to be very valuable information. For example, it retains the integrity of the indices returned by `argmax_reduce` and alleviates the need for passing `axes` around to other operations, as you've suggested. Our `softmax_cross_entropy` use case especially takes advantage of this idea.
Also: Do not confuse the shape of `E` with the shape of `output`! The shape of `output` should still be exactly as the current NNEF standard specifies. Remember, `E` is the shape used to calculate the offsets and/or indices which are returned in the actual integer values of `output`.
By incomplete (or maybe partial is a better word) I simply mean that only index components corresponding to `axes` are returned. So the number of index components in an index tuple (pointing to a single item) may be less than the number of dimensions of `input`. They are not always `n` dimensional; they may be `k < n` dimensional. This is why such a multi-index is insufficient for calculating the original offset in `input`: you cannot substitute it into the formula you wrote above, as some components are missing.
The reason I find your local vs global distinction hard to interpret is that the index components returned correspond exactly to those dimensions where your local and global definitions of `E` agree. So in that sense it simply does not matter; for the definition of `argmax_reduce`, we don't need to specify any definition of `E`; all we need to know is that `k <= n` indices are returned, corresponding to the `axes` of length `k` (and the axes are obviously interpreted wrt the input). That is enough information to use the indices to pick out the designated values, given that somehow we know which axes they pertain to (and we can loop over the remaining dimensions to complete the partial index tuples). And as you say, if we don't squeeze the reduced dimensions out of the indices, then `gather_elements` does not even need to get `axes` again, as the singleton dimensions encode that. Though I think there's a corner case here: if multiple dimensions in the original `input` shape happened to be singleton to start with, then we later cannot necessarily tell which dimension was reduced over just by looking at the shape of the returned indices tensor and comparing it with the shape of `input`, as it will be ambiguous.
What is the result of argmin_reduce and argmax_reduce when multiple dimensions are reduced?
For example, consider the tensor:
If we argmax_reduce dimension 0 (columns), the result is [ 1, 2, 0 ]^T, and if we argmax_reduce dimension 1 (rows), the result is [ 2, 0, 1 ].
But what is the result if we argmax_reduce both dimensions?
It seems one plausible result could be 6, which is the global element offset to value 9.1 in the input tensor. However, this appears to be inconsistent with the current spec.
I believe the problem is that argmax_reduce and argmin_reduce are not separable across multiple reduction axes. Maybe the spec needs to be clear that only one dimension can be reduced with argmax_reduce and argmin_reduce, or perhaps a different definition is needed to specify the result as a global element offset from the input tensor.