KhronosGroup / NNEF-Docs

NNEF public repository
Apache License 2.0

sum_reduce and mean_reduce #17

Open zoeoz opened 5 years ago

zoeoz commented 5 years ago

What was the rationale to give sum_reduce a normalize attribute while also providing a mean_reduce operation, instead of just having sum_reduce and mean_reduce operations without the normalize attribute?

It seems the current definitions both require compile-time decisions by the programmer, so it's not clear to me the value of having the two equivalent definitions. It also causes sum_reduce to have a different parameter signature than the other reduce operations, which requires a little extra work to support in the parser and compiler.
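For reference, my reading of the spec is that the two spellings below compute the same result (a rough sketch, assuming x is some input tensor):

y1 = sum_reduce(x, axes = [1], normalize = true);   # sum over axis 1, then divide by its extent
y2 = mean_reduce(x, axes = [1]);                    # the same mean, spelled directly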

gyenesvi commented 5 years ago

Originally, it started with the box and debox operations, which have the normalize attribute to achieve both average and sum pooling. But then we added the avg_pool (= box + normalize) operation as well for convenience. Then we wanted to have the same for reductions, and that's how we ended up with the current situation. I agree that it is redundant. I think the clean solution would be to remove the normalize attribute from sum_reduce, do away with the box and debox operations, and introduce sum_pool, sum_unpool and avg_unpool operations instead. But that would be a non-backward-compatible (although functionally equivalent) change.

zoeoz commented 5 years ago

Thanks, I was curious and figured there was some history behind that.

As we look at mapping the output of the NNEF compiler to actual hardware, we generally think of each NNEF operation as, for example, an OpenCL kernel (or similar concept). But this assumes each NNEF operation performs only one specific function. The normalize flag in this case essentially changes the function of the operation, so we really think of it as two separate operations: sum_reduce and mean_reduce.

You are in a better position to know what kind of impact changes might have on the NNEF user base. Since NNEF is still relatively new, I would think backwards compatibility isn't a really big issue yet. I can say from experience: the new 1.0.1 spec broke lots of our internal NNEF test suites because we were using variable operations in compositional syntax like the old GoogleNet examples in the 1.0 spec. But it wasn't very difficult to upgrade because there's not a long history with NNEF yet.

Of course, backwards compatibility becomes very important over time as adoption grows and the installed base becomes more entrenched.

gyenesvi commented 5 years ago

I think in general the one-to-one mapping does not hold, and not just for this example. There are many operations where different parameterisations would give you different kernels. For example, in convolution, a depth-wise convolution (groups = 0) is quite different in implementation from a regular one (groups = 1), and it could even be mapped to a list of kernel invocations when it is grouped with a value that is neither 0 nor 1. Or a 3x3 convolution can be executed with a Winograd implementation, whereas other sizes require a more general kernel. So it is totally hardware dependent how the graph is mapped to the hardware: some targets may fuse a sequence of ops into one kernel, some may split one op into multiple kernels, etc. NNEF is a high-level mathematical description; it has no goal of mapping directly to hardware.
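Just to illustrate with a sketch (the names and group counts are made up), the same conv fragment can call for quite different kernels depending on a single attribute:

regular   = conv(input, filter1, groups = 1);   # one dense kernel
depthwise = conv(input, filter2, groups = 0);   # groups = number of channels, i.e. depth-wise
grouped   = conv(input, filter3, groups = 4);   # may map to several kernel invocations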

zoeoz commented 5 years ago

Correct. I should have been more specific. In my mind I was thinking "generally" about what we've come to feel are the families of fundamental operations: the unary, binary and reduce operations. To your point in the broader view of the standard, even the matmul operation with the transpose flags could map to different algorithms that optimize memory access for the different indexing patterns, or use block partitioning methods, etc.

Anyhow, I'm not sure it's important. I was mainly agreeing with your comments about sum_reduce and mean_reduce.

gyenesvi commented 5 years ago

Yes, you are correct about the matmul as well.

So is there any blocking issue here, or is it okay to accept that the mappings are not one-to-one but may be one-to-many or many-to-one, and that sum_reduce with normalize = true and mean_reduce simply fit that pattern?

zoeoz commented 5 years ago

Correct, no blocking issues here. If anything, it's mainly about streamlining the API in the ways you mentioned in your first response on this thread. Since NNEF is still new, I'd encourage the WG to think about doing this before adoption becomes too entrenched. At the very least, deprecating the normalize flag in sum_reduce would partly accomplish that goal without making any breaking changes.

gyenesvi commented 5 years ago

Okay, that is something to think about. Actually, what would then make sense to me is to add some convenience operations that wrap the box/debox operations as well, without breaking backward compatibility but making the situation more consistent. So we'd have:

sum_reduce (deprecating the normalize flag), mean_reduce

sum_pool (= box w/o normalize) *, sum_unpool (= debox w/o normalize) *

avg_pool (= box with normalize), avg_unpool (= debox with normalize) *

The ones marked with * would be newly added. What do you think?

zoeoz commented 5 years ago

Yes. I think you could deprecate box and debox as well. This would allow them to be gradually phased out just like the normalize flag in sum_reduce.

gyenesvi commented 5 years ago

I will think about that too, although my idea was to keep box/debox and derive the actual pooling operations from them.

zoeoz commented 5 years ago

Are you talking about the spec or the reference implementation?

As far as the spec goes, I'd think the average pooling operations could be defined in terms of the sum pooling operations, since they are the same except for normalization by volume.

gyenesvi commented 5 years ago

Talking about the spec. I thought about expressing the average via the sum, but there is no general way to calculate the volume, because there is no expression that calculates the product of a list of items when the length of the list is arbitrary. Furthermore, in the case of reduction, the list whose product would be taken is not given explicitly but is derived from the shape of the input along the axes of reduction, and since we deprecated the shape_of function, there is no way to get that shape.

zoeoz commented 5 years ago

I'm not understanding what the difficulty is. For example, the result semantics of the box operation currently says:

When normalize = true, the output is divided with the volume of the kernel (as indicated by the size parameter).

For sum_pool this bullet item can be omitted, and for avg_pool it can simply say the output is divided with the volume of the kernel (as indicated by the size parameter). I don't understand what shape_of has to do with this, or even how box would somehow still be required.

gyenesvi commented 5 years ago

It's one thing to define it in words in the spec and another to define it formally via NNEF syntax, and I thought you meant the latter. If there were a builtin function or expression, say __prod__, that yields the product of a list, then we could define the averages as compound fragments:

fragment avg_pool( input: tensor<scalar>, size: integer[], ... ) -> ( output: tensor<scalar> )
{
     output = sum_pool(input, size = size, ... ) / __prod__(size);
}

In case of reduce, it would have to look like:

fragment mean_reduce( input: tensor<scalar>, axes: integer[] ) -> ( output: tensor<scalar> )
{
     shape = shape_of(input);
     output = sum_reduce(input, axes = axes) / __prod__([for i in range_of(shape) if i in axes yield shape[i]]);
}

Of course, the box operation avoids this by just saying in words that it is normalised by the volume as indicated by the size parameter. And what I originally thought was to describe sum_pool and avg_pool as:

fragment sum_pool( input: tensor<scalar>, size: integer[], ... ) -> ( output: tensor<scalar> )
{
     output = box(input, size = size, ... , normalize = false);
}
fragment avg_pool( input: tensor<scalar>, size: integer[], ... ) -> ( output: tensor<scalar> )
{
     output = box(input, size = size, ... , normalize = true);
}

to show the correspondence. But I agree that it does not add too much extra info.

I'd be okay with deprecating box/debox, and for consistency I would even go further and do some cleanup around max pooling (like defining max_unpool), which is now defined via sample/desample, which is also more complicated than it really needs to be, and I'd deprecate those as well. But I am afraid that such a wave of deprecations would not be well received in the WG.

zoeoz commented 5 years ago

Generally I'm glad to hear the WG may have a skeptical view of such changes or deprecations. The WG is, after all, tasked with maintaining a standard, and it can be perceived negatively if things are changing all the time. However, NNEF is also very new. So I think it's natural to expect that initial adoptions and implementations may reveal a few things that with the benefit of hindsight may have been done a little better.

Case in point: the new 1.0.1 spec broke all of our NNEF files because variable operations are no longer allowed in fragments. This wasn't even a deprecation; it was an outright non-backwards-compatible change. However, it wasn't a big deal (at least for us), since NNEF is new and not yet so entrenched, even in our own internal development process, where we are starting to use it quite a bit.

With time, NNEF will become more entrenched, and the cost of making changes will increase. So the question to consider now is whether the long-term advantages of a more streamlined API outweigh the short-term disruption it might cause to early adopters. I'm sure the WG has much better insight than I do into how many NNEF users currently exist and how entrenched that base is.

gyenesvi commented 5 years ago

In fact you got it right: the WG has to keep in mind that a standard should not change too often. But I agree that straightening things out in the beginning can be beneficial in the long run, especially if we make changes carefully, for example by making sure they do not cause too much trouble for the active users.

As for the variable change, we thought it was okay because we did not expect that use case to actually exist (we did not know of any such uses except the examples that were shown in the spec but were not part of it). I wonder how you actually made use of variables in fragments. Did you use them via the layer-level examples that were in the spec? Or something else?

zoeoz commented 5 years ago

We had copied the idea of the layer level examples that were shown in the Appendix of prior versions of the spec. This was our most common use of fragments. It's interesting to note that the prior spec mentioned "the real power of hierarchical descriptions becomes evident for networks that have even more repetitive structure. In GoogleNet, the 'inception' module is repeated, which can be described as follows..." and then provides NNEF syntax using the layer level examples. Indeed, since the 1.0.1 change we generally don't use fragments very much.
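Roughly, our helpers looked something like this (reconstructed from memory; the names, shapes and labels are illustrative, not the Appendix text):

fragment conv_layer( input: tensor<scalar>, channels: integer, scope: string )
-> ( output: tensor<scalar> )
{
     # a variable declared inside the fragment, with the label derived from a scope
     # parameter so each instantiation gets its own weights (16 stands in for the
     # input channel count here)
     filter = variable(shape = [channels, 16, 3, 3], label = scope + '/filter');
     output = relu(conv(input, filter));
}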

gyenesvi commented 5 years ago

I understand what you mean. An issue we realized was that writing descriptions with many levels of hierarchy, using increasingly compound fragments, is mostly useful when you want to describe a network manually. Then it saves you a lot of repetition and gives you an elegant description.

However, that is not the main use case of NNEF. NNEF is converted (written and read) by automatic tools, and those tools would not have the knowledge to organize descriptions into many levels of fragments, unless the starting point (the formats from which we convert) already contains that information, or the targets (that read NNEF) can make use of that structural info beyond just lowering it, which does not seem to be the case. For example, in the case of the inception module that you mention, the source formats don't contain that info, so it cannot be written out automatically, nor can consumers make use of it (it is too big a building block to implement in one piece), so they would just lower it back.

So we realized that describing very large building blocks is not a real use case for fragments; the more realistic use is to describe slightly more compound operations (without side effects), such as shuffle or separable_conv, which actually have a chance to be mapped to a single op in a framework or inference engine.

So I am assuming that you used such fragments before because you wrote them manually, instead of generating them automatically, right?

zoeoz commented 5 years ago

Correct. We wrote them manually.

I think the rationale you give above was implied by the nature of the 1.0.1 changes. At least, this was our impression. When I first saw the changes, my jaw did drop but at the same time it made a lot of sense. Thanks for the confirmation.

I agree that by using fragments to describe the inception module, for example, the fragment was essentially being used as a rudimentary authoring tool, not unlike the C++ preprocessor. This level of abstraction is nice for manual authoring but, as you mention, too large to be useful when mapping to real-world inference engines (or training engines, in our case). It may be interesting to think about how third-party authoring tools could be developed to provide this functionality, but generally speaking, I agree it doesn't need to be within the purview of the NNEF spec proper.

The changes therefore seem to put the focus on "flat" syntax in the graph definition as the main use case, with implementers deciding whether certain NNEF operations are mapped internally to a single operation or to multiple operations (i.e., a fragment). It also seems to imply that most implementers will not need to support the KHR_enable_fragment_definitions or KHR_enable_operator_expressions extensions, lowering the barrier to entry for adopters.
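For example, I'd expect a converter to just emit a flat graph along these lines (a made-up minimal example):

graph net( input ) -> ( output )
{
     input = external(shape = [1, 3, 224, 224]);
     filter = variable(shape = [16, 3, 3, 3], label = 'conv1/filter');
     conv1 = conv(input, filter, padding = [(1, 1), (1, 1)]);
     output = relu(conv1);
}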

gyenesvi commented 5 years ago

Right, flat syntax is the core, since that is what is mainly used in a regular conversion process, and implementers can actually get away without implementing the compositional syntax; that was the intention, to lower the barrier to entry. At the same time, we were expecting most implementers to just use the readily available parser from Khronos.

One important use case of the KHR_enable_fragment_definitions and KHR_enable_operator_expressions extensions is the definition of custom operations (either primitive or compound), which allows the parser to handle non-standard operations as well and propagate them to further levels of the stack that actually know what to do with them. For example, a consumer may know how to map a custom operation to its implementation, but the parser itself does not need to know too much about it.
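For example, with those extensions enabled, a single body-less fragment declaration is enough for the parser to accept and pass through invocations of a custom primitive (my_custom_norm here is just a made-up name):

extension KHR_enable_fragment_definitions;

# declared without a body, so the parser treats it as an opaque primitive
fragment my_custom_norm( input: tensor<scalar>, epsilon: scalar = 0.00001 ) -> ( output: tensor<scalar> );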