alrevuelta opened this issue 4 years ago
@alrevuelta the maxpool operator has two constraints (`T` and `I`), therefore both must be resolved.
Why a constraint even exists for a single type is beyond me.
I could add a check for constraints that only feature a single type and skip them accordingly, but this would add complexity just for these corner cases.
@nopeslide Right, this is a specific case, since constraint `I` is always `int64`. Looking at the operators, this happens in quite a few of them, so I think we can do something like: if there are just two constraints and one allows only a single type, just "ignore" it.
On the other hand, maybe we could resolve ONLY according to the input type, and handle the output type inside each function.
I'm suggesting this because operators like EyeLike have many combinations (with the current implementation this would be 12 switch cases with 12 cases each = 144 combinations). I don't think we really need 144 functions. We can have just 12 (one per input constraint) and handle the output inside each function with a simple switch case, as sketched after the constraints below.
This would also make the resolving process easier, because knowing the output type of a tensor in the resolving phase is not trivial. I mean, when we resolve the operators/functions, we don't really know what the output type is going to be (i.e. we don't check the value of the attributes).
> `T1` : tensor(float16), tensor(float), tensor(double), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(bool)
> Constrain input types. Strings and complex are not supported.
>
> `T2` : tensor(float16), tensor(float), tensor(double), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(bool)
> Constrain output types. Strings and complex are not supported.
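To make this concrete, here is a minimal sketch of what one of those 12 functions could look like for EyeLike. This is illustrative only: apart from `operator_status` and `node_context`, every name (`tensor`, `TENSOR_TYPE_*`, `OP_STATUS_*`, the data fields) is a placeholder, not the project's actual API.

```c
#include <stddef.h>

/* Sketch: one function per resolved input constraint T1 (here float);
 * the output constraint T2 collapses into a switch on the output type.
 * Zero-filling and the k attribute are elided. */
operator_status operator_eyelike_float(node_context *ctx)
{
    tensor *input  = ctx->inputs[0];
    tensor *output = ctx->outputs[0];

    for (size_t i = 0; i < input->n_rows; i++) {
        size_t d = i * input->n_cols + i;   /* diagonal element */
        switch (output->data_type) {        /* T2 handled here */
        case TENSOR_TYPE_FLOAT:  output->float_data[d]  = 1.0f; break;
        case TENSOR_TYPE_DOUBLE: output->double_data[d] = 1.0;  break;
        /* ...one case per remaining T2 type... */
        default: return OP_STATUS_ERROR;
        }
    }
    return OP_STATUS_OK;
}
```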
I'm implementing a new model that contains the `Constant` operator, and it looks like an interesting case.
This operator doesn't have any input, just one attribute and an output. The attribute that is set indicates the output type, so the resolving should happen according to the attributes and not to the input type. Just FYI.
@alrevuelta Leaving out the output constraints is a good idea. If we do so, the resolving needs to happen on demand. We could start by resolving each time the operator is called and later come up with a way to save the resolved function. IIRC the unconstrained executer is a trampoline which resolves the operator and then executes it. We could replace the resolver function pointer in the context with this executer, and let it overwrite that same pointer in the context with the resolved constrained executer.
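For illustration, a minimal sketch of that trampoline; the `executer` pointer in the context and `resolve_constrained_executer` are assumed names, not the actual API:

```c
#include <stddef.h>

/* Sketch of the lazy-resolving trampoline: field and helper names
 * are placeholders. */
operator_status unconstrained_executer(node_context *ctx)
{
    /* first call: the input types are known by now, so resolve... */
    ctx->executer = resolve_constrained_executer(ctx);
    if (ctx->executer == NULL)
        return OP_STATUS_ERROR;
    /* ...and overwrite the pointer in the context, so every later
     * call jumps straight to the constrained executer */
    return ctx->executer(ctx);
}
```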
@nopeslide So you are suggesting that instead of resolving directly to the final function (operator + version + float/double/...), we resolve only to operator + version? If so, that could be the solution, but I liked the idea of preresolving everything before running inference, so let's think a bit more about how we can avoid that.
Something else related to this just came up. I'm especially concerned about this line in `inference.c`: `all_context[nodeIdx].outputs[i]->data_type = 1;`. Currently that's hardcoded for float, but we need to correctly populate that parameter in the resolving phase. In most cases it's fine to assume that the output type will be the same as the input, but for a few operators this is not the case. This is annoying because for some operators (like `Constant` or `Cast`) we need to dig into the attributes to know what type the output is going to be. So either we handle these operators as special cases, or we find a more general mechanism, something like the sketch below.
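For illustration, a hedged sketch of populating the output type during resolving; the `get_attribute_*` helpers and the context fields are hypothetical, while the `to` and `value` attributes are the real ONNX attributes of `Cast` and `Constant`:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: set outputs' data_type in the resolving phase instead of
 * hardcoding 1 (float). Helper names are hypothetical. */
void resolve_output_type(node_context *ctx, const char *op_type)
{
    int32_t out_type;

    if (strcmp(op_type, "Cast") == 0) {
        /* Cast carries the target type id in its "to" attribute */
        out_type = (int32_t)get_attribute_int(ctx, "to");
    } else if (strcmp(op_type, "Constant") == 0) {
        /* Constant has no inputs; its output type is the type of
         * its "value" tensor attribute */
        out_type = get_attribute_tensor(ctx, "value")->data_type;
    } else {
        /* common case: output type equals the first input's type */
        out_type = ctx->inputs[0]->data_type;
    }

    for (size_t i = 0; i < ctx->n_outputs; i++)
        ctx->outputs[i]->data_type = out_type;
}
```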
@alrevuelta

> If so, that could be the solution, but I liked the idea of preresolving everything before running inference, so let's think a bit more about how we can avoid that.

The only alternative I see would be a function per operator that returns the output types, but imho that would be more complicated than doing it on demand.
Just found another interesting case: the `OneHot` operator. Its constraints are the following:

> `T1` : tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double)
> Constrains input to only numeric types.
>
> `T2` : tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double)
> Constrains input to only numeric types.
>
> `T3` : tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double), tensor(string), tensor(bool), tensor(complex64), tensor(complex128)
> Constrain to any tensor type.
This creates a huge number of permutations: the generated `resolve_operator_onnx_onehot_11.c` is 2774 lines long, with a total of 1815 functions. I have no clue how to solve this, but it doesn't feel good to have 1815 different functions for one operator.
I would suggest generating fewer permutations: instead of permutating T1, T2 and T3, we could permutate only the two constraints with more values (i.e. T2 and T3), or even just T3, and handle the remaining cases inside each function with a switch, as in the sketch below.
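For illustration (a sketch with placeholder names, mirroring the EyeLike example above), a OneHot variant generated only for `T3 = float` could dispatch the `indices` type internally:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: permutate only T3 (here float); the indices type (T1) is
 * handled by a switch instead of separate functions. All names besides
 * operator_status/node_context are placeholders. */
operator_status operator_onehot_float(node_context *ctx)
{
    tensor *indices = ctx->inputs[0];

    for (size_t i = 0; i < indices->n_elements; i++) {
        int64_t idx;
        switch (indices->data_type) {          /* T1 collapsed here */
        case TENSOR_TYPE_INT32: idx = indices->int32_data[i]; break;
        case TENSOR_TYPE_INT64: idx = indices->int64_data[i]; break;
        /* ...remaining T1 types... */
        default: return OP_STATUS_ERROR;
        }
        /* ...use idx to write the one-hot values into float_data... */
    }
    return OP_STATUS_OK;
}
```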
Having that many combinations also makes it difficult to generate the operator status #44.
There are some other operators where we have the same issue: `SequenceInsert` and `SplitToSequence`, just to name a few.
Maybe we need another approach?
Since the algorithm stays the same and only the types change, how about we create a template implementation that uses preprocessor defines to choose the arithmetic operations, like this:
`arith.h`
```c
inline int   add_int_int(int a, int b)     { return a + b; }
inline float add_int_float(int a, float b) { return ((float)a) + b; }

/* two-step expansion so TA/TB are macro-expanded before token pasting */
#define ADD(TA,A,TB,B)  _ADD(TA,A,TB,B)
#define _ADD(TA,A,TB,B) add_##TA##_##TB(A,B)

/* pick the typed data field, e.g. DATA(float,t) -> t.float_data */
#define DATA(T,TENSOR)  _DATA(T,TENSOR)
#define _DATA(T,TENSOR) TENSOR.T##_data
```
`operator_add.template`
```c
#include "arith.h"

operator_status operator_add(node_context *ctx) {
    ...
    for (int i = 0; i < length; i++) {
        DATA(T3,output)[i] = ADD(T1, DATA(T1,input1)[i], T2, DATA(T2,input2)[i]);
    }
    ...
}
```
and let the code generator generate the file

`operator_add_int_float.c`
```c
#define T1 int
#define T2 float
#define T3 float
#include "operator_add.template"
```
which would produce

```c
inline int   add_int_int(int a, int b)     { return a + b; }
inline float add_int_float(int a, float b) { return ((float)a) + b; }

operator_status operator_add(node_context *ctx) {
    ...
    for (int i = 0; i < length; i++) {
        output.float_data[i] = add_int_float(input1.int_data[i], input2.float_data[i]);
    }
    ...
}
```
This does not solve the problem of having all these permutations, but it lets us generate exactly the permutations that are needed. I don't see another way to implement these things except checking each type inside the executer, which is madness.
The code generator also does not yet support type subsets (ignoring specific types), which may mitigate, but not solve, this problem.
Nice idea, but I'm not sure I want to rely on more autogenerated code (even if it's done with macros). If we can cover all the cases it should be fine, but I like having the flexibility of modifying the code directly. I haven't really worked with types other than `float`, but note that they might make the macros not as straightforward.
Some types (like `int32`, `int16`, ...) are stored in the same variable, so if we have `int8`, we would need to read chunks of 8 bits rather than the whole 32. Just FYI. Maybe we can even relate this to the `Cast` operator.
```c
/*
 * For int32, uint8, int8, uint16, int16, bool, and float16 values:
 * float16 values must be bit-wise converted to an uint16_t prior
 * to writing to the buffer.
 * When this field is present, the data_type field MUST be
 * INT32, INT16, INT8, UINT16, UINT8, BOOL, or FLOAT16
 */
size_t n_int32_data;
int32_t *int32_data;
```
There is one thing I noticed in some operators (like the above mentioned `OneHot`): some operators have a list of indices as input (`indices` for `OneHot`). We could just "cast" that input to `int64`, which would save a lot of permutations. However, this has to be done manually as a specific case in the Python code. Actually, this operator also has a `depth` input, which is a "Scalar specifying the number of classes in one-hot tensor"; we could also just cast it to `int64`.
This is the thing I don't like about having that much autogenerated code. It's nice to take the particularities of each operator into account and design more ad hoc solutions. This `OneHot` is the perfect example: implementing a function per `indices` and `depth` type is way overkill and maybe not even needed.
I ran some statistics on the operators; these are the ones with more than 15 combinations:
```
ai.onnx.preview.training Gradient1 45
ConcatFromSequence11 225
SequenceAt11 450
SequenceInsert11 450
ScatterElements11 30
GatherElements11 30
SplitToSequence11 450
SequenceErase11 30
Scatter9 30
Scatter11 30
OneHot9 1815
OneHot11 1815
EyeLike9 144
RandomUniformLike1 45
RandomNormalLike1 45
Slice10 30
Slice11 30
ai.onnx.ml DictVectorizer1 24
Cast1 144
Cast6 144
Cast9 169
Gather1 30
Gather11 30
Resize11 45
SequenceConstruct11 225
Pow12 55
```
Here is a script to generate it:

```python
import itertools
from onnx import onnx_cpp2py_export

operators_with_perm = {}
all_schemas = [op for op in onnx_cpp2py_export.defs.get_all_schemas_with_history()]

for operator in all_schemas:
    # collect the allowed types of every constraint and build all permutations
    constraint_types = [c.allowed_type_strs for c in operator.type_constraints]
    key = operator.domain + " " + operator.name + str(operator.since_version)
    operators_with_perm[key] = list(itertools.product(*constraint_types))

problematic_ops = {k: v for (k, v) in operators_with_perm.items() if len(v) > 15}
for key, value in problematic_ops.items():
    print(key, len(value))
```
Fewer than 15 combinations should be fine, so here are the operators we have to pay attention to: `OneHot`, `SequenceAt11`, `SequenceInsert11`, `ConcatFromSequence11`, `SplitToSequence11`, `SequenceConstruct11`, among others.
> Some types (like `int32`, `int16`, ...) are stored in the same variable, so if we have `int8`, we would need to read chunks of 8 bits rather than the whole 32. Just FYI.
This would be no problem, since we match the types through the `DATA` macro: if we use a `short` aka `int16`, any access through `DATA(...)[i]` should be type aligned.
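For illustration, one way the `DATA` macro could match the narrow types, assuming the values are packed contiguously in the shared `int32_data` buffer (an assumption this thread does not settle):

```c
#include <stdint.h>

/* Sketch: map each logical type to a correctly typed pointer, so that
 * DATA(T,TENSOR)[i] indexes with the right element width. Assumes narrow
 * values are stored back-to-back inside the int32_data buffer. */
#define DATA(T,TENSOR)     _DATA(T,TENSOR)
#define _DATA(T,TENSOR)    DATA_##T(TENSOR)

#define DATA_float(TENSOR) ((TENSOR).float_data)
#define DATA_int32(TENSOR) ((TENSOR).int32_data)
#define DATA_int16(TENSOR) ((int16_t *)(TENSOR).int32_data)
#define DATA_int8(TENSOR)  ((int8_t  *)(TENSOR).int32_data)
```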
> There is one thing I noticed in some operators (like the above mentioned `OneHot`): some operators have a list of indices as input (`indices` for `OneHot`). We could just "cast" that input to `int64`, which would save a lot of permutations. However, this has to be done manually as a specific case in the Python code. Actually, this operator also has a `depth` input, which is a "Scalar specifying the number of classes in one-hot tensor"; we could also just cast it to `int64`.
I think #40 is related here.
If we replace the resolver with the typeless executer, we essentially "decouple" the autogenerated code from the set generation by having the typeless operator as a hand-coded "proxy".
In the typeless executer we could prepare the input for a more generic implementation (like casting all indices to `int64`).
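For illustration, a sketch of such a typeless proxy for OneHot; `cast_tensor_to_int64`, `resolve_and_execute` and the tensor fields are hypothetical names:

```c
/* Sketch: hand-coded typeless proxy that normalizes the inputs before
 * dispatching, so only the int64 permutations need to exist. */
operator_status typeless_operator_onehot(node_context *ctx)
{
    tensor *indices = ctx->inputs[0];
    tensor *depth   = ctx->inputs[1];

    /* widen indices and depth to int64 (hypothetical helper that
     * converts the tensor's data in place or into a copy) */
    if (indices->data_type != TENSOR_TYPE_INT64)
        cast_tensor_to_int64(indices);
    if (depth->data_type != TENSOR_TYPE_INT64)
        cast_tensor_to_int64(depth);

    /* now only operator_onehot_int64_int64_<T3> variants are needed */
    return resolve_and_execute(ctx);
}
```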
I could implement a new filter which stops the generator from writing specific files or permutations, so we only generate a subset of type permutations like `operator_onehot_int64_int64_float`, `operator_onehot_int64_int64_double`, etc.
> I could implement a new filter which stops the generator from writing specific files or permutations, so we only generate a subset of type permutations like `operator_onehot_int64_int64_float`, `operator_onehot_int64_int64_double`, etc.
Yep, I had something like this in mind. I think we can have some kind of table with "custom rules" to limit the permutations for a specific operator. For example, if we know that the `OneHot` operator's first two inputs can be constrained to just `tensor(int64)`, we could define it with something like this:
```python
# custom_rules.py
custom_rules = [
    {"operator": "OneHot",
     "version": [],
     "constrained_inputs": [0, 1],
     "constrained_types": [["tensor(int64)"], ["tensor(int64)"]]},
    # More rules...
    {"operator": "xx", "version": [], "constrained_inputs": [], "constrained_types": []},
]
```
This table would be used by the Python generator to limit the permutations that are generated. If someone in the future desperately needs support for one specific combination, they can just modify the rule and implement the newly generated combinations.
Some autogenerated resolvers look too complex, e.g. `resolve_operator__onnx__maxpool__12`. We don't need that many cases, just the 5 types allowed by `T`: tensor(float16), tensor(float), tensor(double), tensor(int8), tensor(uint8).
Action to take: rethink the `OperatorTypeResolver.py` script. @nopeslide
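For illustration, a sketch of what the slimmed-down resolver could look like; the typedef, enum values and `execute_maxpool_12_*` names are placeholders, while the five types come from MaxPool-12's `T` constraint quoted above:

```c
#include <stddef.h>

/* placeholder typedef for a constrained executer function pointer */
typedef operator_status (*executer_ptr)(node_context *);

/* Sketch: MaxPool-12's T constraint allows exactly five tensor types,
 * so five cases suffice. */
executer_ptr resolve_operator__onnx__maxpool__12(node_context *ctx)
{
    switch (ctx->inputs[0]->data_type) {
    case TENSOR_TYPE_FLOAT16: return execute_maxpool_12_float16;
    case TENSOR_TYPE_FLOAT:   return execute_maxpool_12_float;
    case TENSOR_TYPE_DOUBLE:  return execute_maxpool_12_double;
    case TENSOR_TYPE_INT8:    return execute_maxpool_12_int8;
    case TENSOR_TYPE_UINT8:   return execute_maxpool_12_uint8;
    default:                  return NULL;   /* unsupported type */
    }
}
```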