alrevuelta / cONNXr

Pure C ONNX runtime with zero dependencies for embedded devices
MIT License

Problem with generated resolvers (2/2) #25

Open alrevuelta opened 4 years ago

alrevuelta commented 4 years ago

Some autogenerated resolvers look too complex, e.g. resolve_operator__onnx__maxpool__12. We don't need that many cases, just these 5: tensor(float16), tensor(float), tensor(double), tensor(int8), tensor(uint8).

Action to take: Rethink the OperatorTypeResolver.py script. @nopeslide

switch ( T ) {
        case 0: //constrained tensor is not set (maybe optional?), just take next case
        case ONNX__TENSOR_PROTO__DATA_TYPE__DOUBLE: { switch ( I ) {
            case 0: //constrained tensor is not set (maybe optional?), just take next case
            case ONNX__TENSOR_PROTO__DATA_TYPE__INT64: { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_double__I_tensor_int64; break; }
            default: {
                fprintf(stderr, "no matching type for constraint 'I' found!\n");
                break;
            }
        } break; }
        case ONNX__TENSOR_PROTO__DATA_TYPE__FLOAT: { switch ( I ) {
            case 0: //constrained tensor is not set (maybe optional?), just take next case
            case ONNX__TENSOR_PROTO__DATA_TYPE__INT64: { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_float__I_tensor_int64; break; }
            default: {
                fprintf(stderr, "no matching type for constraint 'I' found!\n");
                break;
            }
        } break; }
        case ONNX__TENSOR_PROTO__DATA_TYPE__FLOAT16: { switch ( I ) {
            case 0: //constrained tensor is not set (maybe optional?), just take next case
            case ONNX__TENSOR_PROTO__DATA_TYPE__INT64: { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_float16__I_tensor_int64; break; }
            default: {
                fprintf(stderr, "no matching type for constraint 'I' found!\n");
                break;
            }
        } break; }
        case ONNX__TENSOR_PROTO__DATA_TYPE__INT8: { switch ( I ) {
            case 0: //constrained tensor is not set (maybe optional?), just take next case
            case ONNX__TENSOR_PROTO__DATA_TYPE__INT64: { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_int8__I_tensor_int64; break; }
            default: {
                fprintf(stderr, "no matching type for constraint 'I' found!\n");
                break;
            }
        } break; }
        case ONNX__TENSOR_PROTO__DATA_TYPE__UINT8: { switch ( I ) {
            case 0: //constrained tensor is not set (maybe optional?), just take next case
            case ONNX__TENSOR_PROTO__DATA_TYPE__INT64: { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_uint8__I_tensor_int64; break; }
            default: {
                fprintf(stderr, "no matching type for constraint 'I' found!\n");
                break;
            }
        } break; }
        default: {
            fprintf(stderr, "no matching type for constraint 'T' found!\n");
            break;
        }
}
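
Something along these lines should be enough, since I only ever allows tensor(int64); a sketch of the resolver shape I have in mind:

switch ( T ) {
    case ONNX__TENSOR_PROTO__DATA_TYPE__FLOAT16: { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_float16__I_tensor_int64; break; }
    case ONNX__TENSOR_PROTO__DATA_TYPE__FLOAT:   { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_float__I_tensor_int64; break; }
    case ONNX__TENSOR_PROTO__DATA_TYPE__DOUBLE:  { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_double__I_tensor_int64; break; }
    case ONNX__TENSOR_PROTO__DATA_TYPE__INT8:    { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_int8__I_tensor_int64; break; }
    case ONNX__TENSOR_PROTO__DATA_TYPE__UINT8:   { executer = (operator_executer) &operator__onnx__maxpool__12__T_tensor_uint8__I_tensor_int64; break; }
    default: {
        fprintf(stderr, "no matching type for constraint 'T' found!\n");
        break;
    }
}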
nopeslide commented 4 years ago

@alrevuelta the maxpool operator has two constraints (T & I), therefore both must be resolved. Why a constraint even exists for a single type is beyond me. I could add a check that skips a constraint if it only features a single type, but this would add complexity just for these corner cases.

alrevuelta commented 4 years ago

@nopeslide Right, this is a specific case since constraint I is always int64. Having a look at the operators, this happens in quite a few of them, so I think we can do something like: if there are just two constraints and one allows just one type, simply "ignore" it.

On the other hand, maybe we could resolve ONLY according to the input type, and handle the output type inside each function.

I'm suggesting this because operators like EyeLike have many combinations (with the current implementation this would mean 12 switch cases with 12 cases each = 144 combinations). I don't think we really need 144 functions. We can have just 12 (one per input constraint) and handle the output inside each function with a simple switch; see the sketch after the constraint list below.

This would also make the resolving process easier, because knowing the output type of a tensor in the resolving phase is not trivial: when we resolve the operators/functions, we don't really know what the output type is going to be (i.e. we don't check the values of the attributes).

T1 : tensor(float16), tensor(float), tensor(double), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(bool)
Constrain input types. Strings and complex are not supported.
T2 : tensor(float16), tensor(float), tensor(double), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(bool)
Constrain output types. Strings and complex are not supported.
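
To illustrate the idea, here is a rough sketch (the function and helper names are placeholders, not the generated ones): EyeLike would get 12 functions shaped like this, one per T1 type, instead of 144:

/* Hypothetical: resolved on the input constraint (T1) only; the output
 * constraint (T2) is handled by an ordinary switch inside the function. */
operator_status operator__onnx__eyelike__9__T1_tensor_float(node_context *ctx) {
    switch (output_data_type(ctx)) { /* assumed helper returning the T2 type enum */
        case ONNX__TENSOR_PROTO__DATA_TYPE__FLOAT:
            /* fill the diagonal with float ones */
            break;
        case ONNX__TENSOR_PROTO__DATA_TYPE__INT32:
            /* fill the diagonal with int32 ones */
            break;
        /* ...the remaining 10 T2 cases... */
        default:
            fprintf(stderr, "no matching type for constraint 'T2' found!\n");
            break;
    }
    return OPERATOR_STATUS_OK; /* assumed status value */
}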
alrevuelta commented 4 years ago

I'm implementing a new model that contains the Constant operator, which looks like an interesting case.

This operator doesn't have any input, just one attribute and an output. The attribute that is set determines the output type, so the resolving should be done according to the attributes and not to the input type. Just FYI.
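
A rough sketch of what attribute-based resolving could look like for Constant, assuming the protobuf-c structs generated from onnx.proto (the value attribute carries a TensorProto whose data_type gives the output type):

#include <string.h>
#include "onnx.pb-c.h"

/* Hypothetical: Constant has no inputs, so the output type has to be read
 * from the "value" attribute rather than from an input tensor. */
int32_t constant_output_type(Onnx__NodeProto *node) {
    for (size_t i = 0; i < node->n_attribute; i++) {
        Onnx__AttributeProto *attr = node->attribute[i];
        if (strcmp(attr->name, "value") == 0 && attr->t != NULL)
            return attr->t->data_type; /* type carried by the attribute's tensor */
    }
    return 0; /* UNDEFINED: attribute not set */
}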

nopeslide commented 4 years ago

@alrevuelta Leaving out the output constraints is a good idea. If we do so, the resolving needs to happen on demand. We could start by resolving each time the operator is called and later come up with a way to save the resolved function. Iirc the unconstrained executer is a trampoline which resolves the operator and executes it. We could replace the resolver function pointer in the context with this executer, and let it later overwrite that same pointer in the context with the resolved, constrained executer.
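
Something like this sketch (the field and function names here are assumptions, not the current code):

/* Hypothetical trampoline: the context initially points at this executer.
 * The first call resolves the constrained executer, caches it in the
 * context, and runs it; every later call goes straight to the resolved
 * function. */
operator_status operator__onnx__maxpool__12__trampoline(node_context *ctx) {
    operator_executer resolved = resolve_operator__onnx__maxpool__12(ctx); /* assumed resolver signature */
    ctx->executer = resolved; /* assumed function-pointer field in node_context */
    return resolved(ctx);
}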

alrevuelta commented 4 years ago

@nopeslide so you are suggesting that instead of resolving directly to the final function (operator + version + float/double...) we resolve only operator + version? If so, that could be the solution, but I liked the idea of preresolving everything before running inference. So let's think a bit more about how we can avoid that.

Something related to this just came up. I'm especially concerned about this line in inference.c: all_context[nodeIdx].outputs[i]->data_type = 1;. Currently that's hardcoded for float, but we need to correctly populate that parameter in the resolving phase. In most cases it's fine to assume that the output type will be the same as the input, but for a few operators this is not the case. This sucks because for some operators (like Constant or Cast) we need to dig into the attributes to know what type the output is going to be. So either we handle these operators as special cases, or something else.
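
To make it concrete, a sketch of what populating data_type during resolving could look like (the helper names are hypothetical), with Cast and Constant as the special cases:

/* Hypothetical replacement for the hardcoded
 * all_context[nodeIdx].outputs[i]->data_type = 1; */
int32_t infer_output_type(Onnx__NodeProto *node, int32_t input0_type) {
    if (strcmp(node->op_type, "Cast") == 0)
        return cast_to_attribute(node);    /* assumed helper reading Cast's "to" attribute */
    if (strcmp(node->op_type, "Constant") == 0)
        return constant_output_type(node); /* see the Constant sketch above */
    return input0_type;                    /* common case: output type equals input type */
}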

nopeslide commented 4 years ago

@alrevuelta

If so, that could be the solution, but I liked the idea of preresolving everything before running inference. So let's think a bit more about how we can avoid that.

The only alternative I see would be a function per operator that returns the types, but imho that would be more complicated than doing it on demand.
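
i.e. something along these lines (just a sketch):

/* Hypothetical alternative: one hand-written type function per operator,
 * called during the pre-resolving phase to fill in the output data_types. */
typedef void (*operator_type_inferer)(node_context *ctx);
void operator__onnx__maxpool__12__infer_types(node_context *ctx); /* one per operator */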

alrevuelta commented 4 years ago

Just found another interesting case, the OneHot operator, which has three type constraints (T1, T2 and T3).

This creates a huge number of permutations, resulting in a resolve_operator_onnx_onehot_11.c that is 2774 lines long, with a total of 1815 functions. I have no clue how we can solve this problem, but it doesn't feel good to have 1815 different functions for one operator.

I would suggest generating fewer permutations: instead of permuting T1, T2 and T3, we could permute only the two constraints that have more values (i.e. T2 and T3), or even just T3. Then we can handle the remaining cases inside each function with a switch.

Having that many combinations also makes it difficult to generate the operator status #44.

There are some other operators with the same issue: SequenceInsert and SplitToSequence, to name a few.

nopeslide commented 4 years ago

Maybe we need another approach? Since the algorithm should not change and only the types change, how about we create a template implementation that uses preprocessor defines to choose the arithmetic operators, like this: arith.h

// Typed arithmetic helpers; the generator provides one per needed combination.
inline int add_int_int(int a, int b) { return a + b; }
inline float add_int_float(int a, float b) { return ((float)a) + b; }

// Two-level macros so that TA/TB/T are expanded to the #define'd types
// before token pasting happens.
#define ADD(TA,A,TB,B) _ADD(TA,A,TB,B)
#define _ADD(TA,A,TB,B) add_##TA##_##TB(A,B)

#define DATA(T,TENSOR) _DATA(T,TENSOR)
#define _DATA(T,TENSOR) TENSOR.T##_data

operator_add.template

#include "arith.h"
operator_status operator_add( node_context *ctx) {
...
for(int i = 0; i < length; i++) {
  DATA(T3,output)[i] = ADD( T1, DATA(T1,input1)[i], T2, DATA(T2,input2)[i] );
}
...
}

and let the code generator generate the file operator_add_int_float.c

#define T1 int
#define T2 float
#define T3 float
#include "operator_add.template"

which would produce

inline int add_int_int(int a, int b) { return a + b; }
inline float add_int_float(int a, float b) { return ((float)a) + b; }
operator_status operator_add( node_context *ctx) {
...
for(int i = 0; i < length; i++) {
  output.float_data[i] = add_int_float(input1.int_data[i],input2.float_data[i]);
}
...
}

This does not solve the problem of having all these permutations, but it allows us to generate only the permutations that are needed. I don't see another way to implement these things except checking each type inside the executer, which is madness.

nopeslide commented 4 years ago

The code generator still does not support type subsets (ignoring specific types), which may also mitigate, but not solve, this problem.

alrevuelta commented 4 years ago

Nice idea, but I'm not sure I want to rely on more autogenerated code (even if it's done with macros). If we can cover all the cases it should be fine, but I like having the flexibility of modifying the code directly. I haven't really worked with types other than float, but note that this might make the macros not as straightforward.

Some types (int32, int16, ...) are stored in the same variable. So if we have int8, we would need to read chunks of 8 bits rather than the whole 32. Just FYI. Maybe we can even relate this to the Cast operator.

/*
   * For int32, uint8, int8, uint16, int16, bool, and float16 values
   * float16 values must be bit-wise converted to an uint16_t prior
   * to writing to the buffer.
   * When this field is present, the data_type field MUST be
   * INT32, INT16, INT8, UINT16, UINT8, BOOL, or FLOAT16
   */
  size_t n_int32_data;
  int32_t *int32_data;

There is one thing I noticed in some operators (like the above-mentioned OneHot): some operators have as an input a list of indexes (indices for OneHot). We could just "cast" that input to int64, so we can save a lot of permutations. However, this has to be done manually as a specific case in the Python code. Actually, this operator also has a depth input, which is a "Scalar specifying the number of classes in one-hot tensor". We could also just cast it to int64.

This is the thing I don't like about having so much autogenerated code. It's nice to take into account the particularities of each operator and be able to design solutions more ad hoc. This OneHot is the perfect example: implementing a function per indices and depth type is way overkill and maybe not even needed.

I ran some statistics with the operators, and these are the operators that have more than 15 combinations:

ai.onnx.preview.training Gradient1 45
 ConcatFromSequence11 225
 SequenceAt11 450
 SequenceInsert11 450
 ScatterElements11 30
 GatherElements11 30
 SplitToSequence11 450
 SequenceErase11 30
 Scatter9 30
 Scatter11 30
 OneHot9 1815
 OneHot11 1815
 EyeLike9 144
 RandomUniformLike1 45
 RandomNormalLike1 45
 Slice10 30
 Slice11 30
ai.onnx.ml DictVectorizer1 24
 Cast1 144
 Cast6 144
 Cast9 169
 Gather1 30
 Gather11 30
 Resize11 45
 SequenceConstruct11 225
 Pow12 55

Here is a script to generate it:

from onnx import onnx_cpp2py_export
import itertools

# For every operator schema, build the cartesian product of its type
# constraints and report those with more than 15 combinations.
operators_with_perm = {}
all_schemas = list(onnx_cpp2py_export.defs.get_all_schemas_with_history())
for operator in all_schemas:
    constraint_types = [c.allowed_type_strs for c in operator.type_constraints]
    key = operator.domain + " " + operator.name + str(operator.since_version)
    operators_with_perm[key] = list(itertools.product(*constraint_types))

problematic_ops = {key: value for (key, value) in operators_with_perm.items() if len(value) > 15}
for key, value in problematic_ops.items():
    print(key, len(value))

Fewer than 15 combinations should be fine, so these are the operators we have to pay attention to: OneHot, SequenceAt11, SequenceInsert11, ConcatFromSequence11, SplitToSequence11 and SequenceConstruct11, among others.

nopeslide commented 4 years ago

Some types (int32, int16, ...) are stored in the same variable. So if we have int8, we would need to read chunks of 8 bits rather than the whole 32. Just FYI.

This would be no problem, since we match the types through the DATA macro. If we use a short (aka int16), any access through DATA(...)[i] should be type aligned.

There is one thing I noticed in some operators (like the above-mentioned OneHot): some operators have as an input a list of indexes (indices for OneHot). We could just "cast" that input to int64, so we can save a lot of permutations. However, this has to be done manually as a specific case in the Python code. Actually, this operator also has a depth input, which is a "Scalar specifying the number of classes in one-hot tensor". We could also just cast it to int64.

I think #40 is related here. If we replace the resolver with the typeless executer, we essentially "decouple" the autogenerated code from the set generation by having the typeless operator as a hand-coded "proxy". In the typeless executer we could prepare the input for a more generic implementation (like casting all indices to int64). I could implement a new filter which stops the generator from writing specific files or permutations, so we just generate a subset of type permutations like operator_onehot_int64_int64_float, operator_onehot_int64_int64_double, etc.
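
As a sketch (helper names assumed), the hand-coded proxy would normalize the inputs and then dispatch to one of the few generated permutations:

/* Hypothetical hand-coded "proxy" executer for OneHot: indices and depth
 * are cast to int64 up front, so only the T3 (values) permutations need
 * to be generated. */
operator_status operator__onnx__onehot__11(node_context *ctx) {
    cast_input_to_int64(ctx, 0);     /* indices, assumed helper */
    cast_input_to_int64(ctx, 1);     /* depth, assumed helper */
    switch (values_data_type(ctx)) { /* assumed helper for the T3 type */
        case ONNX__TENSOR_PROTO__DATA_TYPE__FLOAT:
            return operator_onehot_int64_int64_float(ctx);
        case ONNX__TENSOR_PROTO__DATA_TYPE__DOUBLE:
            return operator_onehot_int64_int64_double(ctx);
        /* ...remaining T3 cases... */
        default:
            fprintf(stderr, "no matching type for constraint 'T3' found!\n");
            return OPERATOR_STATUS_ERROR; /* assumed status value */
    }
}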

alrevuelta commented 4 years ago

I could implement a new filter which stops the generator from writing specific files or permutations, so we just generate a subset of type permutations like operator_onehot_int64_int64_float, operator_onehot_int64_int64_double, etc.

Yep, I had something like this in mind. I think we can have some kind of table with "custom rules" to limit the permutations for a specific operator. So, for example, if we know that the OneHot operator's first two inputs can be constrained to just tensor(int64), we could define a rule like this:

# custom_rules.py

custom_rules = [
    {"operator": "OneHot",
     "version": [],
     "constrained_inputs": [0, 1],
     "constrained_types": [["tensor(int64)"], ["tensor(int64)"]]},

    # More rules...
    {"operator": "xx", "version": [], "constrained_inputs": [], "constrained_types": []},
]

This table would be used by the Python generator to limit the permutations that are generated. If someone in the future desperately needs support for one specific combination, they can just modify the rule and implement the newly generated combinations.