ONNX Resize issue

MicheleCancilla commented 3 years ago

I tried a simpler model than the one in #295, but I get an error on the ONNX Resize operator. What I changed:

- removed grouped convolutions
- replaced bilinear upsampling (not supported by EDDL) with nearest-neighbor upsampling

The error occurs on Resize_100 when parsing the target shape:

if (node->input_size() > 3) // Get the new shape directly from input(3)
{
    ...
}
else // Compute new shape from scale values
{
    string scales_name = node->input(2);
    ...
}

Testing code:

#include <cstdlib>
#include <string>

#include "eddl/apis/eddl.h"
#include "eddl/serialization/onnx/eddl_onnx.h"

using namespace eddl;
using namespace std;

int main(int argc, char** argv)
{
    int epochs = 2;
    string path("DeepLabV3_resnet18_simpl.onnx");

    // Import the simplified DeepLabV3 (ResNet-18) model with a 3x224x224 input
    Net* net = import_net_from_onnx_file(path, { 3, 224, 224 });

    // Build model
    build(net,
        adam(0.0001),              // Optimizer
        { "bce" },                 // Losses
        { "mse" },                 // Metrics
        CS_GPU({ 1 }, "low_mem"),  // Computing service (GPU, low-memory mode)
        false                      // Do not initialize the weights randomly (keep the imported ones)
    );

    summary(net);

    // Random tensors standing in for the training data
    Tensor* x_train = Tensor::randn({ 128, 3, 224, 224 });
    Tensor* y_train = Tensor::randn({ 128, 1, 224, 224 });

    // Train for a few epochs with batch size 1
    fit(net, { x_train }, { y_train }, 1, epochs);

    return EXIT_SUCCESS;
}

ONNX file: download

chavicoski commented 3 years ago

Hi, I was able to run the testing code that you provided without any problem (on the develop branch). You are using develop, right?

MicheleCancilla commented 3 years ago

Yes, I'm on develop (latest commit), but on Visual Studio.

The exception states [libprotobuf FATAL D:\deephealth\eddl_dev\build_win\cmake\third_party\protobuf-src\src\google/protobuf/repeated_field.h:1537] CHECK failed: (index) < (current_size_): and it occurs when string scales_name = node->input(2); is executed, because node->input_size() returns 2, so index 2 is out of range.

chavicoski commented 3 years ago

That is strange. The Resize operator MUST be given either the "scales" or the "sizes" input to compute the target shape. The "scales" input is at position 2 and the "sizes" input is at position 3, because position 0 is for the input data and position 1 is for an optional input named "roi".
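The input layout described above can be sketched as a small bounds-checked lookup. This is only an illustration in plain Python, not EDDL's importer code: the function name `resize_target_input` is hypothetical, and the node is modeled as a list of input names in which an empty string stands for an omitted optional input.

```python
from typing import List, Tuple


def resize_target_input(inputs: List[str]) -> Tuple[str, str]:
    """Return (kind, tensor_name) for the input that defines the Resize target shape.

    Positions: 0 = input data, 1 = roi, 2 = scales, 3 = sizes.
    An empty string marks an omitted optional input.
    """
    # Prefer "sizes" (position 3) when it is present and non-empty.
    if len(inputs) > 3 and inputs[3]:
        return ("sizes", inputs[3])
    # Otherwise fall back to "scales" (position 2), checking the bound first
    # instead of indexing blindly.
    if len(inputs) > 2 and inputs[2]:
        return ("scales", inputs[2])
    raise ValueError(
        f"Resize node has {len(inputs)} input(s) and provides neither 'scales' nor 'sizes'"
    )


# Example: four inputs with "roi" and "scales" left empty.
print(resize_target_input(["352", "", "", "364"]))
```

With a check like this, a node that reports only two inputs produces a readable error rather than an out-of-range access.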

Looking at the Resize_100 node in ONNX runtime, it has 4 inputs, but the second and the third are empty. That should give an input_size of 4.

What is strange is that, when debugging, I also see input_size() = 2, but accessing input(2) doesn't give me an error. I checked the inputs at each position using the input() function and found that the node really has 3 inputs (the error is raised when calling input(3)), where the second (input(1)) and the third (input(2)) point to the same data (with the reference 364 shown in the image).

I have also seen that when loading the model with the python onnx library and displaying its content, the resize node should have 4 inputs '(%352, %, %, %364)': %367 = Resize[coordinate_transformation_mode = 'asymmetric', cubic_coeff_a = -0.75, mode = 'nearest', nearest_mode = 'floor'](%352, %, %, %364)

It seems that something goes wrong when the protobuf library loads the ONNX file.

The node Resize_115 works, and it uses the "scales" input, not the "sizes" input like Resize_100. As a workaround, you could try creating both layers the same way so that both nodes get the same structure.

MicheleCancilla commented 3 years ago

I found out what was wrong. I set the opset_version=13 for the ONNX export, but changing it to 12 solved the problem.

torch.onnx.export(model, dummy_input,
                  fname,
                  verbose=False,
                  export_params=True,
                  training=torch.onnx.TrainingMode.TRAINING,
                  opset_version=13,  # <-- ERROR (12 works)
                  keep_initializers_as_inputs=True,
                  do_constant_folding=False,
                  input_names=['input'],
                  output_names=['output'],
                  )

Which ONNX opset is currently implemented/supported in EDDL? I did not find this information in the docs.

FYI, that "bugged" ONNX file also throws the exception on Linux during debugging: g++-9, CUDA 11.0, cuDNN=ON, superbuild=ON

chavicoski commented 3 years ago

OK. In the previous version of the Resize operator (11), the "roi" and "scales" inputs are not optional, so the number of inputs should be correct in that case.

Currently we don't have a fixed supported version. We implemented the import functions to match several versions, because in most cases the same implementation is compatible with all of them: the changes between versions don't affect the functionality that EDDL can import. At the moment, most of the operators support versions 11, 12 and 13. In layers_onnx.h we have annotated the versions supported by each operator implementation.

We plan to handle opsets strictly: check that the version is compatible before importing, and choose the version dynamically when exporting according to the operators used (right now we always export with version 11), since the lower the version, the easier it is to stay compatible with other libraries.
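As a rough illustration of that plan (not EDDL code), a strict check could look like the sketch below. The `SUPPORTED` table, its contents, and both function names are hypothetical; the real per-operator versions are the ones annotated in layers_onnx.h, and the table here simplifies ONNX operator versioning to a set of opset numbers per operator.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical table: operator type -> opset versions the importer handles.
# The values are illustrative only and do not describe EDDL's actual support.
SUPPORTED: Dict[str, Set[int]] = {
    "Conv": {11, 12, 13},
    "Relu": {11, 12, 13},
    "Resize": {11, 12},
}


def unsupported_ops(model_opset: int, ops_used: Iterable[str],
                    supported: Dict[str, Set[int]] = SUPPORTED) -> List[str]:
    """Return the sorted operator types that cannot be imported at model_opset."""
    return sorted(
        op for op in set(ops_used)
        if model_opset not in supported.get(op, set())
    )


def lowest_export_opset(ops_used: Iterable[str],
                        supported: Dict[str, Set[int]] = SUPPORTED) -> int:
    """Return the lowest opset version that every used operator supports."""
    common = None
    for op in set(ops_used):
        versions = supported.get(op, set())
        common = set(versions) if common is None else common & versions
    if not common:
        raise ValueError("no single opset version covers all operators used")
    return min(common)


# Import side: reject the model early and name the offending operators.
print(unsupported_ops(13, ["Conv", "Relu", "Resize"]))
# Export side: pick the lowest version shared by all operators used.
print(lowest_export_opset(["Conv", "Resize"]))
```

Checking before import turns a low-level protobuf failure into a message that names the operator and the opset, and picking the lowest common version on export follows the compatibility argument above.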

MicheleCancilla commented 3 years ago

Many thanks @chavicoski

deephealthproject / eddl

ONNX Resize issue #297