deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

ONNX Resize issue #297

Closed MicheleCancilla closed 3 years ago

MicheleCancilla commented 3 years ago

I tried a simpler model than the one in #295, but I got an error on the ONNX Resize operator. What I've done:

The error occurs on Resize_100 when parsing the target shape:

if (node->input_size() > 3) // Get the new shape directly from input(3)
  {
    ...
else // Compute new shape from scale values
  {
    string scales_name = node->input(2);
    ...

Testing code:

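// Header paths assumed from the EDDL source tree layout
#include <cstdlib>
#include <string>

#include "eddl/apis/eddl.h"
#include "eddl/serialization/onnx/eddl_onnx.h"

using namespace eddl;
using namespace std;

int main(int argc, char** argv)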
{
    int epochs = 2;
    string path("DeepLabV3_resnet18_simpl.onnx");

    // Import the simplified DeepLabV3 (resnet18) model with a 3x224x224 input
    Net* net = import_net_from_onnx_file(path, { 3, 224, 224 });

    // Build model
    build(net,
        adam(0.0001),              // Optimizer
        { "bce" }, // Losses
        { "mse" },  // Metrics
        CS_GPU({ 1 }, "low_mem"),  // Computing service (CPU or GPU)
        false                      // Do not initialize the weights randomly (keep the imported ones)
    );

    summary(net);

    // Load training data
    Tensor* x_train = Tensor::randn({ 128, 3, 224, 224 });
    Tensor* y_train = Tensor::randn({ 128, 1, 224, 224 });

    // Train for a few epochs
    fit(net, { x_train }, { y_train }, 1, epochs);

    return EXIT_SUCCESS;
}

ONNX file: download

chavicoski commented 3 years ago

Hi, I was able to run the testing code that you provided without any problem (in the develop branch). You are using develop, right?

MicheleCancilla commented 3 years ago

Yes, I'm on develop (last commit) but on Visual Studio.

The exception reads [libprotobuf FATAL D:\deephealth\eddl_dev\build_win\cmake\third_party\protobuf-src\src\google/protobuf/repeated_field.h:1537] CHECK failed: (index) < (current_size_): and is raised when string scales_name = node->input(2); is executed, since node->input_size() returns 2.

chavicoski commented 3 years ago

That is strange. The Resize operator MUST provide either the "scales" or the "sizes" input to compute the target shape. The "scales" input is at position 2 and the "sizes" input is at position 3, because position 0 is for the input data and position 1 is for an optional input named "roi".
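The input positions described above can be sketched in a few lines of Python. This is only an illustration, not EDDL's import code: the function name `find_resize_target` is hypothetical, and it assumes that an omitted optional input is represented by an empty string. Every index is checked against the length of the input list before it is read, so a short list produces a readable error instead of an out-of-range access.

```python
from typing import Sequence, Tuple

# Input positions of the ONNX Resize operator, as described in the thread.
SCALES_INDEX = 2
SIZES_INDEX = 3


def find_resize_target(input_names: Sequence[str]) -> Tuple[str, str]:
    """Return ("sizes" or "scales", tensor_name) for a Resize node.

    Hypothetical helper: an empty string is assumed to mark an omitted
    optional input, and an index past the end counts as omitted too.
    """

    def name_at(index: int) -> str:
        # Guard the access instead of indexing blindly.
        return input_names[index] if index < len(input_names) else ""

    if name_at(SIZES_INDEX):
        return "sizes", name_at(SIZES_INDEX)
    if name_at(SCALES_INDEX):
        return "scales", name_at(SCALES_INDEX)
    raise ValueError(
        f"Resize node with {len(input_names)} input(s) provides neither "
        "'scales' nor 'sizes'"
    )
```

With the four inputs reported for Resize_100, `find_resize_target(["352", "", "", "364"])` selects the "sizes" input, while a two-element list raises `ValueError`.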

Looking at the Resize_100 node in ONNX runtime, it has 4 inputs, but the second and the third are empty. That should give an input_size of 4. [image]

What is strange is that, while debugging, I also see input_size() = 2, but accessing input(2) doesn't give me an error. I checked the inputs at each position using the input() function and found that the node really has 3 inputs (the error is raised when accessing input(3)), where the second (input(1)) and the third (input(2)) point to the same data (with the reference 364 shown in the image).

I have also seen that when loading the model with the python onnx library and displaying its content, the resize node should have 4 inputs '(%352, %, %, %364)': %367 = Resize[coordinate_transformation_mode = 'asymmetric', cubic_coeff_a = -0.75, mode = 'nearest', nearest_mode = 'floor'](%352, %, %, %364)

It seems that something goes wrong when the protobuf library loads the ONNX file.

The node Resize_115 works, and it uses the "scales" input rather than the "sizes" input used by Resize_100. As a workaround, you could try creating both layers the same way so that they get the same node structure.
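For readers unfamiliar with the two inputs, here is a minimal sketch of how each one determines the target shape. It is not EDDL's implementation: `resize_output_shape` is a hypothetical name, and the rule used for "scales" (floor of dimension times scale) follows the ONNX Resize specification under the assumption that "roi" keeps its default value.

```python
import math
from typing import List, Optional, Sequence


def resize_output_shape(
    input_shape: Sequence[int],
    scales: Optional[Sequence[float]] = None,
    sizes: Optional[Sequence[int]] = None,
) -> List[int]:
    """Compute the Resize target shape from exactly one of scales/sizes."""
    if (scales is None) == (sizes is None):
        raise ValueError("provide exactly one of 'scales' or 'sizes'")

    if sizes is not None:
        # "sizes" states the target shape directly.
        if len(sizes) != len(input_shape):
            raise ValueError("'sizes' must have one entry per input dimension")
        return [int(s) for s in sizes]

    # "scales" multiplies each input dimension; the result is floored.
    if len(scales) != len(input_shape):
        raise ValueError("'scales' must have one entry per input dimension")
    return [math.floor(d * s) for d, s in zip(input_shape, scales)]
```

For example, an input of shape `[1, 256, 28, 28]` reaches `[1, 256, 224, 224]` either with `scales=[1.0, 1.0, 8.0, 8.0]` or with `sizes=[1, 256, 224, 224]`, which is why the two nodes can end up with different input lists for the same result.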

MicheleCancilla commented 3 years ago

I found out what was wrong. I set opset_version=13 for the ONNX export; changing it to 12 solved the problem.

 torch.onnx.export(model, dummy_input,
                  fname,
                  verbose=False,
                  export_params=True,
                  training=torch.onnx.TrainingMode.TRAINING,
                  opset_version=13, # <-- ERROR
                  keep_initializers_as_inputs=True,
                  do_constant_folding=False,
                  input_names=['input'],
                  output_names=['output'],
                  )

Which ONNX opset is currently implemented/supported in EDDL? I did not find this information in the docs.

FYI, that "bugged" ONNX file also throws the exception on Linux when debugging: g++-9, cuda 11.0, cuDNN=ON, superbuild=ON

chavicoski commented 3 years ago

OK, in the previous version of the Resize node (11), the "roi" and "scales" inputs are not optional, so the input count should be correct in that case.

Currently we don't have a fixed supported version. We implemented the import functions to try to match several versions, since in most cases the same implementation is compatible with several versions because the changes between versions don't affect the functionality that EDDL can import. Currently, most of the operators support versions 11, 12 and 13. In layers_onnx.h we have annotated the versions supported by each operator implementation.

We plan to implement opsets strictly: checking whether the version is compatible before importing, and dynamically choosing the version when exporting according to the operators used (for now we always export with version 11), since the lower the version, the easier it is to be compatible with other libraries.
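A strict pre-import check of this kind might look like the sketch below. It is purely illustrative: the `SUPPORTED_OPSETS` table and the `unsupported_ops` function are hypothetical, and the version sets are placeholder values chosen for the example, not the annotations from layers_onnx.h.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical support table with placeholder values (NOT EDDL's real data).
SUPPORTED_OPSETS: Dict[str, FrozenSet[int]] = {
    "Conv": frozenset({11, 12, 13}),
    "Relu": frozenset({11, 12, 13}),
    "Resize": frozenset({11, 12}),
}


def unsupported_ops(op_types: Iterable[str], opset: int) -> List[str]:
    """Return the sorted operator types the table does not cover at `opset`.

    Operators missing from the table are reported as unsupported.
    """
    return sorted(
        {
            op
            for op in op_types
            if opset not in SUPPORTED_OPSETS.get(op, frozenset())
        }
    )
```

Under these placeholder values, `unsupported_ops(["Conv", "Resize", "Relu"], 13)` reports `["Resize"]` before any node is parsed, while the same model at opset 12 yields an empty list.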

MicheleCancilla commented 3 years ago

Many thanks @chavicoski