deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

Same model predicts different outputs on CPU and GPU #263

Closed. stal12 closed this issue 3 years ago.

stal12 commented 3 years ago

Describe the bug: The same model, loaded from ONNX and fed the same input data, gives different outputs on CPU and GPU.

To reproduce: Run this simple main, which is a modification of 3_onnx_import_net_from_file.cpp. It builds a network from an ONNX model of ResNet18, predicts the output for a single image, and prints the result. Swap which of the CS_GPU and CS_CPU lines is commented out in the build call to run on the other device.

#include "eddl/apis/eddl.h"

using namespace eddl;

int main(int argc, char **argv) { 
    // Download cifar
    download_cifar10();

    Net* net=download_resnet18(true,{3, 32, 32});  

    // Build model
    build(net,
          adam(0.001), // Optimizer
          {"softmax_cross_entropy"}, // Losses
          {"categorical_accuracy"}, // Metrics
          //CS_GPU({1}), // one GPU
          CS_CPU(), // CPU with the maximum number of threads available
          false       // Do not initialize the weights to random values
    );

    // Load and preprocess training data
    Tensor* x_train = Tensor::load("cifar_trX.bin");
    x_train->div_(255.0f);

    Tensor* input = x_train->select({ "0" }); // Take the first image

    set_mode(net, 0); // 0 = test (inference) mode
    auto output = predict(net, { input });

    output[0]->print();

    return 0;
}

Output tensor is:

GPU: [0.289209 0.283272 0.395361 0.293833 0.259562 0.157258 0.289797 0.252943 ... ]
CPU: [4.160687 0.000000 0.000000 0.000000 0.000000 0.491386 0.000000 0.181782 ... ]

Expected behavior: The output tensor should be the same regardless of the computing service.
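
For reference, a quick way to quantify the mismatch is to run the prediction on both devices and compare the tensors element-wise. The helper below is only a sketch to add to the main above: outputs_match is a hypothetical name, the tolerance is arbitrary, and it assumes the Tensor fields ptr and size are accessible as the raw buffer and element count.

#include <cmath>
#include <cstdio>

// Sketch only (hypothetical helper, not part of the EDDL API):
// element-wise comparison of two output tensors, assuming Tensor exposes
// the raw buffer `ptr` and the element count `size`.
bool outputs_match(Tensor* a, Tensor* b, float tol = 1e-4f) {
    if (a->size != b->size) return false;
    for (long i = 0; i < (long)a->size; ++i) {
        if (std::fabs(a->ptr[i] - b->ptr[i]) > tol) {
            printf("Mismatch at %ld: %f vs %f\n", i, a->ptr[i], b->ptr[i]);
            return false;
        }
    }
    return true;
}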

RParedesPalacios commented 3 years ago

The problem with that import is that you remove the top layers of the ResNet, so you are expected to attach a new head afterwards:

layer l = getLayer(net, "top");
layer out = Softmax(Dense(l, 10, true, "newdense")); // true enables the bias

// Create a new model from the input and output layers
layer in = getLayer(net, "input");

net = Model({in}, {out});

This appears in the example; you then initialise the "newdense" layer afterwards.

You can also try "false" instead of "true" to avoid removing the top of the network.

Try these options. In any case, we should clarify this, or add a check in the build method for when the output is not properly established.
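
For completeness, the full flow might look roughly like the sketch below. It is not the exact example code: the layer names "top" and "input" and the use of initializeLayer for the new layer are assumptions taken from the snippet above and the ONNX examples.

// Sketch: import ResNet18 without its original top, attach a new head, then build.
Net* net = download_resnet18(true, {3, 32, 32});     // true: the original top layers are removed

layer l   = getLayer(net, "top");                    // assumed name of the last remaining layer
layer out = Softmax(Dense(l, 10, true, "newdense")); // new 10-class head, bias enabled

layer in = getLayer(net, "input");                   // assumed name of the input layer
net = Model({in}, {out});                            // create a new model from input/output

build(net,
      adam(0.001),
      {"softmax_cross_entropy"},
      {"categorical_accuracy"},
      CS_CPU(),
      false);                                        // keep the imported weights

initializeLayer(net, "newdense");                    // only the new layer gets fresh weights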

RParedesPalacios commented 3 years ago

removeLayer removes the layer from the outputs as well, so build should generate an error message, since the output layers don't match the list of losses and metrics. In any case, "true" is for removing the last layers, while "false" is for preserving the original imported net.
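
A minimal sketch of the kind of pre-build sanity check this suggests is shown below. It is hypothetical code, not EDDL's implementation; it assumes Net exposes its output layers as lout, and check_outputs is an illustrative name.

#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical pre-build check (illustrative only): one loss and one metric per output layer.
// Assumes Net exposes its output layers as `lout`.
void check_outputs(Net* net, const std::vector<std::string>& losses,
                   const std::vector<std::string>& metrics) {
    if (net->lout.size() != losses.size() || net->lout.size() != metrics.size()) {
        throw std::runtime_error("Number of output layers does not match the losses/metrics provided");
    }
}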

stal12 commented 3 years ago

Same code as before, but without removing the last layer of ResNet. Of course predicting 1000 classes on cifar10 doesn't make sense, but the output should not change between CPU and GPU.

#include "eddl/apis/eddl.h"

using namespace eddl;

int main(int argc, char **argv) { 
    // Download cifar
    download_cifar10();

    Net* net=download_resnet18(false,{3, 32, 32});

    // Build model
    build(net,
          adam(0.001), // Optimizer
          {"softmax_cross_entropy"}, // Losses
          {"categorical_accuracy"}, // Metrics
          CS_GPU({1}), // one GPU
          //CS_CPU(), // CPU with the maximum number of threads available
          false       // Do not initialize the weights to random values
    );

    // Load and preprocess training data
    Tensor* x_train = Tensor::load("cifar_trX.bin");
    x_train->div_(255.0f);

    Tensor* input = x_train->select({ "0" }); // Take the first image

    set_mode(net, 0); // 0 = test (inference) mode
    auto output = predict(net, { input });

    output[0]->print();

    return 0;
}

GPU: [-0.294621 -0.227132 -0.264977 -0.534214 -0.086534 -0.003729 ...]
CPU: [-2.110106 -1.483234 -1.964519 -2.233162 -1.505477 -0.294700 ...]

RParedesPalacios commented 3 years ago

OK, we are checking.

RParedesPalacios commented 3 years ago

Solved