deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

Some logic functions don't work with HPC enabled #218

Closed salvacarrion closed 3 years ago

salvacarrion commented 3 years ago

Describe the bug Some logic functions don't work with HPC enabled

To Reproduce Steps to reproduce the behavior:

  1. Compile with -DBUILD_HPC=OFF
  2. Check the correctness of the function below.
  3. Compile with -DBUILD_HPC=ON
  4. Repeat the tests and see it fail

Snippet:

Tensor* t1 = new Tensor({12, INFINITY, NAN, -INFINITY, 0.0f, +INFINITY}, {2,3});
Tensor* t2 = t1->isfinite();
t2->print(2); 

Wrong result:

// [
// [1.00 1.00 1.00]
// [1.00 1.00 1.00]
// ]

Expected behavior Same values regardless of the flag

// [
// [1.00 0.00 0.00]
// [0.00 1.00 0.00]
// ]

Additional context When using "DEBUG" I don't notice this problem

salvacarrion commented 3 years ago

Can someone compile with these flags: -march=native -mtune=native -Ofast -msse -mfpmath=sse -ffast-math -ftree-vectorize to see which one causes the problem?

P.S.: I guess it could be -ffast-math

sanromra commented 3 years ago

All requested tests already performed.

Logic functions with unexpected behaviour are: isfinite, isinf, isnan, isposinf, isneginf

Unexpected behaviour occurs when EDDL is compiled with the -Ofast or -ffast-math flags, in both Release and Debug modes.

simleo commented 3 years ago

On Jenkins I've seen some unit tests fail with numerical comparison errors even when compiling with -D BUILD_HPC=OFF. E.g., in https://jenkins-master-deephealth-unix01.ing.unimore.it/job/DeepHealth-Docker/job/libs/133/consoleFull:

[ RUN      ] TensorTestSuite.tensor_linalg_norm
/usr/local/src/eddl/tests/tensor/test_tensor_linalg.cpp:79: Failure
The difference between t_cpu_norm and t_gpu_norm is 0.1007080078125, which exceeds 10e-2f, where
t_cpu_norm evaluates to 999.3154296875,
t_gpu_norm evaluates to 999.4161376953125, and
10e-2f evaluates to 0.10000000149011612.
[ RUN      ] NetTestSuite.losses_binary_cross_entropy

[ RUN      ] NetTestSuite.losses_binary_cross_entropy

>>>>>>>>>>
[values]        185.487015 != 185.488495
[diff/epsilon]  0.001480 > 0.001000
<<<<<<<<<<
/usr/local/src/eddl/tests/losses/test_losses.cpp:139: Failure
Value of: Tensor::equivalent(t_cpu_delta, t_gpu_delta, 10e-4)
  Actual: false
Expected: true
[  FAILED  ] NetTestSuite.losses_binary_cross_entropy (1 ms)
salvacarrion commented 3 years ago

I'm aware of that.

Some operations are more susceptible to numerical error than others. The deviation also tends to grow when results are compared across devices or computed on large tensors.

The easiest workaround is to disable the HPC flag when testing (we do it) and then increase the error margin for the picky ones: 0.001 => 0.01
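
As an illustration of why a fixed margin is fragile for large values, here is a minimal sketch of a comparison that accepts either an absolute or a relative tolerance. The approx_equal helper is hypothetical; it is not Tensor::equivalent, and whether EDDL's comparison supports a relative tolerance is not stated in this thread.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper (not part of EDDL): true when |a - b| is within the
// absolute tolerance, or within the relative tolerance scaled by the larger magnitude.
static inline bool approx_equal(float a, float b, float abs_tol, float rel_tol) {
    const float diff = std::fabs(a - b);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

With the values from the failing cross-entropy test (185.487015 vs 185.488495), the difference of about 0.00148 exceeds an absolute margin of 0.001 but passes with 0.01, and it corresponds to a relative error of roughly 8e-6.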

RParedesPalacios commented 3 years ago

Should we close this issue? @salvacarrion @simleo

salvacarrion commented 3 years ago

Yes, it will be fixed in the next release (increase the margin + disable the HPC flag for testing)