deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

Random memory allocation errors seen in Jenkins #208

Closed: simleo closed this issue 3 years ago

simleo commented 3 years ago

In the past months we've seen memory allocation errors disrupting Jenkins builds in an unpredictable way while running PyEDDL examples. The errors come from the get_fmem EDDL utility and most of the time they are triggered by ConvolDescriptor::build, e.g.:

Error allocating 225.00MB in ConvolDescriptor::build

But there are cases where they have been triggered by Tensor::resize:

Error allocating 179.44MB in Tensor::resize

The examples that most frequently trigger the error are the ones that use the Conv layer.

One problem is that these errors don't show up consistently.

Unfortunately, given the above, this seems hard to reproduce.

Thanks!

salvacarrion commented 3 years ago

I think it could have been fixed by #211 and #210. @simleo Can you confirm?

simleo commented 3 years ago

> I think it could have been fixed by #211 and #210. @simleo Can you confirm?

This has nothing to do with the EDDL tests. It happens while running PyEDDL examples.

salvacarrion commented 3 years ago

Just guessing in case it helps:

1 - I've seen this sort of erratic error in continuous integration too. In my case, the errors were erratic because the amount of memory available on the system varied from day to day (it depended on the number of CI processes running at the same time, server load, etc.). My fix was to 1) add deletes at test time (I had forgotten them...), and 2) reduce the size of the "dummy" networks used for testing (some of them needed around 500MB of memory). This could explain why the error usually happens with convolutions, since that is the layer that needs the largest amount of memory.

2 - This could also be related to the way we detect the amount of free/available memory:

https://github.com/deephealthproject/eddl/blob/9c0e5185d2cf22c50f6bc9f227f0407cf435d95e/src/utils.cpp#L135

I suppose this is not the best way to do it.
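
For illustration, here is a minimal Linux-only sketch (not the actual get_fmem code in utils.cpp; the function name is made up) of reading MemAvailable from /proc/meminfo instead of counting only free physical pages. MemAvailable also accounts for reclaimable page cache, so on a loaded CI node the two figures can differ substantially:

// Minimal sketch, Linux-only; not the EDDL implementation.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unistd.h>

// Returns an estimate of allocatable memory in bytes.
long available_bytes() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.rfind("MemAvailable:", 0) == 0) {   // e.g. "MemAvailable:  1234567 kB"
            std::istringstream iss(line);
            std::string key, unit;
            long kb = 0;
            iss >> key >> kb >> unit;
            return kb * 1024L;
        }
    }
    // Fallback: free physical pages only; this tends to underestimate on busy
    // hosts because it ignores memory the kernel could reclaim from caches.
    return sysconf(_SC_AVPHYS_PAGES) * sysconf(_SC_PAGESIZE);
}

int main() {
    std::cout << "Available: " << available_bytes() / (1024.0 * 1024.0) << " MB\n";
    return 0;
}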

salvacarrion commented 3 years ago

@simleo Can you check if this still happens with the new release v0.8a?

simleo commented 3 years ago

> @simleo Can you check if this still happens with the new release v0.8a?

Yes. E.g., https://jenkins-master-deephealth-unix01.ing.unimore.it/job/DeepHealth/job/pyeddl/job/master/94/consoleFull

simleo commented 3 years ago

I've also seen this in the NetTestSuite.net_delete_drive_seg_concat unit test:

[ RUN      ] NetTestSuite.net_delete_drive_seg_concat
CS with full memory setup
unknown file: Failure
C++ exception with description "Error allocating 576.00MB in ConvolDescriptor::build" thrown in the test body.
[  FAILED  ] NetTestSuite.net_delete_drive_seg_concat (1100 ms)

https://jenkins-master-deephealth-unix01.ing.unimore.it/job/DeepHealth-Docker/job/libs/133/consoleFull

salvacarrion commented 3 years ago

This is the responsible line: https://github.com/deephealthproject/eddl/blob/f754672fe491322acb8c4b18393743baa8129459/tests/net/test_memory.cpp#L474

It creates a U-Net that takes a lot of memory (for a unit test). I can either downsize it or remove it.
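
As a rough sketch of the downsizing option (assuming the usual EDDL eager API; the include paths, layer sizes and test name below are hypothetical and do not reflect the real tests/net/test_memory.cpp), a much smaller convolutional net combined with an explicit delete would keep the allocation well under the sizes reported above:

#include <gtest/gtest.h>
#include <eddl/apis/eddl.h>

using namespace eddl;

// Hypothetical replacement test: a small conv net instead of a full U-Net,
// built on CPU, then deleted so its memory is released before the next test.
TEST(NetTestSuite, net_delete_small_conv) {
    layer in = Input({3, 32, 32});                        // small input shape
    layer l = MaxPool(ReLu(Conv(in, 16, {3, 3})), {2, 2});
    l = MaxPool(ReLu(Conv(l, 32, {3, 3})), {2, 2});
    layer out = Softmax(Dense(Reshape(l, {-1}), 10));
    model net = Model({in}, {out});

    // Building should not run out of memory for a net this small.
    EXPECT_NO_THROW(
        build(net,
              sgd(0.01f, 0.9f),
              {"soft_cross_entropy"},
              {"categorical_accuracy"},
              CS_CPU())
    );

    delete net;  // free the network explicitly at the end of the test
}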

simleo commented 3 years ago

Changing that test might help, though unfortunately these allocation errors keep popping up elsewhere too :(

In https://jenkins-master-deephealth-unix01.ing.unimore.it/job/DeepHealth-Docker/job/libs/131/consoleFull

RuntimeError: Error allocating 179.44MB in Tensor::updateData

Also, in the above build no. 133, two more tests failed on numerical comparisons, even though the image was compiled with HPC disabled. I've added a comment to https://github.com/deephealthproject/eddl/issues/218

RParedesPalacios commented 3 years ago

@salvacarrion is this issue solved?

salvacarrion commented 3 years ago

For the next release, the memory-heavy test will be removed.