deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

Random memory allocation errors seen in Jenkins #208

Closed: simleo closed this issue 3 years ago

simleo commented 3 years ago

In the past months we've seen memory allocation errors disrupting Jenkins builds in an unpredictable way while running PyEDDL examples. The errors come from the get_fmem EDDL utility and most of the time they are triggered by ConvolDescriptor::build, e.g.:

Error allocating 225.00MB in ConvolDescriptor::build

But there are cases where they have been triggered by Tensor::resize:

Error allocating 179.44MB in Tensor::resize

The examples that most frequently trigger the error are the ones that use the Conv layer.

One problem is that these errors don't show up consistently.

Unfortunately, given the above, this seems hard to reproduce.

Thanks!

salvacarrion commented 3 years ago

I think it could have been fixed by #211 and #210. @simleo Can you confirm?

simleo commented 3 years ago

> I think it could have been fixed by #211 and #210. @simleo Can you confirm?

This has nothing to do with the EDDL tests. It happens while running PyEDDL examples.

salvacarrion commented 3 years ago

Just guessing in case it helps:

1 - I've seen this sort of erratic error in continuous integration too. In my case, the errors were erratic because the amount of memory available on the system varied from day to day (it depended on the number of CI processes running at the same time, server load, etc.). My fix was to 1) add deletes at test time (I had forgotten them...), and 2) reduce the size of the "dummy" networks used for testing (some of them needed around 500MB of memory). This could explain why the error usually happens with convolutions, since that is the layer that needs the largest amount of memory.

2 - This could also be related to the way we detect the amount of free/available memory:

https://github.com/deephealthproject/eddl/blob/9c0e5185d2cf22c50f6bc9f227f0407cf435d95e/src/utils.cpp#L135

I suppose this is not the best way to do it.
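
For illustration, here is a minimal Linux-only sketch (not the actual get_fmem code in utils.cpp; the function name is made up) of reading MemAvailable from /proc/meminfo instead of counting only free physical pages. MemAvailable also accounts for reclaimable page cache, so on a loaded CI node the two figures can differ substantially:

// Minimal sketch, Linux-only; not the EDDL implementation.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unistd.h>

// Returns an estimate of allocatable memory in bytes.
long available_bytes() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.rfind("MemAvailable:", 0) == 0) {   // e.g. "MemAvailable:  1234567 kB"
            std::istringstream iss(line);
            std::string key, unit;
            long kb = 0;
            iss >> key >> kb >> unit;
            return kb * 1024L;
        }
    }
    // Fallback: free physical pages only; this tends to underestimate on busy
    // hosts because it ignores memory the kernel could reclaim from caches.
    return sysconf(_SC_AVPHYS_PAGES) * sysconf(_SC_PAGESIZE);
}

int main() {
    std::cout << "Available: " << available_bytes() / (1024.0 * 1024.0) << " MB\n";
    return 0;
}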

salvacarrion commented 3 years ago

@simleo Can you check if this still happens with the new release v0.8a?

simleo commented 3 years ago

> @simleo Can you check if this still happens with the new release v0.8a?

Yes. E.g., https://jenkins-master-deephealth-unix01.ing.unimore.it/job/DeepHealth/job/pyeddl/job/master/94/consoleFull

simleo commented 3 years ago

I've also seen this in the NetTestSuite.net_delete_drive_seg_concat unit test:

[ RUN      ] NetTestSuite.net_delete_drive_seg_concat
CS with full memory setup
unknown file: Failure
C++ exception with description "Error allocating 576.00MB in ConvolDescriptor::build" thrown in the test body.
[  FAILED  ] NetTestSuite.net_delete_drive_seg_concat (1100 ms)

https://jenkins-master-deephealth-unix01.ing.unimore.it/job/DeepHealth-Docker/job/libs/133/consoleFull

salvacarrion commented 3 years ago

This is the responsible line: https://github.com/deephealthproject/eddl/blob/f754672fe491322acb8c4b18393743baa8129459/tests/net/test_memory.cpp#L474

It creates a U-Net that takes a lot of memory (for a unit test). I can either downsize it or remove it.
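
As a rough sketch of the downsizing option (assuming the usual EDDL eager API; the include paths, layer sizes and test name below are hypothetical and do not reflect the real tests/net/test_memory.cpp), a much smaller convolutional net combined with an explicit delete would keep the allocation well under the sizes reported above:

#include <gtest/gtest.h>
#include <eddl/apis/eddl.h>

using namespace eddl;

// Hypothetical replacement test: a small conv net instead of a full U-Net,
// built on CPU, then deleted so its memory is released before the next test.
TEST(NetTestSuite, net_delete_small_conv) {
    layer in = Input({3, 32, 32});                        // small input shape
    layer l = MaxPool(ReLu(Conv(in, 16, {3, 3})), {2, 2});
    l = MaxPool(ReLu(Conv(l, 32, {3, 3})), {2, 2});
    layer out = Softmax(Dense(Reshape(l, {-1}), 10));
    model net = Model({in}, {out});

    // Building should not run out of memory for a net this small.
    EXPECT_NO_THROW(
        build(net,
              sgd(0.01f, 0.9f),
              {"soft_cross_entropy"},
              {"categorical_accuracy"},
              CS_CPU())
    );

    delete net;  // free the network explicitly at the end of the test
}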

simleo commented 3 years ago

Changing that test might help, though unfortunately these allocation errors keep popping up elsewhere too :(

In https://jenkins-master-deephealth-unix01.ing.unimore.it/job/DeepHealth-Docker/job/libs/131/consoleFull

RuntimeError: Error allocating 179.44MB in Tensor::updateData

Also, in the above build no. 133, two more tests failed on numerical comparisons, even though the image was compiled with HPC disabled. I've added a comment to https://github.com/deephealthproject/eddl/issues/218

RParedesPalacios commented 3 years ago

@salvacarrion is this issue solved?

salvacarrion commented 3 years ago

For the next release, the memory-heavy test will be removed.