Closed · simleo closed this issue 3 years ago
I think this could have been fixed by #211 and #210. @simleo Can you confirm?
This has nothing to do with the EDDL tests. It happens while running PyEDDL examples.
Just guessing in case it helps:
1 - I've seen these sorts of erratic errors on continuous integration builds too. In my case they were erratic because the amount of memory available on the system varied from day to day (depending on the number of CI processes running at the same time, server overload, etc.). My fix was to (a) add the deletes I had forgotten at test time, and (b) reduce the size of the "dummy" networks used for testing (some of them needed around 500MB of memory). This could explain why it usually happens with convolutions, since that is the layer that needs the largest amount of memory.
2 - This could also be related to the way we detect the amount of free/available memory (get_fmem). I suspect this is not the best way to do it.
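For reference, here is a minimal sketch of how free memory is typically queried on Linux. This is an assumption about the general technique, not EDDL's actual get_fmem implementation; the point is that any snapshot of "available" memory is stale the moment it is taken, which would match the erratic CI behavior described above.

```cpp
#include <sys/sysinfo.h>  // Linux-specific
#include <cstdio>

// Hypothetical helper (NOT EDDL's get_fmem): returns an estimate of free
// physical memory in bytes. Under CI load this value can change between
// the check and the subsequent allocation, so a passing check does not
// guarantee that the allocation will succeed.
static long free_mem_bytes() {
    struct sysinfo info;
    if (sysinfo(&info) != 0) return -1;  // query failed
    return static_cast<long>(info.freeram) * info.mem_unit;
}

int main() {
    std::printf("free: %.2f MB\n", free_mem_bytes() / (1024.0 * 1024.0));
    return 0;
}
```

Note also that sysinfo's freeram excludes reclaimable page cache, so it tends to undercount what is actually available; MemAvailable from /proc/meminfo is usually a better estimate.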
@simleo Can you check if this still happens with the new release v0.8a?
Also seen this in the NetTestSuite.net_delete_drive_seg_concat unit test:
```
[ RUN      ] NetTestSuite.net_delete_drive_seg_concat
CS with full memory setup
unknown file: Failure
C++ exception with description "Error allocating 576.00MB in ConvolDescriptor::build" thrown in the test body.
[  FAILED  ] NetTestSuite.net_delete_drive_seg_concat (1100 ms)
```
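As background on why convolutions dominate memory use: if the implementation lowers convolutions via im2col (a common strategy for CPU backends; whether ConvolDescriptor::build does exactly this is an assumption here), the unrolled patch matrix alone holds batch × (C_in · K_h · K_w) × (H_out · W_out) floats, which grows much faster than the input tensor itself. A quick back-of-the-envelope check with made-up shapes:

```cpp
#include <cstdio>

// Rough im2col workspace estimate for a float32 convolution.
// All shapes here are illustrative, not the actual test configuration.
int main() {
    long batch = 2, c_in = 64, kh = 3, kw = 3;
    long h_out = 256, w_out = 256;
    long floats = batch * c_in * kh * kw * h_out * w_out;
    double mb = floats * 4.0 / (1024.0 * 1024.0);
    std::printf("im2col workspace: %.2f MB\n", mb);  // 288 MB for these shapes
    return 0;
}
```

So allocations in the hundreds of MB for a single conv layer are entirely plausible, even for a modest-looking network.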
This is the responsible line: https://github.com/deephealthproject/eddl/blob/f754672fe491322acb8c4b18393743baa8129459/tests/net/test_memory.cpp#L474
It creates a U-Net that takes a lot of memory (for a unit test). I can either downsize it or remove it.
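Following the earlier suggestion (shrink the dummy networks and delete them at the end of each test), a downsized version might look roughly like the sketch below. The layer sizes are invented and the eddl calls are assumptions based on the public EDDL C++ API, not the actual test code:

```cpp
#include <eddl/apis/eddl.h>
using namespace eddl;

int main() {
    // Tiny stand-in for the heavy U-Net: a small spatial size and few
    // filters keep the ConvolDescriptor workspace in the low-MB range.
    layer in = Input({3, 32, 32});
    layer x = ReLu(Conv(in, 8, {3, 3}));
    layer out = Sigmoid(Conv(x, 1, {1, 1}));
    model net = Model({in}, {out});

    build(net,
          sgd(0.001f),  // optimizer
          {"mse"},      // losses
          {"mse"},      // metrics
          CS_CPU());    // CPU computing service

    // Free the network at the end of the test so repeated tests in the
    // same process don't accumulate allocations (the fix described above).
    delete net;
    return 0;
}
```

Without the final delete, a suite that builds many models in one process can exhaust memory even when each individual test would fit on its own.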
Changing that test might help, though unfortunately these allocation errors keep popping up elsewhere too :(

```
RuntimeError: Error allocating 179.44MB in Tensor::updateData
```
Also, in the above build no. 133, two more tests failed due to numerical comparisons, even though the image was compiled with HPC disabled. I've added a comment to https://github.com/deephealthproject/eddl/issues/218.
@salvacarrion is this issue solved?
For the next release, the heavy-memory tests will be removed.
In the past months we've seen memory allocation errors disrupting Jenkins builds in an unpredictable way while running PyEDDL examples. The errors come from the get_fmem EDDL utility; most of the time they are triggered by ConvolDescriptor::build, but in some cases they have been triggered by Tensor::resize. The examples that most frequently trigger the error are the ones that use the Conv layer.

One problem is that these errors don't show up consistently. Most notably, their frequency seemed to change after we switched to the "low_mem" computing service setting in all examples; however, I'm not sure whether the two events are related or if it's just more random behavior. Unfortunately, given the above circumstances, this seems hard to reproduce. Could you take a look at get_fmem and see if you can spot any potential problems? Thanks!
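For the record, this is roughly how a memory level like "low_mem" is selected when building a model, shown with the C++ API (the PyEDDL binding mirrors the same call). The exact CS_CPU signature and the layer shapes here are assumptions:

```cpp
#include <eddl/apis/eddl.h>
using namespace eddl;

int main() {
    layer in = Input({3, 64, 64});
    layer out = Sigmoid(Conv(ReLu(Conv(in, 8, {3, 3})), 1, {1, 1}));
    model net = Model({in}, {out});

    // Assumed two-argument form: thread count (-1 = auto) plus memory level.
    // "low_mem" trades speed for a smaller footprint; "mid_mem" and
    // "full_mem" (the default) are the other levels.
    build(net, sgd(0.001f), {"mse"}, {"mse"}, CS_CPU(-1, "low_mem"));

    delete net;
    return 0;
}
```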