PrometheusPi opened this issue 5 years ago
I did not encounter this one. Looks a little strange, as our memory (pre)allocation strategy aims at preventing such things from happening. Is it possible to get a call stack or other information to help figure out when it happens: at least, is it during the initialization stage or somewhere after the main PIC loop has already started?
After a brief offline discussion with @sbastrakov, here are the details of the error message:
The job crashed during startup.
I recompiled with debug symbols: `-g` added to `CMAKE_CXX_FLAGS` via `ccmake`, so that `CMAKE_CXX_FLAGS` is now `-Dlinux -g`.
I got the following error output (looks like no improvement to me 😕 - did I do something wrong?):
This is the last verbose output before the crash:
You could try these two things to find the actual problem:
I remember that I had problems of this kind, too. It seemed as if I could utilize only half of a GPU's memory (or less?). Otherwise I ran into errors.
@psychocoderHPC The simulation crashed again.
@psychocoderHPC suggested the following changes:
The added output printed:
...
PIConGPUVerbose PHYSICS(1) | 26008 MiB free memory < 350 MiB required reserved memory (else path)
PIConGPUVerbose PHYSICS(1) | 26008 MiB free memory < 350 MiB required reserved memory (else path)
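For context, here is a minimal sketch of the kind of check behind such a message: the free-memory figure comes from `cudaMemGetInfo`, and it is compared against a reserved-memory budget before the heap for dynamic memory is sized. The variable names, the exact condition, and the 350 MiB value below are assumptions for illustration, not the actual PIConGPU code.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // query how much device memory the CUDA driver reports as free
    std::size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::size_t const freeMiB = freeBytes / (1024u * 1024u);

    // assumed reserved-memory budget, matching the number in the log above
    std::size_t const reservedMiB = 350;

    if(freeMiB <= reservedMiB)
    {
        // not enough room left for the reserved part
        std::printf("%zu MiB free memory < %zu MiB required reserved memory\n", freeMiB, reservedMiB);
        return 1;
    }

    // otherwise the heap would be sized roughly as free minus reserved
    std::printf("heap size would be about %zu MiB\n", freeMiB - reservedMiB);
    return 0;
}
```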
Here is the stderr, filtered with a `grep what -B 3`:
```
Module libpng/1.6.34-GCCcore-7.3.0 unloaded.
Module libpng/1.6.34-GCCcore-7.3.0 loaded.
terminate called after throwing an instance of 'CUDA::error'
what(): /scratch/ws/...-LPWFA_till_2019-09/picongpu/thirdParty/mallocMC/src/include/mallocMC/reservePoolPolicies/SimpleCudaMalloc_impl.hpp(42): error: out of memory
--
[taurusml4:148353] [12] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x200000d853f4]
[taurusml4:148353] *** End of error message ***
terminate called after throwing an instance of 'std::runtime_error'
what(): /scratch/ws/...-LPWFA_till_2019-09/picongpu/thirdParty/alpaka/include/alpaka/stream/StreamCudaRtAsync.hpp(90) 'cudaStreamCreateWithFlags( &m_CudaStream, 0x01)' returned error : 'cudaErrorMemoryAllocation': 'out of memory'!
--
[taurusml6:64795] [ 4] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x200000b3a5e0]
[taurusml6:64795] [ 5] /sw/installed/GCCcore/7.3.0/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x200000b3aa90]
terminate called after throwing an instance of 'CUDA::error'
what(): /scratch/ws/...-LPWFA_till_2019-09/picongpu/thirdParty/mallocMC/src/include/mallocMC/reservePoolPolicies/SimpleCudaMalloc_impl.hpp(42): error: out of memory
--
[taurusml6:64795] [15] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x200000d853f4]
[taurusml6:64795] *** End of error message ***
terminate called after throwing an instance of 'CUDA::error'
what(): /scratch/ws/...-LPWFA_till_2019-09/picongpu/thirdParty/mallocMC/src/include/mallocMC/reservePoolPolicies/SimpleCudaMalloc_impl.hpp(42): error: out of memory
--
srun: error: taurusml4: task 6: Aborted
srun: Terminating job step 14399826.1
terminate called after throwing an instance of 'CUDA::error'
what(): /scratch/ws/...-LPWFA_till_2019-09/picongpu/thirdParty/mallocMC/src/include/mallocMC/reservePoolPolicies/SimpleCudaMalloc_impl.hpp(42): error: out of memory
```
Node | JOB ID | status |
---|---|---|
taurusml1 | 14403281 | defective |
taurusml2 | 14403298 | defective |
taurusml3 | 14403318 | defective |
taurusml4 | 14403329 | defective |
taurusml5 | 14403334 | defective |
taurusml6 | 14403344 | defective |
taurusml7 | 14403350 | defective |
taurusml8 | 14403356 | defective |
taurusml9 | 14403363 | working |
taurusml10 | 14403367 | working |
taurusml11 | 14403376 | working |
taurusml12 | 14403388 | defective |
taurusml13 | 14403392 | working |
taurusml14 | 14403399 | defective |
taurusml15 | 14403411 | defective |
taurusml16 | 14403421 | working |
taurusml17 | 14403429 | working |
taurusml18 | 14403437 | working |
taurusml19 | 14403446 | working |
taurusml20 | 14403454 | working |
taurusml21 | 14403470 | working |
taurusml22 | 14403474 | working |
taurusml23 | 14403480 | defective |
taurusml24 | 14403484 | defective |
taurusml25 | 14403489 | working |
taurusml26 | 14403496 | working |
taurusml27 | 14403503 | working |
taurusml28 | 14403509 | working |
taurusml29 | 14403518 | defective |
taurusml30 | 14403543 | working |
taurusml31 | 14403554 | working |
taurusml32 | 14403575 | defective |
We are currently doing more tests, but I think the problem is caused by memory fragmentation.
We are querying the amount of free memory with the CUDA function `cudaMemGetInfo`. It looks like the free amount reported by this call does not take half-filled pages into account.
Since we first allocate all memory that has a fixed size, we also end up with a lot of small allocations, and those allocations may get placed on different memory pages. When we then allocate the large memory chunk for the mallocMC heap, the driver is not able to find enough free memory pages.
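To illustrate the suspected scenario, here is a minimal stand-alone sketch (not PIConGPU or mallocMC code, and it may well succeed on a healthy node): many small allocations are made first, `cudaMemGetInfo` is queried, and then a single allocation of almost the entire reported free memory is attempted, which is the step that fails in the logs above. The buffer count, buffer size, and reserve value are made up for the example.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    // many small fixed-size allocations first (fields, buffers, ...)
    std::vector<void*> small(1024, nullptr);
    for(auto& p : small)
        cudaMalloc(&p, 1u << 20); // 1 MiB each

    // query how much memory the driver reports as free
    std::size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::printf("reported free: %zu MiB\n", freeBytes / (1024u * 1024u));

    // now try to grab almost all of it in one contiguous chunk,
    // similar to the single large heap allocation done for mallocMC
    std::size_t const reserve = 350u * 1024u * 1024u; // assumed reserve
    std::size_t const request = freeBytes > reserve ? freeBytes - reserve : freeBytes;
    void* heap = nullptr;
    cudaError_t const err = cudaMalloc(&heap, request);
    std::printf("large allocation of %zu MiB: %s\n", request / (1024u * 1024u), cudaGetErrorString(err));

    // cleanup
    cudaFree(heap);
    for(auto p : small)
        cudaFree(p);
    return 0;
}
```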
I found an issue; it is very old, but it also points to this problem.
TODO for tomorrow:
The tests on all nodes are now complete:
only 17 of 32 nodes are usable for our simulations. Thus, two standard L(P)WFA simulations using 9 nodes each will not run in parallel.
I reran the test over all nodes. So far it is far from finished, but it looks like taurusml30 now also
has to be considered defective. (This might be caused by the changes @psychocoderHPC and I added in the debugging process.)
Wouldn't you expect a change whenever you run? I would expect, if the error is related to the current memory layout, that the error is also related to previous GPU usage or at least the somewhat random memory allocation at initialization. Or is my understanding of the problem wrong?
Crashes do not necessarily depend on the previous usage. (Only if more pages became faulty.)
Currently it looks like the set of defective nodes is mostly reproducible - but I will have a detailed look at that today.
As far as I understand @psychocoderHPC, the cause of the error is either a bug on our side (in how we allocate memory) or strange behavior of nvcc. He can explain it in more detail.
This BUG is maybe related to https://github.com/ComputationalRadiationPhysics/alpaka/issues/850
The problem is that mallocMC calls its kernels without specifying any stream, i.e. on the default stream. Normally this would block all other streams, but since we create our streams with `cudaStreamCreateWithFlags` and disable this blocking behavior, we need to review PIConGPU and mallocMC for side effects.
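A minimal sketch of the CUDA behavior described above (the kernel and buffers are made up for illustration, not taken from mallocMC or PIConGPU): kernels launched without a stream go to the legacy default stream, which synchronizes with ordinary blocking streams but not with streams created as non-blocking, the `0x01` flag visible in the alpaka call in the log.

```cpp
#include <cuda_runtime.h>

__global__ void dummyKernel(int* data)
{
    if(data != nullptr)
        *data = 42;
}

int main()
{
    int* d = nullptr;
    cudaMalloc(&d, sizeof(int));

    // 0x01 == cudaStreamNonBlocking: this stream does not synchronize
    // with the legacy default stream
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    dummyKernel<<<1, 1>>>(d);       // default stream, like a mallocMC-internal kernel
    dummyKernel<<<1, 1, 0, s>>>(d); // non-blocking stream, like PIConGPU's own work

    // With a stream created via plain cudaStreamCreate, the second launch would
    // implicitly wait for the first; with cudaStreamNonBlocking the two launches
    // may overlap, so any implicit ordering assumption becomes a side effect to review.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```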
The last code fix @psychocoderHPC proposed solved the issue. There should be a fix out soon.
@psychocoderHPC: @StillerPatrick now also has issues on the V100 nodes that we see as defective. He is certain it has to do with both memory and MPI.
EDIT:
Sorry - this was just a cluster hiccup - no job ran so far.
EDIT2: fixed
Node | status |
---|---|
taurusml1 | defective |
taurusml2 | defective |
taurusml3 | defective |
taurusml4 | defective |
taurusml5 | defective |
taurusml6 | defective |
taurusml7 | defective |
taurusml8 | defective |
taurusml9 | defective |
taurusml10 | defective |
taurusml11 | defective |
taurusml12 | defective |
taurusml13 | defective |
taurusml14 | works |
taurusml15 | works |
taurusml16 | defective |
taurusml17 | defective |
taurusml18 | defective |
taurusml19 | defective |
taurusml20 | defective |
taurusml21 | defective |
taurusml22 | defective |
taurusml23 | defective |
taurusml24 | defective |
taurusml25 | defective |
taurusml26 | defective |
taurusml27 | defective |
taurusml28 | defective |
taurusml29 | defective |
taurusml30 | defective |
taurusml31 | defective |
taurusml32 | defective |
It looks like there have been changes.
@psychocoderHPC: Was the fix mentioned in this issue ever pushed into mainline? It looks as if @PrometheusPi encounters the same error again, see #3433.
No, this never went into the mainline. The workaround/fix only fixes a broken system. We will write a reproducer and hopefully show that this is a driver issue and that the nodes should be restarted.
I recently started using the V100 nodes on the ml partition of Taurus again. With a setup very similar to one that already ran, I got the following mallocMC error:
@steindev @sbastrakov @psychocoderHPC Have you encountered such a false out-of-memory error on the ml partition before? If yes, how did you solve it?
UPDATE: Before this, I tested a simple default LWFA setup with 32 GPUs, and that worked fine.