erdc / air-water-vv

Verification and validation tests for computational models of air/water flow
MIT License

code hangs on multi-cores #152

Closed yuxianglin closed 6 years ago

yuxianglin commented 7 years ago

Hi:

I am trying to run the solitary wave demo case (https://github.com/erdc-cm/air-water-vv/tree/adimako/solitaryWave/2d/solitary_wave/solitary_wave_01). The case runs perfectly in serial and in parallel on 4 cores, but hangs on more cores (such as 8 or 16) when it begins time stepping. I attached gdb to it and got a backtrace like this:

(gdb) bt

#0 0x00002aaab383067d in MPIDI_Cray_shared_mem_coll_bcast () from /opt/cray/lib64/libmpich_gnu_49.so.3
#1 0x00002aaab3840bbf in MPIR_CRAY_Barrier () from /opt/cray/lib64/libmpich_gnu_49.so.3
#2 0x00002aaab376a263 in MPIR_Barrier_impl () from /opt/cray/lib64/libmpich_gnu_49.so.3
#3 0x00002aaab376ac29 in PMPI_Barrier () from /opt/cray/lib64/libmpich_gnu_49.so.3
#4 0x00002aaab1d06540 in PetscCommDuplicate () from /work/04328/shawnlin/lonestar/proteus/lonestar.gnu/lib/libpetsc.so.3.7
#5 0x00002aaab1d0aa5d in PetscHeaderCreate_Private () from /work/04328/shawnlin/lonestar/proteus/lonestar.gnu/lib/libpetsc.so.3.7
#6 0x00002aaab1ecedf5 in VecCreate () from /work/04328/shawnlin/lonestar/proteus/lonestar.gnu/lib/libpetsc.so.3.7
#7 0x00002aaab1fb8af5 in MatCreateVecs () from /work/04328/shawnlin/lonestar/proteus/lonestar.gnu/lib/libpetsc.so.3.7
#8 0x00002aaab28c8674 in KSPCreateVecs () from /work/04328/shawnlin/lonestar/proteus/lonestar.gnu/lib/libpetsc.so.3.7
#9 0x00002aaab286f170 in KSPSetUp_GMRES () from /work/04328/shawnlin/lonestar/proteus/lonestar.gnu/lib/libpetsc.so.3.7
#10 0x00002aaab28afcbd in KSPSetUp () from /work/04328/shawnlin/lonestar/proteus/lonestar.gnu/lib/libpetsc.so.3.7
#11 0x00002aaab17a0699 in __pyx_pf_8petsc4py_5PETSc_3KSP_92setUp (__pyx_v_self=0x2aaad4cbdcb0) at src/petsc4py.PETSc.c:153327
#12 __pyx_pw_8petsc4py_5PETSc_3KSP_93setUp (__pyx_v_self=0x2aaad4cbdcb0, __pyx_args=, __pyx_kwds=0x0) at src/petsc4py.PETSc.c:22235
#13 0x00002aaaaaddb4d7 in call_function (oparg=, pp_stack=0x7fffffffa2e0) at Python/ceval.c:4350
#14 PyEval_EvalFrameEx (f=f@entry=0x2aaae70a6b60, throwflag=throwflag@entry=0) at Python/ceval.c:2987
#15 0x00002aaaaaddc8b0 in PyEval_EvalCodeEx (co=, globals=, locals=locals@entry=0x0, args=, argcount=argcount@entry=1, kws=0x1d0be220, kwcount=1, defs=0x2aaac7ff05e8, defcount=1, closure=0x0) at Python/ceval.c:3582
#16 0x00002aaaaaddb85b in fast_function (nk=, na=1, n=, pp_stack=0x7fffffffa4f0, func=) at Python/ceval.c:4446
#17 call_function (oparg=, pp_stack=0x7fffffffa4f0) at Python/ceval.c:4371

I found that one cure is to set the initial condition to zero or another float at this line, https://github.com/erdc-cm/air-water-vv/blob/adimako/solitaryWave/2d/solitary_wave/solitary_wave_01/twp_navier_stokes_p.py#L102 and L109, or to replace the wave function with a different wave. This is very weird, because the returned velocity is just a small float. I am not sure whether this is due to the PETSc build or a conflict in the code...

Any advice would be appreciated!

Yuxiang Lin
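The gdb backtrace above shows where one rank is stuck at the C level. A complementary Python-level view of each rank can be obtained with the standard-library `faulthandler` module (Python 3.3 and later). The following is a minimal sketch, not part of Proteus; the helper names, the `rank` argument, and the log file naming are assumptions made for illustration.

```python
import faulthandler
import os


def arm_hang_watchdog(timeout_s, rank=None, log_dir="."):
    """Dump all Python thread stacks to a per-rank file after timeout_s seconds.

    rank and log_dir are hypothetical parameters for this sketch; in an MPI
    run the rank would normally be taken from the communicator.
    """
    tag = str(os.getpid()) if rank is None else "rank%d" % rank
    path = os.path.join(log_dir, "hang_%s.txt" % tag)
    log = open(path, "w")
    # exit=False: only report the stacks and leave the process running,
    # so that gdb can still be attached afterwards.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log, exit=False)
    return path, log


def disarm_hang_watchdog(log):
    """Cancel the pending dump once the suspect section has completed."""
    faulthandler.cancel_dump_traceback_later()
    log.close()
```

Arming the watchdog just before time stepping starts and comparing the per-rank files would show whether all ranks are waiting in the same Python call or whether some rank took a different path.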

adimako commented 7 years ago

@yuanglin. This is very strange. To be honest, I do not know whether we have tried using WaveTools for an initial condition before. @alistairbnt, @tridelat, @giocozz, @cekees, any ideas? Taking a shot in the dark, I would say that we should make sure that the initial condition is written by one processor only.

adimako commented 7 years ago

But I do see that there is a Barrier command involved, so it should not be this.

yuxianglin commented 7 years ago

The barrier seems to be inside a PETSc function.
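In the backtrace, the barrier is reached through `VecCreate` and `PetscCommDuplicate`. `MPI_Barrier` is a collective operation, so every rank of the communicator has to enter it; if even one rank follows a different code path, the remaining ranks wait indefinitely. The sketch below is only an analogy using Python threads and `threading.Barrier` in place of MPI ranks. It does not reproduce the Proteus or PETSc code path, and `run_ranks` and `skipping_rank` are hypothetical names.

```python
import threading


def run_ranks(n_ranks, skipping_rank=None, timeout_s=1.0):
    """Simulate n_ranks workers meeting at one barrier; return per-rank status."""
    barrier = threading.Barrier(n_ranks)
    status = {}

    def worker(rank):
        if rank == skipping_rank:
            # This rank takes a different code path and never enters the barrier.
            status[rank] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout_s)
            status[rank] = "passed"
        except threading.BrokenBarrierError:
            # Real MPI has no timeout here: the rank would block forever.
            status[rank] = "stuck"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


print(run_ranks(4))                   # every rank reaches the barrier
print(run_ranks(4, skipping_rank=2))  # one rank never arrives, the others get stuck
```

The timeout exists only so that the demonstration terminates; in the MPI case the stuck ranks would look like the hang reported here.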

adimako commented 7 years ago

The other issue I am thinking of is that the solitaryWave class is the only one that has not been included in the .pxd file. It is suspicious that you do not get this problem with the other wave classes.

yuxianglin commented 7 years ago

@adimako Unfortunately, it still hangs even after I add those cdef declarations and recompile.

yuxianglin commented 7 years ago

@adimako Another cure is to reduce the tank's horizontal dimension https://github.com/erdc-cm/air-water-vv/blob/adimako/solitaryWave/2d/solitary_wave/solitary_wave_01/tank.py#L14 from 60 to 20, for example. I still wonder how this could make multi-core runs hang.

adimako commented 7 years ago

@yuxianglin. I am almost convinced that it's the probes. Can you switch off the probes and try again? Also, I think you're placing probes outside of your domain. Can you confirm whether this is true?
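Whether any probe lies outside the domain can be checked independently of the solver. The following is a minimal sketch in plain Python; the function name, the tank height, and the gauge coordinates are hypothetical values chosen for illustration, with only the tank length of 60 taken from this thread.

```python
def points_outside_domain(points, lower, upper):
    """Return the points that are not inside the axis-aligned box [lower, upper]."""
    outside = []
    for p in points:
        inside = all(lo <= c <= hi for c, lo, hi in zip(p, lower, upper))
        if not inside:
            outside.append(tuple(p))
    return outside


# Hypothetical 2D tank, 60 units long and 1.5 units high, with three gauges.
tank_lower = (0.0, 0.0)
tank_upper = (60.0, 1.5)
gauges = [(10.0, 0.5), (30.0, 0.5), (75.0, 0.5)]

print(points_outside_domain(gauges, tank_lower, tank_upper))  # [(75.0, 0.5)]
```

Running such a check with the actual gauge coordinates and tank dimensions from tank.py would confirm or rule out the out-of-domain hypothesis before touching the parallel run.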

yuxianglin commented 7 years ago

@adimako Do you mean switching off the gauge output? In the demo case it https://github.com/erdc-cm/air-water-vv/blob/adimako/solitaryWave/2d/solitary_wave/solitary_wave_01/tank.py#L17 is already false, and even if I comment out all the lines that attach gauges, the code still hangs.

adimako commented 7 years ago

Correct, I missed that. It still puzzles me a bit, as I have been running a numerical wave flume with arbitrary numbers of processors and it looks fine. So it does look like this is the initial conditions. For now I would set the initial conditions to zero and generate the solitary wave later in the simulation (e.g. move the frame of reference upstream of the domain boundary).
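The suggested workaround can be sketched as follows, assuming the p-file expects initial-condition objects that expose a `uOfXT(x, t)` method, as the existing classes in the case appear to. `AtRest` is the zero initial condition; `CheckedIC` is a hypothetical wrapper (not part of Proteus) that could help rule out a non-float or non-finite value coming back from the wave function. It is a diagnostic aid for Python 3, not a diagnosis of this hang.

```python
import math


class AtRest:
    """Initial condition that is zero everywhere (fluid initially at rest)."""

    def uOfXT(self, x, t):
        return 0.0


class CheckedIC:
    """Wrap a velocity function and make sure it yields a plain, finite float."""

    def __init__(self, velocity_fn):
        # velocity_fn(x, t) is a hypothetical callable standing in for the wave model.
        self.velocity_fn = velocity_fn

    def uOfXT(self, x, t):
        value = float(self.velocity_fn(x, t))
        if not math.isfinite(value):
            raise ValueError("non-finite initial velocity at x=%r: %r" % (x, value))
        return value
```

Swapping `AtRest()` in for the velocity components reproduces the zero-initial-condition run, while `CheckedIC` keeps the wave-based values but fails loudly on the rank that receives a bad one.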

yuxianglin commented 7 years ago

@adimako So your machine does not hang for this demo case on multiple cores? Hmm, is it maybe because Proteus is not properly configured on my TACC machines?

I suppose that setting the velocity IC to zero would require trans https://github.com/erdc-cm/air-water-vv/blob/adimako/solitaryWave/2d/solitary_wave/solitary_wave_01/tank.py#L23 to be a moderately large negative number, so it would take some time for the wave to reach the center of the flume.
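The extra waiting time can be estimated from the first-order solitary wave celerity, c = sqrt(g (h + H)), where h is the still-water depth and H the wave height. The sketch below assumes that trans shifts the initial crest position and that SI units are used; the depth, height, and crest offset are hypothetical values, not taken from tank.py.

```python
import math


def solitary_wave_celerity(depth, height, g=9.81):
    """First-order solitary wave celerity, c = sqrt(g * (depth + height))."""
    return math.sqrt(g * (depth + height))


def arrival_time(crest_start_x, target_x, depth, height, g=9.81):
    """Time for the crest to travel from crest_start_x to target_x."""
    return (target_x - crest_start_x) / solitary_wave_celerity(depth, height, g)


# Hypothetical values: crest starts 10 m upstream of a 60 m tank,
# still-water depth 1.0 m, wave height 0.1 m.
t_center = arrival_time(crest_start_x=-10.0, target_x=30.0, depth=1.0, height=0.1)
print("crest reaches the tank center after about %.1f s" % t_center)
```

With these numbers the crest needs roughly 12 s of simulated time to reach the center, which gives a feel for the cost of starting the wave outside the domain.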

adimako commented 7 years ago

To be honest, I have not had the time to try it, but I will give it a go on 14x processors tomorrow.

adimako commented 7 years ago

I cannot run the tank at all with the latest commit on master. I would advise taking this case https://github.com/erdc-cm/air-water-vv/tree/master/2d/numericalTanks/nonlinearWaves and modifying the waveTools class (and initial conditions) to account for a solitary wave. It might be that the case is a bit out of date.

yuxianglin commented 7 years ago

@adimako Is it because the dragAlpha argument is not present in https://github.com/erdc-cm/air-water-vv/blob/adimako/solitaryWave/2d/solitary_wave/solitary_wave_01/tank.py#L88 ?

adimako commented 7 years ago

No, no, I fixed that. It is something else that I am getting, like an empty module error.

yuxianglin commented 7 years ago

@adimako Did your code hang for this demo on more than 8 cores?

adimako commented 7 years ago

@linyx199071 I could not run the case at all, even after fixing the arguments. At this stage, I would try using the 2d nonlinear tank in air-water-vv and changing the wave generation to solitary waves. If you do that and it still fails, we will have to take a closer look at the way the initial conditions are imposed.