ExtremeFLOW / neko

/ᐠ. 。.ᐟ\ᵐᵉᵒʷˎˊ˗
https://neko.cfd/

Simulation freeze when using "probes" #1239

Closed vbaconnet closed 2 months ago

vbaconnet commented 2 months ago

Problem observed

Running the rayleigh-benard-cylinder case with more than 1 rank causes the simulation to freeze at this point in the log:

   -----Starting simulation------  
   T  : [  0.0000000E+00,  0.2500000E+03)
   dt :    0.2000000E-02

      --------Postprocessing--------  

          --------Writer output---------  
          File name     : field.fld
          Output number :     0

The probes output file is created and its data is written to disk as expected. However, the field.fld output does not behave properly: the file is created, but nothing is written to it.

Steps to reproduce

cd examples/rayleigh-benard-cylinder
makeneko rayleigh.f90
mpirun -n 2 ./neko rayleigh.case

What I tried

With a different type of simulation component, or with the simulation components removed from the case file entirely, the issue does not appear.

Running with only 1 rank also avoids the issue.

Commenting out the header related lines in probes.F90 (here) did not fix the problem.

vbaconnet commented 2 months ago

It looks like the freeze is happening in fld_file.f90, here

njansson commented 2 months ago

Is this system-specific, or does it always occur?

vbaconnet commented 2 months ago

I observed the issue on my workstation (Debian, GNU Fortran) and on the nj computer. It also occurs on Dardel GPU with Cray Fortran, although there the program freezes before the probes file is even created and written.

MartinKarp commented 2 months ago

This is a bug related to the sequential nature of the I/O performed on CSV files.

As a result, the generic check_exists halts the simulation, because it issues a broadcast internally.

I am finalizing my thesis for printing, so I don't have much time to look into it further, but commenting out this line helps.

https://github.com/ExtremeFLOW/neko/blob/c10896c1a0d5c5a54cb1e056239014733e350ef5/src/io/csv_file.f90#L186

I might be able to have more of a look at the problem late next week.

njansson commented 2 months ago

A quick fix would be to do the check on rank 0 only, with a barrier afterwards:

if (pe_rank .eq. 0) then
  call this%check_exists() 
end if
call MPI_barrier(NEKO_COMM)

njansson commented 2 months ago

No, that doesn't fix it. I'll have a look today.

MartinKarp commented 2 months ago

The issue is the call to MPI_Bcast in check_exists. It is a collective operation and must be called on all ranks; otherwise the collective calls are mismatched.
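
To illustrate the mismatch, here is a minimal standalone sketch of an existence check in which only rank 0 touches the file system while every rank takes part in the broadcast. It is not Neko's actual check_exists: it assumes the mpi_f08 module, uses MPI_COMM_WORLD in place of NEKO_COMM, and the program name and file name are hypothetical.

```fortran
! Minimal sketch, not Neko's implementation: rank 0 checks whether a file
! exists and the result is shared with all ranks through a collective call.
program check_exists_sketch
  use mpi_f08
  implicit none

  integer :: rank, ierr
  logical :: file_exists
  character(len=*), parameter :: fname = 'output.csv' ! hypothetical file name

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Only rank 0 queries the file system.
  file_exists = .false.
  if (rank .eq. 0) then
     inquire(file=fname, exist=file_exists)
  end if

  ! MPI_Bcast is collective: every rank in the communicator must reach
  ! this call. Moving it inside the "if (rank .eq. 0)" block above would
  ! leave the collective calls mismatched between ranks.
  call MPI_Bcast(file_exists, 1, MPI_LOGICAL, 0, MPI_COMM_WORLD, ierr)

  write(*, '(A,I0,A,L1)') 'rank ', rank, ': file exists = ', file_exists

  call MPI_Finalize(ierr)
end program check_exists_sketch
```

With a typical MPI installation this can be built with the compiler wrapper (for example mpifort) and run with mpirun -n 2. Each rank should then report the same value. Wrapping a routine that contains the broadcast in a rank-0-only branch, as in the quick fix above, means the other ranks never make the matching call.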