DOI-USGS / COAWST

COAWST modeling system git repository

Problem with information transfer between SWAN and ROMS #328

Open Tucumeu opened 1 month ago

Tucumeu commented 1 month ago

Dear all,

I am using the 3.8 version to try to replicate a case that worked fine with the previous one (3.7), but I am running into an unexpected problem.

I have a smaller domain B nested into a larger grid A, and I want to run a coupled ROMS+SWAN two-way nesting simulation. All the input files are from the successful v3.7 run, so I assume they are OK. However, when I run the v3.8 model it works for a short while and then crashes with a segmentation fault. I presume this happens when SWAN is trying to send wave data to ROMS, because I get the following on-screen message:

MCT::mAttrVect::indexRA:: FATAL--attribute not found: "DISBOT"
Traceback:
|X|MCT::mAttrVect::indexRA

When I run the coupled models in each domain separately, i.e., ROMS+SWAN in domain A and the same in domain B, both simulations work well, so the issue appears only when I combine both models and both domains. I have regenerated the connectivity and SCRIP files several times, but the problem is still there. Any idea why this could happen?

The cluster I used for v3.7 is not the same as the one I am using now for v3.8, but on the latter the Inlet_test/Refined case runs fine, so I presume the problem is not related to the COAWST installation.

Thanks

jcwarner-usgs commented 1 month ago

When the simulation starts, one of the first things it does is a coupling exchange. Did this happen? Did you see the DISBOT exchange, something like

SWANtoROMS Min/Max DISBOT (Wm-2): 0.000000E+00 3.278870E-05
...

If so, then the DISBOT field is active. It would be strange for an MCT call at some later time during that same simulation to say the DISBOT attribute is not found. Can you send the full stdout of that run? -j

Tucumeu commented 1 month ago

Hi John, Yes, there is an initial exchange between both SWAN grids to both ROMS grids.

I am attaching the log file for one of the failed runs.

input_C.txt

jcwarner-usgs commented 1 month ago

Can you set NINFO = 1 and rerun that? There is a lot of info that is not being printed to that file. Is there also an error output file? The error you report is not in that file.
I really think that ROMS may have blown up, and you are not seeing that written to the screen.

Tucumeu commented 1 month ago

Yes, I've set NINFO = 1, also changed TI_OCN2WAV in coupling.in to make it an exact multiple of the ROMS time step DT (just in case...), and reran the case. I'm attaching the new log file. The error I mention is not printed to the run log file, but it is printed to the screen or to the slurm out file, which I also attach. On the other hand, SWAN generates no error file; I've checked the SWAN PRINT files and they all look fine to me, without any error messages.


jcwarner-usgs commented 1 month ago

Oh, yes, you need to have the ROMS dt divide evenly into the coupling interval.
You also need to have the SWAN dt divide evenly into the coupling interval. How do you submit the job? What is the command line? Something like mpirun -np X ./coawstM input.file &> output_file

Also, the log file was not attached.
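The divisibility requirement above can be checked before submitting a job. The following is a minimal Python sketch, not part of COAWST: the function name divides_evenly and all numeric values are hypothetical placeholders, to be replaced by the coupling interval from coupling.in and the time steps from the ROMS and SWAN input files.

```python
from fractions import Fraction

def divides_evenly(coupling_interval_s, dt_s):
    """Return True if dt_s fits a whole number of times into coupling_interval_s.

    Values go through str() into Fraction so that decimal time steps
    such as 0.5 are compared exactly rather than in floating point.
    """
    ratio = Fraction(str(coupling_interval_s)) / Fraction(str(dt_s))
    return ratio.denominator == 1

# Hypothetical values in seconds, for illustration only.
coupling_interval = 1800
time_steps = {"ROMS dt": 18, "SWAN dt": 600, "bad dt": 35}

for name, dt in time_steps.items():
    status = "OK" if divides_evenly(coupling_interval, dt) else "does NOT divide evenly"
    print(f"{name} = {dt} s: {status}")
```

With nested grids, the same check would presumably be repeated for the time step of every ROMS and SWAN grid.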

Tucumeu commented 1 month ago

True, sorry. I attach it now. The command I use is mpirun -np 30 coawstM coupling.in > test.log, assigning 6 mpi nodes to SWAN and 24 to ROMS.

test.log

slurm-10688509.log

jcwarner-usgs commented 1 month ago

This is strange. At the beginning all the models exchange:

== SWAN grid 1 sent wave data to ROMS grid 1
** ROMS grid 1 recv data from SWAN grid 1
SWANtoROMS Min/Max DISBOT (Wm-2): 0.000000E+00 0.000000E+00
SWANtoROMS Min/Max DISSURF (Wm-2): 0.000000E+00 0.000000E+00
...

Then ROMS goes to 30 minutes:

100 2022-01-01 00:30:00.00 2.199223E-03 3.167227E+02 3.167249E+02 8.086213E+10 01 (081,082,20) 0.000000E+00 2.834716E-03 2.372127E+00 1.497969E-01

and then SWAN to 30 minutes:

+time 20220101.003000 , step 3; iteration 12; sweep 4 grid 2
== SWAN grid 1 sent wave data to ROMS grid 1

Then you get that error:

MCT::mAttrVect::indexRA:: FATAL--attribute not found: "DISBOT"
Traceback:
|X|MCT::mAttrVect::indexRA
01B.MCT(MPEU)::die.: from MCT::mAttrVect::indexRA()
[gs30r3b04:3453727:0:3453727] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

But DISBOT already existed.

Can I see your swan.in? I am not sure why you have so many iterations per step.

When you run it, try mpirun -np 30 coawstM coupling.in &> test.log

Can you look in the ROMS his file?
Can you change the coupling to be every 10 min? Does it always stop at the first coupling exchange (after init)? -j

can i see your
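To see which fields were reported as exchanged before the crash, one could scan the combined stdout/stderr log for the message patterns quoted above. This is a minimal Python sketch, assuming the log lines look like the excerpts in this thread; summarize_log is a hypothetical helper, not a COAWST tool.

```python
import re

# Patterns taken from the log excerpts quoted in this thread.
EXCHANGE = re.compile(r"SWANtoROMS Min/Max (\w+)")
FATAL = re.compile(r'FATAL--attribute not found: "(\w+)"')

def summarize_log(text):
    """Count SWAN-to-ROMS field reports and collect MCT 'attribute not found' names."""
    exchanged = {}
    missing = []
    for line in text.splitlines():
        for field in EXCHANGE.findall(line):
            exchanged[field] = exchanged.get(field, 0) + 1
        missing.extend(FATAL.findall(line))
    return exchanged, missing

# Sample lines copied from the excerpts above.
sample = """\
SWANtoROMS Min/Max DISBOT (Wm-2): 0.000000E+00 0.000000E+00
SWANtoROMS Min/Max DISSURF (Wm-2): 0.000000E+00 0.000000E+00
MCT::mAttrVect::indexRA:: FATAL--attribute not found: "DISBOT"
"""
exchanged, missing = summarize_log(sample)
print("fields exchanged:", exchanged)
print("attributes reported missing:", missing)
```

For a real run, the text would come from the file that captures both stdout and stderr, for example the test.log written by the `&>` redirection.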

Tucumeu commented 2 weeks ago

John,

Apologies for the late reply. I am attaching the SWAN input files for both domains, together with the log of the new run using your suggestion (mpirun -np 30 coawstM coupling.in &> test.log). To answer your questions: the data in the history files make sense, and yes, the run always crashes the second time it tries to exchange information between models. In the meantime, I have run the exact same simulation with the same version of COAWST on a different cluster, and it worked properly, so it seems that the problem is not the input files but the machine itself or some compilation option. I would rule out the latter, since the build_coawst.bash I use is the same as the one for the Inlet Test Refined case, which runs fine.

M


Tucumeu commented 2 weeks ago

Here are the files, with the SWAN .in files renamed to .txt:

input_B_AB.txt

test.log

input_A_AB.txt