Exawind / amr-wind

AMReX-based structured wind solver
https://exawind.github.io/amr-wind

OpenFAST chkp corrupted on Frontier #984

Closed by marchdf 6 months ago

marchdf commented 7 months ago

OpenFAST checkpoint files aren't closed properly. This happens on every machine we use, but it has only resulted in corrupted chkp files on Frontier (that we've seen so far).

The reason for this is that OpenFAST assumes turbine IDs use Fortran numbering (starting at 1), but we interface to the library using C numbering (starting at 0). So the following check (here):

   IF (Turbine%TurbID == NumTurbines .OR. .NOT. PRESENT(Unit)) THEN
      CLOSE(unOut)
      unOut = -1
   END IF

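never gets hit: with C numbering the turbine IDs run from 0 to NumTurbines-1, so TurbID == NumTurbines is never true, and the chkp file never gets closed.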

The fix I am thinking about right now is to send a "Fortran ID" to OpenFAST from amr-wind, but this causes a segfault somewhere else. I still need to track that down.

@psakievich I am not sure how/if this affects the way Nalu-Wind interfaces with OpenFAST as well.

psakievich commented 7 months ago

In nalu-wind we rely on the openfast-cpp interface which also loops over indices starting at 0: https://github.com/OpenFAST/openfast/blob/4b6337fcffe859c5eeb5445deeef2046439e5152/glue-codes/openfast-cpp/src/OpenFAST.cpp#L58-L78

We'll have to dig deeper into OpenFAST to see if this offset is handled inside the OpenFAST data structures. @gantech do you know offhand?

marchdf commented 7 months ago

Yeah that's interesting. Can you throw a print statement in that if condition I mentioned and see if it ever hits that close call? I am worried this isn't being handled right with the nalu-wind code path as well.

psakievich commented 7 months ago

I'm not in a position to test this at the moment. I can put it on my backlog though. This would be more of an openfast core issue than a nalu-wind issue.

lawrenceccheung commented 7 months ago

I tried a naive approach to get rid of the off-by-1 error. Basically I replaced this line https://github.com/Exawind/amr-wind/blob/d4dd236b4c00d20ac024003433ce0036a179914a/amr-wind/wind_energy/actuator/turbine/fast/FastIface.cpp#L167 with

        auto my_tid_local = fi.tid_local+1;
        fast_func(FAST_CreateCheckpoint, &my_tid_local, rst_file);

However that also resulted in segfaults when writing out the chkp files. So in the end I just commented out the if else check in https://github.com/OpenFAST/openfast/blob/4b6337fcffe859c5eeb5445deeef2046439e5152/modules/openfast-library/src/FAST_Subs.f90#L7090 that @marchdf mentioned above.

Lawrence

marchdf commented 7 months ago

Yeah, that's not the right fix. It turns out this is a bit messed up. You need to increment global_id by one before passing it in. That sets TurbId, which would then fix the chkp close check. The problem is that, on initialization, that argument doesn't get passed through correctly: the call site of FAST_InitializeAll_T and the function's argument list don't match.

The definition of the function (https://github.com/OpenFAST/openfast/blob/main/modules/openfast-library/src/FAST_Subs.f90#L37) has TurbId as its second parameter, so that's fine. However, the call site (https://github.com/OpenFAST/openfast/blob/main/modules/openfast-library/src/FAST_Library.f90#L144) passes iTurb to that argument, not TurbId.

I am in conversation with @andrew-platt for other fixes.

marchdf commented 7 months ago

Tracking this here: https://github.com/OpenFAST/openfast/issues/2064

andrew-platt commented 6 months ago

I did a little digging into this. There is a discrepancy between how we handle the turbine ID in FAST.Farm and in the cpp interface. In FAST.Farm, we index the turbine array with Fortran's starting index of 1, i.e. 1:NumTurbines. However, the FAST_AllocateTurbines routine in FAST_Library.f90 expects a starting index of 0.

As noted above, since closing the checkpoint file assumes a starting index of 1, the file is never closed when using the cpp interface. To fix this, I think we have three options:

  1. Change FAST.Farm to index turbines starting at 0,
  2. Change FAST_AllocateTurbines to start with a Fortran index of 1, and change amr-wind and other codes to match
  3. If FAST_AllocateTurbines is called, we could set an internal flag to correctly handle the turbine-number offset.

I'm inclined to pursue option 3, as it preserves the existing numbering systems for both cpp and FAST.Farm. I don't think it will be all that difficult to do in OpenFAST. I'll post here when I have a proposed solution in place.

marchdf commented 6 months ago

Hi @andrew-platt, thank you for looking into this! I appreciate you digging through it. Option 3 sounds good. I do still think (as I noted in the openfast issue) that there is an inconsistency: iTurb is passed to FAST_InitializeAll_T instead of TurbId, which is what the argument list in the function definition expects. But I probably don't understand the reasoning behind that or the intricacies of the coupling between all the Fortran and cpp codes. Anyway, happy to help and test potential solutions. Thanks!

andrew-platt commented 6 months ago

Proposed solution: https://github.com/OpenFAST/openfast/pull/2097

marchdf commented 6 months ago

Confirmed that this is fixed with https://github.com/OpenFAST/openfast/pull/2097