NOAA-GFDL / MOM6

Modular Ocean Model

Truncation Output Issue with FMS2: Fileset Flag and Filename Handling #736

Open ezhilsabareesh8 opened 6 days ago

ezhilsabareesh8 commented 6 days ago

In the current version of MOM6 with FMS2, truncation outputs are not correctly produced when the fileset flag is set to SINGLE_FILE. This is due to the following:

Fileset Flag Behavior: The mpp_open routine is no longer used in FMS2 (reference). When writing to an ASCII file, the default Fortran open routine is used instead. In MOM_PointAccel.F90, truncation outputs are overwritten when the fileset flag is SINGLE_FILE, leaving an empty truncation output file. The relevant code is here and here:

call open_ASCII_file(CS%u_file, trim(CS%u_trunc_file), action=APPEND_FILE, &
                     threading=MULTIPLE, fileset=SINGLE_FILE)

Setting fileset=MULTIPLE resolves the issue, but it opens multiple files with processor-specific filenames (e.g., V_velocity_truncations.1072).

Also, the file handle checks here and here need to be updated to if (CS%v_file == -1), since open(newunit=...) always returns negative file handles.
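A minimal sketch of why the sign test fails, assuming the handle is initialized to -1 as a "not yet opened" sentinel (the variable trunc_unit and the file name are hypothetical stand-ins, not the MOM6 code). The Fortran standard requires the value assigned by newunit= to be negative and never equal to -1, so a `< 0` test cannot distinguish an opened unit from an unopened one, while `== -1` can:

```fortran
program unit_sentinel_demo
  implicit none
  integer :: trunc_unit   ! hypothetical stand-in for CS%u_file / CS%v_file

  trunc_unit = -1         ! assumed sentinel: file has not been opened yet

  ! A test of "trunc_unit < 0" would be true both before and after the open,
  ! because newunit= always assigns a negative unit number.
  if (trunc_unit == -1) then
    open(newunit=trunc_unit, file="demo_truncations.txt", &
         action="write", position="append")
  endif

  ! After the open, the unit is still negative but no longer equal to -1.
  print *, "unit =", trunc_unit, " opened:", (trunc_unit /= -1)

  write(trunc_unit, '(A)') "example truncation record"
  close(trunc_unit)
end program unit_sentinel_demo
```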

Filename Handling in MOM_io_infra.F90: The declaration of the filename variable in open_ASCII_file leads to a crash related to memory handling here. Changing filename to a fixed length of 50 or more characters prevents the crash. The modifications are:

character(len=50) :: filename

and updating the inquire and open statements to use trim(filename) here and here.

inquire(file=trim(filename), exist=exists)
open(newunit=unit, file=trim(filename), action=trim(action_arg), position=trim(position_arg))

These changes resolve the file handling issues across multiple processors.

marshallward commented 5 days ago

@ezhilsabareesh8 Thanks for identifying and documenting these issues. As you probably know, FMS has deprecated mpp_open, and with it ASCII write I/O support, so this must be handled on the MOM6 side. But we still need to retain compatibility with the FMS1 API, so some extra work may be needed to get that working.

So far, I think the [uv]_file < 0 tests definitely need to be replaced, using [uv]_file == -1, assuming that FMS1 still works. I'm not yet able to replicate an error with filename as an allocatable, but I can believe there is a problem.

I'm not yet sure how to address the parallel writes. APPEND_FILE should ensure that we don't lose any content, but it's obviously not producing consistent output. But maybe we can deal with this after the other two have been fixed.

ezhilsabareesh8 commented 5 days ago

Thanks @marshallward for the response and confirming the issue.

  • I'm not clear why filename needs to be changed from allocatable to fixed length. L427 should implicitly allocate the string. Can you post an example and the error?

This could be happening because when the fileset is set to MULTIPLE, the filenames become quite large. I encountered the same error as you:

forrtl: severe (66): output statement overflows record, unit -5, file Internal Formatted Write

However, when I switched to a fixed-length filename (with trimming in the inquire and open calls), this error stopped occurring. This might suggest that the allocatable version was having trouble handling longer filenames, especially when each rank writes a separate file.
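For reference, a small sketch of the allocate-on-assignment behavior that the quoted question refers to, assuming filename is declared as a deferred-length allocatable string (the declaration and the input value here are illustrative assumptions, not copied from MOM_io_infra.F90). It does not reproduce the reported error; it only shows that a standard-conforming compiler sizes the string to fit the assigned value:

```fortran
program deferred_length_demo
  implicit none
  character(len=:), allocatable :: filename   ! assumed form of the declaration
  character(len=64) :: padded                 ! hypothetical fixed-length input

  ! A per-rank name of the kind produced with fileset=MULTIPLE.
  padded = "V_velocity_truncations.1072"

  ! Assignment allocates filename to exactly the length of the right-hand side
  ! (Fortran 2003 allocate-on-assignment), so no trailing blanks are stored.
  filename = trim(padded)

  print *, "allocated:", allocated(filename)
  print *, "length matches trimmed input:", (len(filename) == len_trim(padded))
end program deferred_length_demo
```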

  • After fixing the unit < 0 error, I can produce a useful truncation file, but it will sometimes contain duplicate entries. Almost certainly due to multiple ranks writing to the same file, but I am not sure why different ranks are reporting the same truncations. Is that what you see? Or are the problems more severe?

I haven't tested after fixing the file unit and setting fileset to SINGLE. In my case, when the file unit wasn’t corrected, the truncations were simply not written, and an empty truncation file was generated. So, I haven’t yet seen the issue with duplicate entries, but I agree this could be due to multiple ranks trying to write to the same file.

I suspect the problem with APPEND is that Fortran's default open routine doesn’t accept "APPEND" as a valid option for the action specifier. For example, in this line:

open(newunit=unit, file=trim(filename), action=trim(action_arg), &
     position=trim(position_arg))

Passing APPEND results in an invalid action, since Fortran's action specifier only supports READ, WRITE, or READWRITE. The old FMS mpp_io supported APPEND through the mpp_open call, but the Fortran open function doesn’t.

To append to the file, we may need to use the ACCESS specifier which supports values like APPEND, DIRECT, or SEQUENTIAL. You can refer to the Fortran open function specifiers and details here. This should ensure that multiple writes append correctly without losing content.

marshallward commented 4 days ago

IMO the best solution for the moment is to fix the [uv]_file < 0 checks. This would resolve the immediate problems and the truncation files would at least be usable. This still needs to be verified in the FMS1 API.

I believe that the filename length error is an uninitialized value being passed to open(), which is being misinterpreted as a massive string. We could resolve that in some way, but it would also never happen if the [uv]_file issue were fixed, since its value is associated with a nonempty file name. I would prefer to avoid a fixed length if possible.

Also note that APPEND is passed to the position argument, not to action. This also replicates the existing mpp_open() behavior. Using APPEND for access is an Intel extension and would not be standards compliant. (I believe this is why it is shown in green in your link.) IMO this is probably working as intended.
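A minimal single-process sketch of the standard-conforming form, with action="write" and position="append" (the file name is hypothetical, and this does not model concurrent writes from several ranks). Each open positions the file at its end, so earlier records are kept:

```fortran
program append_demo
  implicit none
  integer :: unit, i, nlines, ios
  character(len=80) :: line
  character(len=*), parameter :: fname = "append_demo.txt"   ! hypothetical name

  ! Start from an empty file so the line count below is reproducible.
  open(newunit=unit, file=fname, status="replace", action="write")
  close(unit)

  ! Reopen the file three times; position="append" keeps existing records.
  do i = 1, 3
    open(newunit=unit, file=fname, action="write", position="append")
    write(unit, '(A,I0)') "record ", i
    close(unit)
  enddo

  ! Count the records to confirm that nothing was overwritten.
  nlines = 0
  open(newunit=unit, file=fname, action="read")
  do
    read(unit, '(A)', iostat=ios) line
    if (ios /= 0) exit
    nlines = nlines + 1
  enddo
  close(unit)

  print *, "lines in file:", nlines
end program append_demo
```

With a single writer, all three records should survive. The duplicate or inconsistent entries discussed in this thread would come from several ranks appending to the same file at once, which this sketch does not exercise.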

The more challenging question is whether to produce a coherent single file or to juggle multiple per-rank files. Truncations are currently written as they happen, which avoids any buffering but also causes the concurrency issues described above. I also think it's not an urgent problem and can be sorted out later.

marshallward commented 2 days ago

PR #739 addresses the [uv]_file < 0 error.