NOAA-OWP / ngen

Next Generation Water Modeling Engine and Framework Prototype
Other
84 stars 62 forks source link

Fix NetCDF file close error running MPI with NetCDF forcing #819

Closed stcui007 closed 4 months ago

stcui007 commented 4 months ago

This PR provides a fix for running ngen in MPI with NetCDF forcing where a NetCDF file close error is generated at end of run for each MPI process. Fixes #749, #818.

Additions and Changes

  1. DataProvider::finalize() interface for end-of-run cleanup activities that need to be sequenced more specifically than destructor-call order
    1. NetCDFPerFeatureProvider::finalize() implementation of DataProvider interface to close files
    2. Make no-op nature of WrappedDataProvider::finalize() explicit with explanatory comment
    3. CsvPerFeatureForcingProvider: delete unused private members
  2. Move HY_CatchmentRealization::forcing member down to Catchment_Formulation, and remove constructor argument pass-through from HY_CatchmentArea accordingly.
  3. Catchment_Formulation::finalize() to call finalize() on its forcing DataProvider
  4. Formulation_Manager::finalize to call finalize() on all Catchment_Formulation-descendant formulations that it owns
  5. Formulation_Manger call to clean up cache of shared NetCDFPerFeatureProvider instances moved from its destructor to its new finalize()
  6. Main code calls Formulation_Manager::finalize(), to get calls down the stack

Testing

Run ngen with mpirun. Tests show that the fix removes the NetCDF file close error at the end of the mpirun using NetCDF forcing.

Checklist

Target Environment support

PhilMiller commented 4 months ago

@hellkite500 @donaldwj I rewrote the code changes, so I probably shouldn't also review it when it's ready.

PhilMiller commented 4 months ago

(@stcui007 I just rebased and force-pushed)

stcui007 commented 4 months ago

The rebased and revised version is tested to work.

PhilMiller commented 4 months ago

Per discussion with @robertbartel, WrappedDataProvider can probably rely on whatever other object owns the provider instance to which it points to call finalize() on it. Thus, I'm adding a no-op implementation there.

PhilMiller commented 4 months ago

In reviewing this PR, I think the hard thing will not be deciding whether the present additions and changes are correct, but whether there are related things that are missing.

PhilMiller commented 4 months ago

I would try testing on macOS, except my Homebrew-installed NetCDF presents linking errors.

PhilMiller commented 4 months ago

Ok, I didn't try to reproduce the original failure, but I can at least get a clean build and test run on macOS, by brew uninstall netcdf-cxx and using our built-in NetCDF C++ bindings instead.

robertbartel commented 4 months ago

@PhilMiller, I'm curious why you didn't make finalize a pure virtual in DataProvider.

PhilMiller commented 4 months ago

I gave a concrete empty implementation of DataProvider::finalize() because I expect that most implementations of DataProvider will not need to do anything.

Note that the multiple inheritance of class Catchment_Formulation : public Formulation, public HY_CatchmentArea is a bit of a nuisance here - Formulation inherits DataProvider, while HY_CatchmentArea inherits HY_CatchmentRealization, which owns a provider instance that potentially need to be finalized. So, really, the void finalize() override implementations are in a few places ambiguous which one it's actually supposed to cover. I'll see if I can make that explicit, either in syntax, or failing that, in comment.

PhilMiller commented 4 months ago

@stcui007 Could you test this code with a realization config that uses the NetCDF forcings, MPI, and a BMI Multi realization? I think that may still fail, since I don't think the implementation of Bmi_Multi_Formulation::finalize() I wrote will actually get called.

stcui007 commented 4 months ago

Will do.

On Fri, May 24, 2024 at 9:10 PM Phil Miller - NOAA @.***> wrote:

@stcui007 https://github.com/stcui007 Could you test this code with a realization config that uses the NetCDF forcings, MPI, and a BMI Multi realization? I think that may still fail, since I don't think the implementation of Bmi_Multi_Formulation::finalize() I wrote will actually get called.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/ngen/pull/819#issuecomment-2130665813, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACA4SRMG77TPHUP7D4PQWRTZD7XIZAVCNFSM6AAAAABHRFGQM6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZQGY3DKOBRGM . You are receiving this because you were mentioned.Message ID: @.***>

PhilMiller commented 4 months ago

Setting to draft mode until that further testing is complete, so that it doesn't get merged inadvertently.

stcui007 commented 4 months ago

The code has been tested again with MPI and NetCDF forcing and BMI realization config (all from ngen/data/ directory) and it still runs to completion without error. Prior to this PR, this combination produced a NetCDF file close error.

PhilMiller commented 4 months ago

Donald, thanks for reminding me to update the PR description. That's done, and PR is once again 'Ready'

donaldwj commented 4 months ago

Phil,

The pull request is merged

On Thu, May 30, 2024 at 12:13 PM Phil Miller - NOAA < @.***> wrote:

Donald, thanks for reminding me to update the PR description. That's done, and PR is once again 'Ready'

— Reply to this email directly, view it on GitHub https://github.com/NOAA-OWP/ngen/pull/819#issuecomment-2140320218, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF6KABBVOX7BOYLLFDMOLQDZE5M3PAVCNFSM6AAAAABHRFGQM6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBQGMZDAMRRHA . You are receiving this because you were mentioned.Message ID: @.***>

-- Donald W Johnson 205-347-1467 National Water Center Tuscaloosa AL