CWorthy-ocean / roms-tools

Tools for creating input files for UCLA-ROMS simulations
https://roms-tools.readthedocs.io
Apache License 2.0
11 stars 5 forks source link

Change `.ds` property from `datatree` to `xarray.Dataset` #82

Closed NoraLoose closed 2 months ago

NoraLoose commented 3 months ago

After yesterday's discussion we landed on the conclusion that we should move away from datatree for the .ds property and instead use an xr.Dataset that includes both physical and BGC variables.

However, there is the complication that both the physical and BGC boundary variables need a dimension and coordinate bry_time (since this is hardcoded in ROMS), but the physical vs. BGC bry_time have different lengths and hold different frequencies. Given this complication, I can see three options for how to move forward.

Option A

The object

boundary_forcing = BoundaryForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    physics_source={"name": "GLORYS", "path": path},
    bgc_source={"name": "CESM_REGRIDDED", "path": path, "climatology": True},
)

has a single attribute .ds that holds both physical and BGC variables. To avoid a name clash, the physical variables have the dimension and coordinate bry_time, whereas the BGC variables have the dimension and coordinate bry_time_bgc. The .save() method will then split the physical and BGC variables into two separate xarray.Datasets and rename bry_time_bgc to bry_time before saving to NetCDF.

Option B

The object

boundary_forcing = BoundaryForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    physics_source={"name": "GLORYS", "path": path},
    bgc_source={"name": "CESM_REGRIDDED", "path": path, "climatology": True},
)

has two attributes, .ds and .ds_bgc. Each of these holds an xarray.Dataset with the physical and BGC variables, respectively. Both of these datasets can have the bry_time dimension and coordinate, and the .save() method will just save these two datasets to two separate NetCDF files.

Note that with this option, if the user skips the physics_source, i.e., does

boundary_forcing = BoundaryForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    bgc_source={"name": "CESM_REGRIDDED", "path": path, "climatology": True},
)

there won't be an attribute .ds (as for the other classes like Grid, TidalForcing) but only an attribute .ds_bgc, which could potentially lead to confusion.

Option C

We force the user to process physical and BGC variables separately, i.e., via

physical_boundary_forcing = BoundaryForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    physics_source={"name": "GLORYS", "path": path},
)

bgc_boundary_forcing = BoundaryForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    bgc_source={"name": "CESM_REGRIDDED", "path": path, "climatology": True},
)

and then both physical_boundary_forcing and bgc_boundary_forcing have a single attribute .ds that holds the respective variables with dimension bry_time.

If the user tries to process both at the same time

boundary_forcing = BoundaryForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    physics_source={"name": "GLORYS", "path": path},
    bgc_source={"name": "CESM_REGRIDDED", "path": path, "climatology": True},
)

they will get an error that tells them to specify either physics_source or bgc_source, but not both at the same time. This may be confusing for the user because the behavior is opposite to what happens for the initial conditions. For the initial conditions you would not be able to treat physics_source and bgc_sourceseparately; if you only specified a bgc_source, but no physics_source, you would get an error.

It would be great to get your opinions on what option you think is best @dafyddstephenson @TomNicholas @ubbu36 @abigale-wyatt.

Whatever option we decide on, I will implement the same for the surface forcing, even though we don't have the problem there that the time variable name is shared among physical and BGC variables. But I think it will be good to treat the SurfaceForcing class the same way as the BoundaryForcing class, to not confuse the user even more.

dafyddstephenson commented 3 months ago

Option B is how I imagined it but reading this I think I prefer C. I think my preference is purely biased towards my experience with MATLAB roms-tools though, I'm not sure I can channel the perspective of a first-time user.

If we were to go B though I wonder if it makes sense for a single .ds to return a list of datasets rather than having a separate.ds and .ds_bgc? Or perhaps some combination of B and C where the final example you present here (with two X_source parameters) could return a list, but one X_source parameter defaults to returning one dataset? If we assume widespread adoption in the ROMS community, I imagine most users will only be interested in physics modelling.

TomNicholas commented 3 months ago

If we were to go B though I wonder if it makes sense for a single .ds to return a list of datasets rather than having a separate.ds and .ds_bgc?

If you're going to have .ds return anything other than xr.Dataset, you may as well have it return a DataTree.

I think I prefer C. Conceptually there are two different boundary forcing datasets needed to force the simulation, so you have to use the BoundaryForcing class twice. That seems reasonable to me. The only weird part about it is physics_source vs bgc_source, as opposed to e.g. source= and type={'physics', 'bgc'}.

ubbu36 commented 3 months ago

I'm also leaning towards C I think. One question I had reading through this is what happens in option A and B if the user wants to set up a physics-only run?

NoraLoose commented 3 months ago

Thanks for your comments! I think I also favor Option C!

The only weird part about it is physics_source vs bgc_source, as opposed to e.g. source= and type={'physics', 'bgc'}.

True, let's do

boundary_forcing = BoundaryForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    source={"type": "physics", "name": "GLORYS", "path": path},
    #source={"type": "bgc", "name": "CESM_REGRIDDED", "path": path, "climatology": True}  # alternatively
)

and

surface_forcing = SurfaceForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    source={"type": "physics", "name": "ERA5", "path": path},
    #source={"type": "bgc", "name": "CESM_REGRIDDED", "path": path}  # alternatively
)

@TomNicholas What naming convention do you recommend for the initial conditions, though? Currently, we have

surface_forcing = SurfaceForcing(
    grid=grid,
    start_time=start_time,
    end_time=end_time,
    physics_source={"name": "GLORYS", "path": path},
    bgc_source={"name": "CESM_REGRIDDED", "path": path, "climatology": True}  # optional
)

Keep the parameter names physics_source and bgc_source? Note that for the initial conditions we need to process these two together if BGC is to be included.