SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
625 stars 280 forks source link

Constrained load on MeshCoord bypasses callback #5939

Open jrackham-mo opened 2 months ago

jrackham-mo commented 2 months ago

🐛 Bug Report

When applying a constraint on a MeshCoord and a callback which checks the loaded cube has a mesh, the callback is not applied on the returned cube.

How To Reproduce

Steps to reproduce the behaviour:

  1. Define a custom callback function which raises an exception if a cube does not have a mesh.
  2. Attempt to load UGrid data with a constraint on latitude (a MeshCoord) and pass the custom callback.
  3. The cube is loaded, without a mesh. The MeshCoords have been converted to AuxCoords as expected, but no exception was raised.

Expected behaviour

I expected an exception to have been raised by the custom callback function, since the returned cube has no mesh.

Reproducible example

import iris
from iris.experimental.ugrid import PARSE_UGRID_ON_LOAD
import iris.exceptions
from iris import sample_data_path

def ugrid_load(uris, constraints=None):
    """Load a UGrid file. Will not load if the file contains regular data.

    Parameters
    ----------
    uris : Any
        Location of file to load. Must be a NetCDF file.
    constraints: Any | None
        One or more iris constraints. Each constraint can be either a string,
        or an instance of :class:`iris.Constraint`. If the constraint is a string
        it will be used to match against cube.name().

    Returns
    -------
    iris.cube.CubeList
        An iris :class:`iris.cube.CubeList` containing the loaded data.

    Raises
    ------
    iris.exceptions.InvalidCubeError
        If UGrid data cannot be loaded from the file.
    """
    with PARSE_UGRID_ON_LOAD.context():
        ugrid_cubelist = iris.load(uris=uris, constraints=constraints, callback=mesh_check)
    return ugrid_cubelist

def mesh_check(cube, field, filename):
    """Check if the loaded data is UGrid or not.

    A function to be passed to callback in iris.load.

    Parameters
    ----------
    cube : iris.cube.Cube
        Loaded :class:`iris.cube.CubeList` to be checked
    field : Any
        Only present to satisfy the function signature used by callback.
    filename : Any
        Original file from which the cube data was loaded

    Raises
    ------
    iris.exceptions.InvalidCubeError
        If the loaded cube does not contain UGrid data
    """
    if cube.mesh is None:
        raise iris.exceptions.InvalidCubeError(
            f"Attempting to load from file '{filename}' did not return UGrid data."
        )

if __name__ == "__main__":
    latitude_constraint = iris.Constraint(latitude=lambda cell: cell > 0)

    # Try with non-UGrid netcdf first to confirm it works
    non_ugrid_path = sample_data_path("atlantic_profiles.nc")
    try:
        ugrid_load(non_ugrid_path, latitude_constraint)
    except iris.exceptions.InvalidCubeError as exc:
        print(exc)

    # Now try constraining on a MeshCoord on load of a file containing UGrid data
    # Expected to raise an exception since the returned cube has no mesh due to the constraint
    # having converted the MeshCoords to AuxCoords
    mesh_C4_path = sample_data_path("mesh_C4_synthetic_float.nc")
    ugrid_cubelist = ugrid_load(mesh_C4_path, latitude_constraint)
    if len(ugrid_cubelist) != 0:
        print(f"Data successfully loaded from '{mesh_C4_path}' with constraint '{latitude_constraint}'.")
        print(ugrid_cubelist[0])
        print(f"Cube's mesh is: {ugrid_cubelist[0].mesh}")

Output:

Attempting to load from file '.../iris_sample_data/sample_data/atlantic_profiles.nc' did not return UGrid data.
Data successfully loaded from '.../iris_sample_data/sample_data/mesh_C4_synthetic_float.nc' with constraint 'Constraint(coord_values={'latitude': <function <lambda> at 0x7fd980bc0310>})'.
synthetic / (1)                     (-- : 32)
    Auxiliary coordinates:
        latitude                        x
        longitude                       x
    Attributes:
        NCO                         'netCDF Operators version 4.7.5 (Homepage = http://nco.sf.net, Code = h ...'
        history                     'Mon Apr 12 01:44:41 2021: ncap2 -s synthetic=float(synthetic) mesh_C4_synthetic.nc ...'
        nco_openmp_thread_number    1
Cube's mesh is: None

Environment

pp-mo commented 1 month ago

I think the problem here is really the logic of what callbacks are for + what order they happen in. It's complicated because most constraints get applied after the callback, but certain special cases get "shortcut" to apply to the load process before the callback. ( specifically name constraints to speedup um files loading, and netcdf loading ) See : https://github.com/SciTools/iris/issues/3944

hdyson commented 1 month ago

So does this mean that using a callback to validate the data being loaded is a bad idea? Rather than "validate during load", should we doing an approach of "load then validate"?

pp-mo commented 1 month ago

So does this mean that using a callback to validate the data being loaded is a bad idea? Rather than "validate during load", should we doing an approach of "load then validate"?

Well I guess it depends what you're trying to achieve. The load callback sits in the "middle" of the load process, and is there principally to capture metadata changes to a "raw" cube at a point where its context in the original sourcefile is still available, and before raw cubes are merged and constrained. As it is described in the User Guide. ( Except for the "constraint shortcutting" just described that is 🤔 )

So maybe it isn't really appropriate for this purpose. What was the original thinking, why you do it this way ?

hdyson commented 1 month ago

What was the original thinking, why you do it this way ?

For our work on UG-ANTS, we want to validate that people are loading cubes with meshes for the applications that should be working with UGrid data, and that there is no mesh for applications that should be working with regular grid data.

I'd like the validation to be on load or at worst immediately after load: it gives us a chance to give a clear error message (a message saying "loading wrong type of data" is more useful than a message saying "tried to access a mesh but couldn't find it" - the first is definitely a load problem; the second could be a problem in subsequent processing), and it gives the error message before any expensive or time consuming processing has occurred.

A callback seemed a good way to ensure the load and the validation were "simultaneous" (I think it may have been yourself that suggested a callback?), but it's starting to look like the combination of callback and constraint is cryptic enough that we may want to go "load then validate" rather than "validate during load".