Add mean, min, and max to microclimatic variables

vgro commented 3 months ago

This PR introduces mean, minimum, and maximum variables for key microclimatic variables (air temperature, rh, vpd, soil temperature). Please let me know if that works for you before we roll it out to the whole model.

An alternative would be to add a new dimension to the variables which can be accessed via coordinates however, that seemed a bit messy.

Type of change

[x] New feature (non-breaking change which adds functionality)
[ ] Optimization (back-end change that speeds up the code)
[ ] Bug fix (non-breaking change which fixes an issue)

Key checklist

[x] Make sure you've run the pre-commit checks: $ pre-commit run -a
[ ] All tests pass: $ poetry run pytest

Further checks

[x] Code is commented, particularly in hard-to-understand areas
[x] Tests added that prove fix is effective or that feature works

vgro commented 3 months ago

Hi @alexdewar , I have started rolling this idea out but it gets very messy very quickly when I come to the mechanistic model. Basically all variables need to be times three if I want to do that which seems a bit impractical. Do you have a suggestion?

alexdewar commented 3 months ago

Is there a way you could make mean, min and max into another dimension (not sure if this is possible with xarray)? This way you'd have my_array["my_variable"]["mean"] etc. Alternatively, I imagine there's a way to get all dimensions starting with a particular string, which might be cleaner, but does feel a bit hacky...

Tagging @dalonsoa because he knows xarray a lot better than I do.

davidorme commented 3 months ago

@vgro Can you post an example of the issue you're having: it's a bit abstract at the moment?

vgro commented 3 months ago

Is there a way you could make mean, min and max into another dimension (not sure if this is possible with xarray)? This way you'd have my_array["my_variable"]["mean"] etc. Alternatively, I imagine there's a way to get all dimensions starting with a particular string, which might be cleaner, but does feel a bit hacky...

Tagging @dalonsoa because he knows xarray a lot better than I do.

I thought about this, too (see top comment), but we discussed it and thought it might be confusing. However, at that point we were not aware of the extent of the issue.

vgro commented 3 months ago

@vgro Can you post an example of the issue you're having: it's a bit abstract at the moment?

For example in the abiotic model, temperatures and VPD are an input to the energy balance. In the function calculate_leaf_and_air_temperature for example we select the reference temperature and VPD at the start

# Select variables for current time step and relevant layers
    topsoil_temperature = data["soil_temperature"][topsoil_layer_index]
    topsoil_moisture = (
        data["soil_moisture"][topsoil_layer_index]
        / -data["layer_heights"][topsoil_layer_index]
        / core_constants.meters_to_mm
    )
    air_temperature_ref = data["air_temperature_ref"].isel(time_index=time_index)
    vapour_pressure_ref = data["vapour_pressure_ref"].isel(time_index=time_index)
    atmospheric_pressure_ref = data["atmospheric_pressure_ref"].isel(
        time_index=time_index
    )

and all following functions use that. So all the conductivities, fluxes, factors etc that are returned by the abiotic model would have to include mean min max.

Looking at it now after some time makes me think there is some tidying up to do as well (repetitions), maybe this could be combined

dalonsoa commented 3 months ago

I think I'm with @davidorme here. I think.

Having mean, max and min as coordinates of another dimension (eg. stats) could look better initially, but is no really right you are not planning to operate on the whole array at once ever since the different components on that dimension mean different things. Also, all the time you will need to select the right stat to use with something like air_temperature.sel(stat="mean") (or whatever is appropriate for each operation), which is not really better than just using air_temperature_mean directly.

This is, of course, if you are not planning to operate on the whole array at once. If you are, such that you carry all operations you need only once and get at the same time the results for mean, max and min, then it makes total sense to make it a separate dimension orthogonal to space and time as that will save you a lot of time and code duplication - or loops.

vgro commented 3 months ago

I think I'm with @davidorme here. I think.

Having mean, max and min as coordinates of another dimension (eg. stats) could look better initially, but is no really right you are not planning to operate on the whole array at once ever since the different components on that dimension mean different things. Also, all the time you will need to select the right stat to use with something like air_temperature.sel(stat="mean") (or whatever is appropriate for each operation), which is not really better than just using air_temperature_mean directly.

This is, of course, if you are not planning to operate on the whole array at once. If you are, such that you carry all operations you need only once and get at the same time the results for mean, max and min, then it makes total sense to make it a separate dimension orthogonal to space and time as that will save you a lot of time and code duplication - or loops.

I will always do all operations for mean/min/max, so it would be convenient if it was all packaged up, but I see that this is not very accessible. Would it make sense to keep them separate in data but stack for operations, for example at the start of the model update, and unstack them again when writing the output in the data object? It's this function which we have still as an extra step between result and data :

output_variables = microclimate.run_microclimate(
            data=self.data,
            layer_roles=self.layer_structure.layer_roles,
            time_index=time_index,
            constants=self.model_constants,
            bounds=self.bounds,
        )
self.data.add_from_dict(output_dict=output_variables)

dalonsoa commented 3 months ago

I'm not sure about that. Stacking and unstacking might be an expensive operation, specially if it needs to be done for multiple variables in every timestep.

If you're operating on the three stats at once, then I'd make that another dimension and include in the relevant variable description (in the other PR), a detailed description of what they include - as well as the new axes.

vgro commented 3 months ago

I'm not sure about that. Stacking and unstacking might be an expensive operation, specially if it needs to be done for multiple variables in every timestep.

If you're operating on the three stats at once, then I'd make that another dimension and include in the relevant variable description (in the other PR), a detailed description of what they include - as well as the new axes.

haha ok, I guess we have come full circle with this :-) I think I would prefer the additional axis because it would safe a lot of extra variables and loops, but I also understand David's argument re readability. Since this is a rather technical task, would either of you @alexdewar @dalonsoa be willing to draft this?

davidorme commented 3 months ago

This is really tricky. Some thoughts/questions.

I think we really need to avoid expensive and frequent un/stacking operations, which is a big argument towards having an extra dimension. But...
If we define these variable names as having a stats dimension, then every model needs to be able to handle those, which is quite an overhead in complexity and indexing versus different names for each stat.
There is then a structural difference between stats variables and other variables: developers have to keep track of which is which.
Not all models need every stat - even within the same run - so there's extra complexity in the checking process. How does the new variables system validate the data loading? The answer is probably that we need to adjust the required_xyz_variables to include dimension index checks: something like ('air_temperature_ref', {'stats': ['mean', 'min', 'max']}).

So - I can see tripling the variable name space is a huge pain in the arse, but it is entirely within the scope of the existing architecture. Having an extra dimension feels like something that requires widespread refactoring.

davidorme commented 3 months ago

What's the big picture, @vgro? Why are we adding the min and max - what are you anticipating that will then use this estimated values? That's more of a prioritisation question than 'why bother' question! They're important variables but are they on the critical path to your next feature?

vgro commented 3 months ago

What's the big picture, @vgro? Why are we adding the min and max - what are you anticipating that will then use this estimated values? That's more of a prioritisation question than 'why bother' question! They're important variables but are they on the critical path to your next feature?

Both the animal and soil model are interested in mean and maximum temperatures, means for general dynamics and max because they are more limiting for the dynamics than the mean values. For the simple model, this means we put in external monthly mean, monthly mean max, and monthly mean min and run the model for the three 'scenarios' to give information about the diurnal range. Once we have the mechanistic abiotic model, we would like to run that (and maybe the plants) on a (sub-)daily timestep with (sub-)daily external inputs and then return some actual statistics to the soil and animals.

vgro commented 3 months ago

This is really tricky. Some thoughts/questions.

* I think we really need to avoid expensive and frequent un/stacking operations, which is a big argument towards having an extra dimension. But...

* If we define these variable names as having a `stats` dimension, then every model needs to be able to handle those, which is quite an overhead in complexity and indexing versus different names for each stat.

* There is then a structural difference between `stats` variables and other variables: developers have to keep track of which is which.

* Not all models need every stat - even within the same run - so there's extra complexity in the checking process. How does the new variables system validate the data loading? The answer is probably that we need to adjust the `required_xyz_variables` to include dimension index checks: something like `('air_temperature_ref', {'stats': ['mean', 'min', 'max']})`.

So - I can see tripling the variable name space is a huge pain in the arse, but it is entirely within the scope of the existing architecture. Having an extra dimension feels like something that requires widespread refactoring.

I would say every model can chose how they want to handle the stats. As discussed in the layer structure meeting, the soil will want some monthly stats along the z axis and calculate their own values for the relevant soil column, animals will want to look at mean and max temperatures in some way and for the plants it's temperature and vpd only (later maybe soil moisture) which you can handle as you like. I guess you would probably also need to run it with all three at the same time to return the required variables for the abiotic model, but that should not be complicated with numpy arrays?

I guess we would not have this problem if we were all on the same time step...

davidorme commented 3 months ago

I would say every model can chose how they want to handle the stats.

I don't disagree at all but having that stats axis does change the API across the whole system and that's tricky.

If this PR is moving us to a system where some variables always have to have min, mean and max, then we can implement a new axis validator to enforce those indexes. Then model users do just have to know about the axis requirement - that's not unreasonable and it will be documented 😄

What would be unworkable is if air_temperature_ref could sometimes be just the mean or could contain different sets of stats. I guess we could have alternative variables - like air_temperature_ref_mean and air_temperature_ref_stats - but that seems like a simply enormous bag of nope.

vgro commented 3 months ago

I would say every model can chose how they want to handle the stats.

I don't disagree at all but having that stats axis does change the API across the whole system and that's tricky.

If this PR is moving us to a system where some variables always have to have min, mean and max, then we can implement a new axis validator to enforce those indexes. Then model users do just have to know about the axis requirement - that's not unreasonable and it will be documented 😄

What would be unworkable is if air_temperature_ref could sometimes be just the mean or could contain different sets of stats. I guess we could have alternative variables - like air_temperature_ref_mean and air_temperature_ref_stats - but that seems like a simply enormous bag of nope.

I guess if some models build on the assumption of mean min and max being available, they should always be there for the required variables. They will always be input requirements in some way, for the simple abiotic we want monthly mean stats, for the mechanistic abiotic model we could follow other models and interpolate a diurnal cycle between the min and max, which would again require those as inputs. We could add something to fill the gaps if the data cannot be provided, like a diurnal range factor

davidorme commented 3 months ago

That sounds right - we could provide external data preparation tools to go from daily mean to VE ready inputs etc. Shouldn't need to be part of VE itself.

vgro commented 3 months ago

In an in person discussion, we have decided to go forward with the axis option. I will start that on a fresh branch.

ImperialCollegeLondon / virtual_ecosystem