Create xy samples dynamically from Data loaded into memory
Sorry, this is a huge PR where we have basically re-written the Engineer/DataLoaders/Models to work with data loaded into memory. This is better for hard-disk-constrained modelling problems where the seq_length is large (e.g. 365 daily timesteps as input to the LSTM models).
Use the Pipeline for working with runoff data.
data is 2D instead of 3D (station_id, time)
data is on smaller timesteps than monthly (daily)
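As a sketch of that data shape (station IDs and the variable name here are hypothetical), the runoff data is 2D (station_id, time) on a daily timestep, rather than the 3D (time, lat, lon) monthly data used elsewhere in the pipeline:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical runoff dataset: 2D (station_id, time) at a daily resolution,
# in contrast to 3D (time, lat, lon) monthly data.
times = pd.date_range("2000-01-01", periods=365, freq="D")
station_ids = [1001, 1002, 1003]

runoff = xr.Dataset(
    {"discharge": (("station_id", "time"),
                   np.random.rand(len(station_ids), len(times)))},
    coords={"station_id": station_ids, "time": times},
)
```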
create dynamic engineer
create dynamic dataloader
update the EALSTM / Neural Networks to work with DynamicDataLoaders
new arguments to models = 'seq_length', 'target_var', 'forecast_horizon'

We have created an experiment file for running the OneTimestepForecast Runoff modelling: scripts/experiments/18_runoff_init.py

Analysis updates
We have added some updates to the analysis code:
overview: update all rmse/r2 functions to calculate spatial scores (a score for each spatial unit) and temporal scores (a time series of scores for each station)
add more catching of the inversion problem (it turns out this occurs when the order of lat, lon is reversed -> lon, lat)
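A minimal sketch of what the spatial/temporal split of the scores could look like (the function names are hypothetical; assumes xarray DataArrays with a time dimension):

```python
import numpy as np
import xarray as xr

def spatial_rmse(pred: xr.DataArray, true: xr.DataArray) -> xr.DataArray:
    # One score per spatial unit: collapse the time dimension only.
    return np.sqrt(((pred - true) ** 2).mean(dim="time"))

def temporal_rmse(pred: xr.DataArray, true: xr.DataArray) -> xr.DataArray:
    # A time series of scores: collapse every dimension except time.
    spatial_dims = [d for d in pred.dims if d != "time"]
    return np.sqrt(((pred - true) ** 2).mean(dim=spatial_dims))

# 3 stations x 10 timesteps of dummy data
pred = xr.DataArray(np.zeros((3, 10)), dims=("station_id", "time"))
true = xr.DataArray(np.ones((3, 10)), dims=("station_id", "time"))
per_station = spatial_rmse(pred, true)  # one value per station
per_time = temporal_rmse(pred, true)    # one value per timestep
```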
Engineer updates
Create new engineer OneTimestepForecast - src/engineer/one_timestep_forecast.py
Created a new DynamicEngineer for use with the DynamicDataLoader
NOTE do we want this or do we ideally want to generalise the one_month_forecast?
Major difference is collapsing things not by lat, lon but by dimension_name = [c for c in static_ds.coords][0]
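A small illustration of that coordinate-inference trick (the static dataset here is hypothetical):

```python
import numpy as np
import xarray as xr

# Hypothetical static dataset keyed by station_id rather than lat/lon.
static_ds = xr.Dataset(
    {"elevation": ("station_id", np.array([120.0, 45.0, 300.0]))},
    coords={"station_id": [1001, 1002, 1003]},
)

# Instead of hard-coding lat/lon, collapse over whatever the
# first coordinate of the static dataset happens to be:
dimension_name = [c for c in static_ds.coords][0]
```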
DataLoader Updates
self.get_reducing_dims to get the spatial dimensions (either latlon or area or station_id or whatever is not time!)
aggregations collapse over these reducing dimensions, e.g. global_mean = x.mean(dim=reducing_dims)
build_loc_to_idx_mapping building a dictionary to ensure we can track what id relates to what spatial unit
Various examples of if len(static_np.shape) == 3: having to account for 2D spatial information (time, lat, lon) or 1D spatial information (time, station_id)
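A sketch of how these two helpers could work (the implementations here are illustrative, not the actual src code), covering both the 2D (time, lat, lon) and 1D (time, station_id) spatial cases:

```python
import numpy as np
import xarray as xr

def get_reducing_dims(ds: xr.Dataset) -> list:
    # The spatial dims are whatever is not time (latlon, area, station_id, ...).
    return [d for d in ds.dims if d != "time"]

def build_loc_to_idx_mapping(ds: xr.Dataset, spatial_dim: str) -> dict:
    # Track which integer index corresponds to which spatial unit.
    return {loc: idx for idx, loc in enumerate(ds[spatial_dim].values)}

# 2D spatial case: (time, lat, lon)
pixel_ds = xr.Dataset({"precip": (("time", "lat", "lon"), np.ones((4, 2, 3)))})
reducing_dims = get_reducing_dims(pixel_ds)
global_mean = pixel_ds.mean(dim=reducing_dims)  # one value per timestep

# 1D spatial case: (station_id,)
station_ds = xr.Dataset(coords={"station_id": [1001, 1002, 1003]})
loc_to_idx = build_loc_to_idx_mapping(station_ds, "station_id")
```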
TODO:
# TODO: why so many static nones?
This is because the standard deviation of some of the values stored in the normalizing_dict becomes 0, so dividing by 0 gives np.nan
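A minimal reproduction of the problem, with one possible guard (the guard is a suggestion, not what the code currently does):

```python
import numpy as np

# Hypothetical normalizing_dict entry where the second feature is constant,
# so its standard deviation is 0.
mean = np.array([10.0, 5.0])
std = np.array([2.0, 0.0])

x = np.array([12.0, 5.0])
with np.errstate(divide="ignore", invalid="ignore"):
    normed = (x - mean) / std  # second entry is 0/0 -> np.nan

# Possible guard: replace zero stds before dividing,
# so constant features normalize to 0 instead of nan.
safe_std = np.where(std == 0, 1.0, std)
normed_safe = (x - mean) / safe_std
```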
Model updates
new model arguments: seq_length / include_timestep_aggs
use a dataloader to load in timesteps: for x, y in tqdm.tqdm(train_dataloader):
include_monthly_aggs -> include_timestep_aggs = spatial aggregation (map of mean values for that pixel)
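A sketch of the timestep-batched training loop (the generator here is a stand-in for the real DynamicDataLoader, and the loss is a dummy value):

```python
import numpy as np

try:
    from tqdm import tqdm
except ImportError:  # fall back to a no-op progress bar
    def tqdm(iterable, **kwargs):
        return iterable

# Stand-in for the DynamicDataLoader: yields (x, y) batches.
def train_dataloader():
    for _ in range(3):
        x = np.random.rand(32, 365, 5)  # (batch, seq_length, n_features)
        y = np.random.rand(32, 1)
        yield x, y

losses = []
for x, y in tqdm(train_dataloader()):
    # the model forward/backward pass would go here;
    # we just record a dummy per-batch loss
    losses.append(float(((y - y.mean()) ** 2).mean()))
```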
NOTE: EarlyStopping is currently not working because I haven't created a train/validation split.