Temporal dimension in calc_<> functions

eva0marques commented 3 months ago

I am writing process and calc functions for other covariates that I need in my own project. I would like to open a discussion on the spatio x temporal case.

Let's say I want to build an AI model to predict temperature at several locations x timestamps. I need to extract spatial covariates (easy) but also spatio-temporal ones.

In my ideal world, to do so:

1. I create a SpatVector, data.frame, or sf/sftime object with both geometry and time columns and pass it as the locs parameter.
2. I use calc_ functions to add one column per covariate (spatial or spatio-temporal). The calc_ functions for spatio-temporal covariates handle the time dimension according to the user's criteria. For example, if geophysical model outputs are available every 3 days and my predictions are daily, calc_ downscales the temporal resolution; it can also do the opposite if I have hourly data.

It would look like this:

my_spacetime_sample |>
  calc_era5() |>
  calc_nlcd() |>
  calc_gmted() |>
  ...

For now, calc_ functions are not well designed for the temporal dimension. They assume that locs is a spatial data frame without a time column. When calculating spatio-temporal covariates, they extract the entire time series in from. But if locs already has a time column (e.g. one created by a previous spatio-temporal calc_ call), it becomes a mess.

In summary, I see the following limitations in our current version of calc_:

- We cannot chain several calc_ functions (i.e., pass the output of one calc_ function as the input of another) once spatio-temporal covariates are involved.
- Unlike the spatial dimension, the temporal dimension cannot be fine-tuned during extraction.
- The user still has a lot of work to do to merge all covariates into a single spatio-temporal table, especially when the covariates are not indexed in time in the same way.

It is not urgent of course, but I think it would be interesting to address this discussion in the future for a better use of amadeus.

eva0marques commented 3 months ago

My suggestions to improve this situation:

- Add a time_column parameter to the calc_ functions for spatio-temporal covariates (narr, geos, hms, gridmet, terraclimate). It would be a character string naming the time column in locs.
- Check that time_column exists in locs (I would also rename locs to sample or points, or something more general than explicitly spatial) and that its format is correct (e.g. POSIXct with date and time).
- Add a parameter for the time extraction preference (nearest, downscale, mean, median, precedent, following...).
- Create a function that extracts values at each timestamp using the chosen method:
  1. extract the full time series at each loc
  2. create and use a function find_time(time_pts, time_cov, method)
  3. for each loc x time, extract the value at the corresponding covariate date.
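
A minimal sketch of what find_time(time_pts, time_cov, method) could look like, in base R only. This is a hypothetical helper, not an existing amadeus function. It assumes both time vectors are POSIXct and covers only three of the methods listed above (nearest, precedent, following); downscale, mean, and median would need a different return shape and are left out.

```r
# Hypothetical helper (not part of amadeus): for each requested time, return
# the index of the covariate time step to use, or NA if none qualifies.
find_time <- function(time_pts, time_cov,
                      method = c("nearest", "precedent", "following")) {
  method <- match.arg(method)
  stopifnot(inherits(time_pts, "POSIXct"), inherits(time_cov, "POSIXct"))
  pts <- as.numeric(time_pts)
  cov <- as.numeric(time_cov)
  vapply(pts, function(p) {
    idx <- switch(
      method,
      # closest covariate time step, before or after
      nearest = which.min(abs(cov - p)),
      # latest covariate time step at or before the requested time
      precedent = {
        ok <- which(cov <= p)
        if (length(ok) == 0L) NA_integer_ else ok[which.max(cov[ok])]
      },
      # earliest covariate time step at or after the requested time
      following = {
        ok <- which(cov >= p)
        if (length(ok) == 0L) NA_integer_ else ok[which.min(cov[ok])]
      }
    )
    as.integer(idx)
  }, integer(1))
}

# Toy case: covariate available every 3 days, predictions wanted on other days.
time_cov <- as.POSIXct(c("2024-01-01", "2024-01-04", "2024-01-07"), tz = "UTC")
time_pts <- as.POSIXct(c("2024-01-02", "2024-01-03", "2024-01-05"), tz = "UTC")
cov_values <- c(10, 20, 30)  # made-up covariate values at one location

idx <- find_time(time_pts, time_cov, method = "nearest")
matched <- cov_values[idx]   # covariate value assigned to each requested time
```

Returning indices rather than values keeps the helper independent of how the time series was extracted: the same index vector can subset a numeric vector, matrix columns, or raster layers.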

mitchellmanware commented 2 months ago

- Temporal summaries + download inputs
- Data frame "inflation" for static spatial variables
- Synchronize calc_* functions so that the output of calc_1 is used as locs in calc_2
```
calc_1() |>
calc_2() |>
calc_3()
```

eva0marques commented 2 months ago

In calc_ pipes it would be easier to distinguish spatio-temporal points from spatial points 🤔 (and possibly include an inflate function to go from the spatial pipe to the spatio-temporal one):

If the goal is to create a datatable to feed AI models:

my_spatial_sample |>
  calc_nlcd() |>
  calc_gmted() |>
  ... |>
  inflate_to_spatiotemporal(timestamps) |>
  calc_era5() |>
  calc_modis() |>
  ...
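
For illustration, the inflation step in the pipe above could be sketched as follows. inflate_to_spatiotemporal() is a hypothetical function here, not something amadeus provides, and the sketch assumes locs is a plain data.frame and that the new time column is called "time"; a SpatVector or sf input would need extra handling.

```r
# Hypothetical sketch (not an amadeus function): repeat every row of a
# spatial table once per timestamp and add a time column.
inflate_to_spatiotemporal <- function(locs, timestamps, time_column = "time") {
  stopifnot(is.data.frame(locs), !time_column %in% names(locs))
  n_locs <- nrow(locs)
  n_times <- length(timestamps)
  # Stack the full table n_times times, then label each copy with its timestamp
  out <- locs[rep(seq_len(n_locs), times = n_times), , drop = FALSE]
  out[[time_column]] <- rep(timestamps, each = n_locs)
  rownames(out) <- NULL
  out
}

# Made-up sample: two sites with one static covariate already attached
my_spatial_sample <- data.frame(
  site_id = c("A", "B"),
  lon = c(-78.9, -79.1),
  lat = c(36.0, 35.9),
  elevation = c(120, 95)
)
timestamps <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))

my_spatiotemporal_sample <- my_spatial_sample |>
  inflate_to_spatiotemporal(timestamps)
```

Static covariates are computed once per location and only duplicated at the inflation step, so the expensive extraction work does not grow with the number of timestamps.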

If the goal is to store the calculated points efficiently:

my_spatial_sample |>
  calc_nlcd() |>
  calc_gmted() |>
  ... |>
  saveRDS(file = "my_spatial_sample.rds")  # base R uses saveRDS(); file name is a placeholder

my_spatiotemporal_sample |>
  calc_era5() |>
  calc_modis() |>
  ... |>
  saveRDS(file = "my_spatiotemporal_sample.rds")  # base R uses saveRDS(); file name is a placeholder

mitchellmanware commented 2 months ago

I think one option is to give the static calc functions an inflate parameter. If inflate = TRUE, the function automatically returns a spatio-temporal data frame (the "feed AI models" example), whereas if inflate = FALSE it returns a list with a vector of dates and a single spatial data frame (the "efficiency" example).

Either way, refactoring the calc_ functions to retain columns from locs so they can be used in a pipe should not be too difficult.
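
As a rough illustration of retaining locs columns, here is a base R sketch in which the extraction step returns only an id and the new covariate, and the result is joined back onto locs. calc_demo() and its arguments are hypothetical stand-ins, and the covariate values are fake; the real calc_ functions differ.

```r
# Hypothetical stand-in for a calc_ function (not the amadeus implementation).
calc_demo <- function(locs, name, locs_id = "site_id") {
  # Pretend extraction: returns only the id and the new covariate (fake values)
  extracted <- data.frame(id = locs[[locs_id]], value = seq_len(nrow(locs)))
  names(extracted) <- c(locs_id, name)
  # Join back on the id so every column already present in locs survives
  merge(locs, extracted, by = locs_id, sort = FALSE)
}

locs <- data.frame(site_id = c("A", "B"), lon = c(-78.9, -79.1), lat = c(36.0, 35.9))

out <- locs |>
  calc_demo("cov_1") |>
  calc_demo("cov_2")
```

Because each call returns its input plus one column, the output of one call is a valid locs for the next.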

mitchellmanware commented 2 months ago

Something like this

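# dates: vector of dates; locs_df: spatial data frame with static covariates
if (inflate) {
  message("Returning a data.frame with ... because inflate = TRUE")
  # No shared column names, so merge() returns every date x location pair
  inflated <- merge(dates, locs_df, all = TRUE)
  return(inflated)
} else {
  message("Returning a list with ... because inflate = FALSE")
  return(list(dates, locs_df))
}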

eva0marques commented 2 months ago

Yes, that is also an interesting solution. I would still make the inflate() function available to amadeus users because they might want to use it separately. For example, you store the non-inflated sample, reopen it, and use the inflate function without recalculating everything.

sigmafelix commented 2 months ago

@eva0marques

Sorry I am late to the discussion. As @mitchellmanware suggested, I think a hands-on solution would be to add several lines to calc_return_locs, along with an inflate argument. One thing to consider is how "full" space-time combinations are inferred or furnished. This can be implemented either by using a fixed set of field names (i.e., lon, lat, and time) or by adding an argument for a full space-time combination template (built with expand.grid, for example). I think the former is the more hands-on solution, since we can easily use set operations to detect the common field names and determine what to join and what to expand. I have already added some functions to do this in beethoven, so I'd be happy to make changes to whichever functions we agree to update to implement this functionality.
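
A small base R sketch of the fixed-field-name idea, for illustration only: join_covariates() is a hypothetical helper (it is not the beethoven code mentioned above), and the field names lon, lat, and time are assumed to be fixed as described. The full space-time template is built with expand.grid.

```r
# Fixed field names assumed for this sketch
key_fields <- c("lon", "lat", "time")

# Hypothetical helper: join two covariate tables on whichever key fields
# they share. A table without "time" is repeated over every time step.
join_covariates <- function(x, y) {
  by <- intersect(intersect(names(x), names(y)), key_fields)
  stopifnot(length(by) > 0L)
  merge(x, y, by = by, all = TRUE)
}

sites <- data.frame(lon = c(1, 2), lat = c(10, 20))
times <- as.Date(c("2024-01-01", "2024-01-02"))

# Full space-time template: every site at every time step
grid <- expand.grid(site = seq_len(nrow(sites)), time = times)
template <- cbind(sites[grid$site, ], time = grid$time)
rownames(template) <- NULL

# Made-up covariates: one spatio-temporal, one purely spatial
dynamic <- template
dynamic$temp <- c(5.1, 6.3, 4.8, 6.0)
static <- cbind(sites, elev = c(100, 200))

joined <- join_covariates(dynamic, static)
```

Here the shared keys are lon and lat, so the static elevation is expanded across both dates without any explicit inflate call.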

sigmafelix commented 2 months ago

As a side note, if we want the calc_* functions to be pipeable, the default value of inflate (or the equivalent argument) should be TRUE.

eva0marques commented 2 months ago

I've implemented my idea (my comment above) in my own project because it was the most optimized and flexible setup. It works pretty well, and I can share my feedback if you are interested.

NIEHS / amadeus

Temporal dimension in calc_<> functions #112