NIEHS / amadeus

https://niehs.github.io/amadeus/

Temporal dimension in calc_<> functions #112

Open eva0marques opened 3 months ago

eva0marques commented 3 months ago

I am writing process and calc functions for other covariates that I need in my own project. I would like to open a discussion on the spatio x temporal case.

Let's say I want to build an AI model to predict temperature at several locations x timestamps. I need to extract spatial covariates (easy) but also spatio-temporal ones.

In my ideal world, to do so:

  1. I create a SpatVector or data.frame or sf/sftime with both geometry and time columns to give as locs param
  2. I use calc_ functions to add a column for each covariate (spatial or spatio-temporal). The calc_ functions for spatio-temporal covariates handle the "time" dimension properly, according to the user's criteria. For example, if geophysical model outputs are available every 3 days and my predictions are daily, calc_ downscales the temporal resolution. It can also do the opposite (aggregate) if the source data are hourly.

It would look like this:

my_spacetime_sample |>
  calc_era5() |>
  calc_nlcd() |>
  calc_gmted() |>
  ...
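A minimal sketch of the temporal handling described above, in base R. The helper `align_to_timestamps()` and the sample values are hypothetical and not part of amadeus. It assumes "downscaling" means carrying the most recent available value forward to each target timestamp, and that the opposite direction is a daily mean of hourly values; a real calc_ function could use other criteria.

```r
# Hypothetical helper: for each target timestamp, pick the most recent
# covariate value at or before it (last observation carried forward).
align_to_timestamps <- function(cov_time, cov_value, target_time) {
  stopifnot(length(cov_time) == length(cov_value))
  ord <- order(cov_time)
  # findInterval() returns 0 for targets earlier than the first observation
  idx <- findInterval(as.numeric(target_time), as.numeric(cov_time[ord]))
  idx[idx == 0] <- NA_integer_
  cov_value[ord][idx]
}

# Covariate available every 3 days, predictions wanted every day
cov_time <- as.Date("2022-01-01") + c(0, 3, 6)
cov_value <- c(10, 13, 16)
daily <- as.Date("2022-01-01") + 0:7

aligned <- data.frame(
  time = daily,
  value = align_to_timestamps(cov_time, cov_value, daily)
)

# Opposite direction: hourly values summarised to daily means
hourly_time <- as.POSIXct("2022-01-01 00:00", tz = "UTC") + 3600 * 0:47
hourly_value <- rep(c(1, 3), each = 24)
daily_mean <- aggregate(
  list(value = hourly_value),
  by = list(time = as.Date(hourly_time, tz = "UTC")),
  FUN = mean
)
```

Carrying values forward is only one possible rule; interpolation or nearest-in-time matching would fit the same interface.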

For now, calc_ functions are not optimally designed for the temporal dimension. They implicitly assume that locs is a spatial data frame without a time column. When calculating a spatio-temporal covariate, the function extracts the entire time series from `from`. But if locs already has a time column (e.g., one created when calculating another spatio-temporal covariate), it becomes a mess.

As a summary, I see the following limitations with our current version of calc_ :

It is not urgent of course, but I think it would be interesting to address this discussion in the future for a better use of amadeus.

eva0marques commented 3 months ago

My suggestions to improve this situation:

mitchellmanware commented 2 months ago

  1. Temporal summaries + download inputs
  2. Data frame "inflation" for static spatial variables
  3. Synchronize calc_* functions so that the output of calc_1 is used as locs in calc_2

    calc_1() |>
    calc_2() |>
    calc_3()

eva0marques commented 2 months ago

In calc_ pipes, it would be easier to distinguish spatiotemporal points from spatial points 🤔 (and possibly include an inflate function to go from the spatial pipe to the spatiotemporal one):

If the goal is to create a datatable to feed AI models:

my_spatial_sample |>
  calc_nlcd() |>
  calc_gmted() |>
  ... |>
  inflate_to_spatiotemporal(timestamps) |>
  calc_era5() |>
  calc_modis() |>
  ...

If the goal is to store the calculated points efficiently:

my_spatial_sample |>
  calc_nlcd() |>
  calc_gmted() |>
  ... |>
  writeRDS()

my_spatiotemporal_sample |>
  calc_era5() |>
  calc_modis() |>
  ... |>
  writeRDS()
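The inflate step in the first pipeline above could be sketched as follows. This is an illustration in base R on a plain data.frame, not amadeus code: `inflate_to_spatiotemporal()` is the hypothetical name used in this thread, and the `time_col` argument and the sample columns are assumptions.

```r
# Hypothetical helper, not part of amadeus: repeat every spatial row once
# per timestamp so static covariates line up with time-varying ones.
inflate_to_spatiotemporal <- function(locs, timestamps, time_col = "time") {
  stopifnot(is.data.frame(locs), !time_col %in% names(locs))
  times <- data.frame(timestamps)
  names(times) <- time_col
  # merge() with by = NULL returns the Cartesian product of the two inputs
  merge(locs, times, by = NULL)
}

# Made-up spatial sample with one static covariate already calculated
my_spatial_sample <- data.frame(
  site_id = c("a", "b"),
  lon = c(-78.9, -79.1),
  lat = c(35.9, 36.0),
  elevation = c(120, 95)
)
timestamps <- as.Date("2022-01-01") + 0:2

# 2 locations x 3 timestamps = 6 rows
inflated <- inflate_to_spatiotemporal(my_spatial_sample, timestamps)
```

Because the helper is separate from the calc_ functions, the compact spatial table can be stored and inflated later without recalculating anything.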

mitchellmanware commented 2 months ago

I think one option is to update the static calc functions to have an inflate parameter. If inflate = TRUE, the function automatically returns a spatio-temporal data frame (the "feed AI models" example), whereas if inflate = FALSE, it returns a list with a vector of dates and a single spatial data frame (the efficiency example).

Either way, refactoring the calc_ functions to retain the columns from locs for use in a pipe should not be too difficult.
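As a rough illustration of retaining locs columns through a pipe, here is a base R sketch. `calc_sketch()` is a hypothetical stand-in for a calc_* function and does not reflect the actual amadeus signatures; the covariate formulas are placeholders.

```r
# Hypothetical stand-in for a calc_* function: it appends one covariate
# column to `locs` and keeps every column that was already there.
calc_sketch <- function(locs, name, fun) {
  stopifnot(is.data.frame(locs), !name %in% names(locs))
  locs[[name]] <- fun(locs)
  locs
}

locs <- data.frame(
  site_id = c("a", "b"),
  lon = c(-78.9, -79.1),
  lat = c(35.9, 36.0)
)

# Each step receives the previous output as its locs, so columns accumulate
result <- locs |>
  calc_sketch("cov_1", function(d) d$lon + d$lat) |>
  calc_sketch("cov_2", function(d) d$cov_1 * 2)
```

The check on `name` guards against a later step silently overwriting a column produced by an earlier one.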

mitchellmanware commented 2 months ago

Something like this

if (inflate) {
  # inflated result: a single merged data frame
  message("Returning a data.frame with ... because inflate = TRUE")
  inflated <- merge(dates, data.frame, all = TRUE)
  return(inflated)
} else {
  # non-inflated result: dates and the spatial data frame kept separate
  message("Returning a list with ... because inflate = FALSE")
  return(list(dates, data.frame))
}

eva0marques commented 2 months ago

Yes, that is also an interesting solution. I would still make the inflate() function available to amadeus users because they might want to use it separately. For example, you store the non-inflated sample, reopen it, and use the inflate function without recalculating everything.

sigmafelix commented 2 months ago

@eva0marques

Sorry I am late to the discussion. As @mitchellmanware suggested, I think a hands-on solution is to add several lines to calc_return_locs along with an inflate argument. One thing to consider is how the "full" space-time combinations are inferred or furnished. This can be implemented either by using a fixed set of field names (i.e., lon, lat, and time) or by adding an argument for a full space-time combination template (built with expand.grid, for example). I think the former is the more hands-on solution, since we can easily use set operations to detect the common field names and determine what to join and what to expand. I have already added some functions to do this in beethoven, so I'd be happy to make changes to whichever functions we agree to update to implement this functionality.
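The two options mentioned above can be sketched in base R as follows. This is only an illustration with made-up tables and column values; it is not the beethoven implementation, and the key names `lon`, `lat`, and `time` are the fixed field names assumed in the comment.

```r
# Assumed fixed field names for the space-time key (an illustration only)
key_fields <- c("lon", "lat", "time")

static_cov <- data.frame(
  lon = c(-78.9, -79.1),
  lat = c(35.9, 36.0),
  elevation = c(120, 95)
)
dynamic_cov <- data.frame(
  lon = rep(c(-78.9, -79.1), each = 2),
  lat = rep(c(35.9, 36.0), each = 2),
  time = rep(as.Date("2022-01-01") + 0:1, times = 2),
  temperature = c(5.1, 6.3, 4.8, 5.9)
)

# Option 1: join on the key fields both tables share. "time" is missing
# from the static table, so its rows are repeated across the time steps.
join_by <- Reduce(
  intersect,
  list(key_fields, names(static_cov), names(dynamic_cov))
)
joined <- merge(static_cov, dynamic_cov, by = join_by)

# Option 2: an explicit full space-time template built with expand.grid()
template <- expand.grid(
  site = seq_len(nrow(static_cov)),
  time = unique(dynamic_cov$time)
)
```

With option 1 the join keys follow from the column names alone, which is why no extra argument is needed.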

sigmafelix commented 2 months ago

As a side note, if we are aiming to make calc_* functions pipeable, the default value of inflate (or the equivalent argument) should be TRUE.

eva0marques commented 2 months ago

I've implemented my idea (my comment above) in my own project because it was the most optimized and flexible setup. It works pretty well, and I can share my feedback if you are interested.