Think about how to handle multiple signals

elray1 commented 3 years ago

A common use case will be models that want to interact with multiple signals, e.g., cases, hospitalizations, deaths, and perhaps other covariates like mobility data. It would be ideal if this package supported this use case in a complete way. I see two concerns at the moment:

Representation of the data/enforced use of a single value column
Metadata tracking signal characteristics

Currently, the package requires a column to be named value, with the name of the signal stored as metadata. This is slightly awkward in a couple of ways:

value is not a very meaningful term; we have to inspect the metadata to figure out what it is the data frame is tracking, rather than just looking at the column names.
as discussed on our call today, if I want to track multiple signals, there are two possible ways to do that: (a) by adding extra columns tracking the things I want. But this doesn't seem ideal because
- we essentially have to pick one of these columns as being "special" though there may not be reason to do so
- metadata would only describe the variable in the value column, not the other columns (b) by using a list column for the value column. But this doesn't seem ideal because:
- I think the user would need to unnest that list column before actually doing anything with the data like making plots or fitting models.
- there may not be enough flexibility in the metadata to track information about all the variables in the data set

Here's a proposal:

drop the requirement that a column be named value
instead, in the metadata track something like value_names, a vector of names of columns in the data frame that contain signals of interest. This is required to have length at least 1
Track all other metadata in a named list, with names matching the entries of value_names. This has the advantage of optionally explicitly tracking things like issue dates and signal types separately for each signal in the data set. (For example, tracking issue dates separately may be beneficial if I'm modeling signals that don't all have regular release schedules that coincide on the same day.)

brookslogan commented 3 years ago

A couple other options I was coincidentally pondering a couple days ago, not sure they have advantages over the above though:

make the metadata a tibble with columns equal to metadata (messier if we allow some signals to have different types of metadata than others). Maybe add helper functions allowing tidyselect on columns based on the column metadata.
use tibble-type column( group)s named based on the value of the most important metadata item, or nested tibble column groups based on multiple metadata items. Allows access in model formulas as e.g. signals$jhucsse$signal1
assign signal attributes to column attributes

ryantibs commented 3 years ago

Thanks @elray1 for jumping in with this great suggestion. I like many aspects of this proposal!

I'm just not sure I see yet a reason to have this completely replace the current epi_signal format, rather than be a viable alternative format.

I see advantages to both. The disadvantages of the current format are the ones you listed.

The disadvantage of the format that you're suggesting is that the column names themselves may have to contain info in some applications that you may want to use downstream, and extracting this info from the column names could be brittle. For example, for the simple AR forecaster which outputs three predictive quantiles, we'd have column names like quantile_level_0.1, quantile_level_0.5, and quantile_level_0.9, i.e., the quantile level is here part of the name. (Or the quantile level is stored in the metadata?) So extracting the quantile level for downstream use might be even more work than it would in the original list column format.

So here is a suggestion about we could proceed.

The epi_signal format remains as it is currently, and value is a list column if you have multiple signals you want to track. It's the "tersest" format.
Calling tidyr::unnest() on the value column gives you an object in either epi_signal_long format or epi_signal_wide format, depending on the dimensions of the entries in the value column (if each list entry is a data frame of length 1, it's wide; otherwise it's long). You could move in between these using dplyr::pivot_wider() and dplyr::pivot_longer().
a. In the simple AR example, the epi_signal_long format would give you two columns (say) quantile_value and quantile_level.
b. In the same example, the epi_signal_wide format would give you three columns (say) quantile_level_0.1, quantile_level_0.5, and quantile_level_0.9.

Instead of subjecting the user to write their own wrangling code to go in between formats 1 and 2a and 2b, we could just write some convenience functions (based on tidyr::unnest(), tidyr::pivot_wider(), etc.). These convenience functions would also populate the metadata appropriately, so that, e.g., in wide format the names of the signal columns get stored in the metadata.

The advantage to supporting formats 1 and 2a and 2b is flexibility. If you happen to like format 2b and I happen to like format 1, and they're both supported and we can easily transform back and forth between them, then we're both happy. The disadvantage is maintenance cost. If we have multiple formats that we support then it's more work to make sure that the package's functions work with all formats ...

What do you think?

ryantibs commented 3 years ago

After having slept on it, I think I already dislike my suggestion of having these three types purely for maintenance reasons. It's going to be a lot of work to support epi_signal_long in particular.

Also, your suggested data structure, epi_signal_wide in my nomenclature, is actually just purely more general than the current epi_signal format, because if we still allow for list columns, then of course I could still define a column called value with all the things I want in it.

So my general take at this point is that we should go with the structure you're suggesting (but just keep it called epi_signal), allow for list columns, and if the user wants to play with long data frames, then they can do it on their own accord but that won't be an official data type in epitools.

In the spirit of moving fast 🙂, I'm thinking about just implementing this later today. Any reactions @elray1 @dajmcdon @jacobbien? (I would tag Logan but I'm hoping he's actually going to stop reading GitHub these next few days and take some time off!).

elray1 commented 3 years ago

This sounds reasonable to me. A small hanging piece is whether we'd want to somehow track metadata for different signals in a list column (e.g. via a nested list?) so that they could be "unpacked" correctly when unnesting the column. But my instinct is to not worry about that for now.

brookslogan commented 3 years ago

Of the various formats, epi_signal_wide seems like the least avoidable:

feeding data into a regression fit requires data in this form, either to make x&y objects or in order to use model formulae
defining features in terms of >=2 signals is much more convenient in this form
defining transformations of individual signals is more convenient in this form
filtering to instances with non-missing data or other missingness handlers are easier to apply in this form
EDA stuff like cor are easier to apply

In my current forecast exploration, I have also been trying to use the nested format. I do prefer this over the unnamed-list-of-signal-df's format as filtering/joining to particular signals is easier. Its purpose currently seems to be just to get from a covidcast API response (list-of-dfs format) to the wide format, applying lags to covariates along the way. (E.g., we can specify the desired covariates along with the time shifts to apply in a tibble such as the following, and left-join it to the nest format to check that the desired signals are all there and to extract the desired features:

covariate.lags = tribble(
  ~data_source, ~signal, ~shift,
  "jhu-csse", "confirmed_incidence_num", 1:3,
  "jhu-csse", "confirmed_incidence_prop", 1:5,
)

) But lags can be applied in the wide format as well. It's just that specifying them now would be a little awkward because one must rely on a certain naming convention and transform things like the tibble above to use such a naming convention. With the wide+metadata idea from Evan, maybe we could make it work more naturally.

One use case that is always a bit awkward for me with wide data, though, is ggplotting. I am still not comfortable with the new pivot_longer and to have to pivot-filter-plot, then plot is a pain. This might be an argument for having a long df format (although it's space-inefficient), or maybe just having some sort of specialized plot function that allows one to tidyselect the value columns to ggplot, and perform select-pivot-ggplot. Or maybe an argument for me to just get good with tidyr+ggplot.

brookslogan commented 3 years ago

A trivial yet important issue: naming. epi_signal_wide seems sort of like a misnomer, and having epi_signal be a wide-ish df also is weird (I would expect a single value column for an epi_signal).

If there were still a class epi_signal, then it'd make sense to call the wide format epi_signals_wide. But if we want just a single format, I'm drawing a blank on a more descriptive name. Any ideas?

dajmcdon commented 3 years ago

I agree completely with everything @brookslogan says above about the wide format for forecasting. There are a handful of functions inside modeltools that go from the list of data.frames format from covidcast to x,y for a forecaster. The advantage of these is that their test coverage is good. The disadvantages are that they're a bit brittle, add a lot of complexity to the process of creating a forecaster, and they assume a lot about data formatting. But maybe worth checking out (both to see how much of a pain this is, and in case you need to borrow some of it):

Both assume that you have applied covidcast::aggregate_signals(dt = lags, format = "wide").

add_response_columns() Creates leads for multiple aheads and names them response- for easy access. Still a wide_df
create_train_and_predict_matrices() Grabs the appropriate rows for x and y for the desired targets and amount of training data. Also creates newx for predictions.

brookslogan commented 3 years ago

Thanks for the pointers @dajmcdon. Yes, these functions are quite handy, but I expect some of the brittle-ness to come into play with the hhs data set.

I believe aggregate_signals keeps the time_value set for each df in the list unchanged when applying lags, then joins them together. This might introduce some spurious NAs when the time_value sets differ. I'm not sure this is a big deal though.
create_train_and_predict_matrices + animalia::production_forecaster or modeltools::quantgen_forecaster seem to have an issue with getting the right test instance data

So I was trying to patch things up and make the forecaster's expectations more explicit, and it seemed easier to move to the nested format and try to make similar functions that may eventually become PRs to the originals or something to put in epipred.

ryantibs commented 3 years ago

Yes thanks for all the helpful points @brookslogan! And welcome back 🙂.

In my PR #14, I've already moved everything to wide format, but without distinguishing it from the original class. Now everything's called epi_signal. The point about naming is well-taken. How about epi_tibble?

ryantibs commented 3 years ago

By the way, as a follow-up comment, I don't know that it's super helpful to actually distinguish between wide and long formats, necessarily, in terms of object types. An epi_tibble is a thing that has a geo_value column, time_value column, and then a bunch of other variables (can be considered as measured variables).

The user has the ability "do stuff" to it, with the utility functions. These utilities do not explicitly check for whether of not every row is uniquely determined by geo_value and time_value (as they would have to in "wide" format). Should they? I'm not sure.

I think I would rather write clear documentation and leave it up to the user to make sure that the intended application makes sense. And provide them with plenty of examples via vignettes of wrangling before and after applying the utilities (and maybe some extra utilities for convenient wrangling).

I would also like to provide them the ability to modify the utilities so that they group on more than just geo_value (as they might have to in "long" format).

What do you think? This is my current plan.

ryantibs commented 3 years ago

Just a heads up that I'm going to close this with #14.

cmu-delphi / epiprocess

Think about how to handle multiple signals #10