Closed · elray1 closed this issue 3 years ago
A couple other options I was coincidentally pondering a couple days ago, not sure they have advantages over the above though:
Thanks @elray1 for jumping in with this great suggestion. I like many aspects of this proposal!
I'm just not sure I see yet a reason to have this completely replace the current `epi_signal` format, rather than be a viable alternative format.
I see advantages to both. The disadvantages of the current format are the ones you listed.
The disadvantage of the format that you're suggesting is that the column names themselves may have to contain info in some applications that you may want to use downstream, and extracting this info from the column names could be brittle. For example, for the simple AR forecaster which outputs three predictive quantiles, we'd have column names like `quantile_level_0.1`, `quantile_level_0.5`, and `quantile_level_0.9`, i.e., the quantile level is here part of the name. (Or the quantile level is stored in the metadata?) So extracting the quantile level for downstream use might be even more work than it would be in the original list-column format.
So here is a suggestion about how we could proceed.

1. `epi_signal` format remains as it is currently, and `value` is a list column if you have multiple signals you want to track. It's the "tersest" format.
2. `tidyr::unnest()` on the `value` column gives you an object in either `epi_signal_long` format or `epi_signal_wide` format, depending on the dimensions of the entries in the `value` column (if each list entry is a data frame of length 1, it's wide; otherwise it's long). You could move in between these using `tidyr::pivot_wider()` and `tidyr::pivot_longer()`.
    - 2a. `epi_signal_long` format would give you two columns, say `quantile_value` and `quantile_level`.
    - 2b. `epi_signal_wide` format would give you three columns, say `quantile_level_0.1`, `quantile_level_0.5`, and `quantile_level_0.9`.
3. Instead of subjecting the user to writing their own wrangling code to go in between formats 1 and 2a and 2b, we could just write some convenience functions (based on `tidyr::unnest()`, `tidyr::pivot_wider()`, etc.). These convenience functions would also populate the metadata appropriately, so that, e.g., in wide format the names of the signal columns get stored in the metadata.
The advantage to supporting formats 1 and 2a and 2b is flexibility. If you happen to like format 2b and I happen to like format 1, and they're both supported and we can easily transform back and forth between them, then we're both happy. The disadvantage is maintenance cost. If we have multiple formats that we support then it's more work to make sure that the package's functions work with all formats ...
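To make the 2a ↔ 2b moves concrete, here is a minimal sketch with hypothetical toy data (column names follow the quantile example above; none of this is package code):

```r
library(tidyr)

# Hypothetical epi_signal_long: one row per (geo_value, time_value, quantile_level)
long <- tibble::tribble(
  ~geo_value, ~time_value,           ~quantile_level, ~quantile_value,
  "ca",       as.Date("2021-08-01"), 0.1,             100,
  "ca",       as.Date("2021-08-01"), 0.5,             150,
  "ca",       as.Date("2021-08-01"), 0.9,             210
)

# Long -> wide: one quantile_level_* column per quantile level
wide <- pivot_wider(long,
                    names_from   = quantile_level,
                    names_prefix = "quantile_level_",
                    values_from  = quantile_value)

# Wide -> long again, recovering the numeric quantile level from the name
long2 <- pivot_longer(wide,
                      cols            = starts_with("quantile_level_"),
                      names_prefix    = "quantile_level_",
                      names_to        = "quantile_level",
                      names_transform = list(quantile_level = as.numeric),
                      values_to       = "quantile_value")
```

The `names_transform` step is exactly the "extract info from the column names" operation discussed above: doable, but it depends on the naming convention.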
What do you think?
After having slept on it, I think I already dislike my suggestion of having these three types, purely for maintenance reasons. It's going to be a lot of work to support `epi_signal_long` in particular.
Also, your suggested data structure, `epi_signal_wide` in my nomenclature, is actually just purely more general than the current `epi_signal` format, because if we still allow for list columns, then of course I could still define a column called `value` with all the things I want in it.
So my general take at this point is that we should go with the structure you're suggesting (but just keep it called `epi_signal`), allow for list columns, and if the user wants to play with long data frames, then they can do that of their own accord, but that won't be an official data type in `epitools`.
In the spirit of moving fast 🙂, I'm thinking about just implementing this later today. Any reactions @elray1 @dajmcdon @jacobbien? (I would tag Logan but I'm hoping he's actually going to stop reading GitHub these next few days and take some time off!).
This sounds reasonable to me. A small hanging piece is whether we'd want to somehow track metadata for different signals in a list column (e.g. via a nested list?) so that they could be "unpacked" correctly when unnesting the column. But my instinct is to not worry about that for now.
Of the various formats, `epi_signal_wide` seems like the least avoidable:

- functions like `cor` are easier to apply

In my current forecast exploration, I have also been trying to use the nested format. I do prefer this over the unnamed-list-of-signal-dfs format, as filtering/joining to particular signals is easier. Its purpose currently seems to be just to get from a covidcast API response (list-of-dfs format) to the wide format, applying lags to covariates along the way. E.g., we can specify the desired covariates along with the time shifts to apply in a tibble such as the following, and left-join it to the nested format to check that the desired signals are all there and to extract the desired features:

```r
covariate.lags = tribble(
  ~data_source, ~signal,                    ~shift,
  "jhu-csse",   "confirmed_incidence_num",  1:3,
  "jhu-csse",   "confirmed_incidence_prop", 1:5,
)
```

But lags can be applied in the wide format as well. It's just that specifying them now would be a little awkward, because one must rely on a certain naming convention and transform things like the tibble above to use that convention. With the wide+metadata idea from Evan, maybe we could make it work more naturally.
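For what it's worth, applying lags directly in the wide format can be sketched as follows (toy data; the `_lag1`/`_lag2` column names are a hypothetical convention, which is exactly the naming-convention awkwardness mentioned above):

```r
library(dplyr)

# Hypothetical wide-format data: one row per (geo_value, time_value)
wide <- tibble::tibble(
  geo_value  = "ca",
  time_value = as.Date("2021-08-01") + 0:4,
  confirmed_incidence_num = c(10, 12, 9, 15, 11)
)

# Apply shifts within each geo_value; the shift amount is encoded in the name
wide_lagged <- wide %>%
  group_by(geo_value) %>%
  arrange(time_value, .by_group = TRUE) %>%
  mutate(
    confirmed_incidence_num_lag1 = lag(confirmed_incidence_num, 1),
    confirmed_incidence_num_lag2 = lag(confirmed_incidence_num, 2)
  ) %>%
  ungroup()
```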
One use case that is always a bit awkward for me with wide data, though, is ggplotting. I am still not comfortable with the new `pivot_longer`, and having to pivot-filter, then plot, is a pain. This might be an argument for having a long df format (although it's space-inefficient), or maybe just for having some sort of specialized plot function that lets one tidyselect the value columns to ggplot, and performs the select-pivot-ggplot itself. Or maybe an argument for me to just get good with tidyr+ggplot.
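One version of that specialized plot function (entirely hypothetical, not part of any package discussed here) might look like:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical convenience function: tidyselect the value columns,
# pivot them long internally, then draw one panel per signal
plot_signals <- function(x, ...) {
  x %>%
    pivot_longer(c(...), names_to = "signal", values_to = "value") %>%
    ggplot(aes(time_value, value, color = geo_value)) +
    geom_line() +
    facet_wrap(vars(signal), scales = "free_y")
}

# Usage sketch, assuming quantile columns as in the earlier example:
# plot_signals(x, starts_with("quantile_level_"))
```

The user never sees the long intermediate; it exists only inside the function call.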
A trivial yet important issue: naming. `epi_signal_wide` seems sort of like a misnomer, and having `epi_signal` be a wide-ish df also is weird (I would expect a single value column for an `epi_signal`). If there were still a class `epi_signal`, then it'd make sense to call the wide format `epi_signals_wide`. But if we want just a single format, I'm drawing a blank on a more descriptive name. Any ideas?
I agree completely with everything @brookslogan says above about the wide format for forecasting. There are a handful of functions inside `modeltools` that go from the list-of-`data.frame`s format from `covidcast` to x, y for a forecaster. The advantage of these is that their test coverage is good. The disadvantages are that they're a bit brittle, add a lot of complexity to the process of creating a forecaster, and assume a lot about data formatting. But maybe worth checking out (both to see how much of a pain this is, and in case you need to borrow some of it):

Both assume that you have applied `covidcast::aggregate_signals(dt = lags, format = "wide")`.

- `add_response_columns()`: creates leads for multiple aheads and names them with a `response-` prefix for easy access. Still a wide df.
- `create_train_and_predict_matrices()`: grabs the appropriate rows for `x` and `y` for the desired targets and amount of training data. Also creates `newx` for predictions.

Thanks for the pointers @dajmcdon. Yes, these functions are quite handy, but I expect some of the brittleness to come into play with the hhs data set.
- `aggregate_signals` keeps the `time_value` set for each df in the list unchanged when applying lags, then joins them together. This might introduce some spurious NAs when the `time_value` sets differ. I'm not sure this is a big deal though.
- `create_train_and_predict_matrices` + `animalia::production_forecaster` or `modeltools::quantgen_forecaster` seem to have an issue with getting the right test instance data.

So I was trying to patch things up and make the forecaster's expectations more explicit, and it seemed easier to move to the nested format and try to make similar functions that may eventually become PRs to the originals, or something to put in `epipred`.
Yes thanks for all the helpful points @brookslogan! And welcome back 🙂.
In my PR #14, I've already moved everything to wide format, but without distinguishing it from the original class. Now everything's called `epi_signal`. The point about naming is well-taken. How about `epi_tibble`?
By the way, as a follow-up comment, I don't know that it's super helpful to actually distinguish between wide and long formats, necessarily, in terms of object types. An `epi_tibble` is a thing that has a `geo_value` column, a `time_value` column, and then a bunch of other variables (which can be considered as measured variables).

The user has the ability to "do stuff" to it, with the utility functions. These utilities do not explicitly check whether or not every row is uniquely determined by `geo_value` and `time_value` (as they would have to in "wide" format). Should they? I'm not sure.
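If the utilities did want to check, the test itself is cheap; a sketch (hypothetical helper, not in the package):

```r
# Does (geo_value, time_value) uniquely key the rows,
# i.e., is the object "wide" in the sense discussed above?
is_uniquely_keyed <- function(x) {
  nrow(dplyr::distinct(x, geo_value, time_value)) == nrow(x)
}
```

A long-format object with a `quantile_level` column, for instance, would return `FALSE`, since each (geo, time) pair repeats once per level.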
I think I would rather write clear documentation and leave it up to the user to make sure that the intended application makes sense. And provide them with plenty of examples via vignettes of wrangling before and after applying the utilities (and maybe some extra utilities for convenient wrangling).
I would also like to provide them the ability to modify the utilities so that they group on more than just `geo_value` (as they might have to in "long" format).
What do you think? This is my current plan.
Just a heads up that I'm going to close this with #14.
A common use case will be models that want to interact with multiple signals, e.g., cases, hospitalizations, deaths, and perhaps other covariates like mobility data. It would be ideal if this package supported this use case in a complete way. I see two concerns at the moment:

Currently, the package requires a column to be named `value`, with the name of the signal stored as metadata. This is slightly awkward in a couple of ways:

- `value` is not a very meaningful term; we have to inspect the metadata to figure out what it is the data frame is tracking, rather than just looking at the column names.

Here's a proposal:

- Instead of a single required `value` column, the metadata contains `value_names`, a vector of names of columns in the data frame that contain signals of interest. This is required to have length at least 1.
- Other metadata can be tracked separately for each column in `value_names`. This has the advantage of optionally explicitly tracking things like issue dates and signal types separately for each signal in the data set. (For example, tracking issue dates separately may be beneficial if I'm modeling signals that don't all have regular release schedules that coincide on the same day.)
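As a rough illustration of the proposed shape (purely a sketch: the `"metadata"` attribute name and its layout are invented here for illustration, not an actual API):

```r
# Wide data frame with two signal columns
x <- tibble::tibble(
  geo_value  = c("ma", "ma"),
  time_value = as.Date(c("2021-08-01", "2021-08-02")),
  cases      = c(120, 135),
  hosp       = c(8, 10)
)

# value_names lists the signal columns; other fields (here, issue dates)
# can then be tracked per signal rather than once for the whole object
attr(x, "metadata") <- list(
  value_names = c("cases", "hosp"),
  issue = list(cases = as.Date("2021-08-04"),
               hosp  = as.Date("2021-08-03"))
)
```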