more flexible specification of columns identifying units of analysis for epi_signal

cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology

https://cmu-delphi.github.io/epiprocess/

Other

13 stars 9 forks source link

more flexible specification of columns identifying units of analysis for epi_signal #12

Closed elray1 closed 2 years ago

elray1 commented 3 years ago

The main example I'm imagining supporting is flusurv hospitalizations, which report hospitalization rates for each combination of location and age group. For example, a user might want to be able to apply sliding functions to each series corresponding to one combination of these factors. I think supporting this could happen with two changes:

Allow users to specify something like unit_ids when creating an epi_signal. This should be a character vector of column names in the data set, and would be stored in the metadata. This could default to "geo_value" to match current behaviors. A user working with flusurv data might specify unit_ids = c("geo_value", "age_group").
Maybe rename slide_by_geo to slide_by_unit? And in that function change the group_by to group by the id columns stored in the metadata.

ryantibs commented 3 years ago

Yep, great idea, again @elray1. Thanks! Please check out my reply on #10, and see if you agree with where I ended up, then I can try to address the current suggestion in a refactor as well.

One note I forgot to mention on #10, applies equally-well here. I think making the structure more flexible as in #10 and the current issue is going to force us to forgo inheriting from tsibble, see #7. At least in the most general case (with multiple columns representing signals and units). I think tsibble has a single key and index. But I haven't read about it in detail. Maybe you could check out tsibble and see if you agree that it'd be too rigid? Would be good to know that we'd be forgoing that if we do indeed refactor as suggested here and on #10.

elray1 commented 3 years ago

It looks like the key in tsibble can support multiple columns, so I think it should work ok. Maybe we should use the key terminology to match what's already done in tsibble rather than the unit terminology i suggested in this issue.

ryantibs commented 3 years ago

Great! Thanks for checking that. So it looks the index column in tsibble must remain a single column (time_value for us) but I guess that's not a big deal.

So tsibble offers the flexibility we need then. And then it looks like then a single big refactor can be used to address #12, #10, and #7 all at once.

One minor question before I attempt anything, which I should probably ask on #7, but just for convenience let me ask here. Should we keep the column naming scheme index and key to be exactly as in tsibble? Meaning, no time_value and geo_value named columns?

elray1 commented 3 years ago

Yeah, I guess I have a weak preference to generally match what's being done for tsibble and avoid introducing new terms.

jacobbien commented 3 years ago

This sounds good to me. So the function would then be named slide_by_key?

brookslogan commented 2 years ago

Another type of "unit" to consider: forecasting tasks. E.g., in the AR example, ahead should be another part of the key; if we tried to use slide_by_geo on rbound results for multiple aheads, I would expect it to break / give invalid results.

ryantibs commented 2 years ago

Just a heads up that I'm going to close this with #14.