cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology
https://cmu-delphi.github.io/epiprocess/
Other
13 stars 9 forks source link

`epi_df` structure #482

Closed dajmcdon closed 2 weeks ago

dajmcdon commented 1 month ago

Possibly worth adding some sorting to as_epi_df. Some ideas,

  1. Columns are always time_value, geo_value, additional keys, other columns
  2. Rows always vary time_value fastest, then geo_value, etc.

For the second, perhaps add a dplyr::arrange.epi_df() method.

brookslogan commented 1 month ago

On point 1., I polled some preferences regarding data-fetching argument ordering which might be relevant.

On point 2., I'm a little confused. I think arrange already works properly on epi_dfs; is it not working for you, or are you saying it should be forbidden? Related: what do you think about these approaches?:

dajmcdon commented 1 month ago

Point 2.

Hmmm, yes, arrange() works. I guess I was imagining a function (possibly different from arrange() that would rearrange to a standard ordering if requested. We don't even have this in our saved data, for example, compare epiprocess::jhu_csse_daily_subset to epipredict::case_death_rate_subset.

dshemetov commented 1 month ago

RE point 1, we looked into similar polling when deciding for argument orders for epidatr and came to: location > time > other. Sorting by geo value and then time value seems most performant to me too, since group_by(geo_value) is common for time series operations. Do we not already do this in the constructor? Seems like something we should do. (I guess I'm assuming tibbles maintain some metadata about how they were sorted last that they can leverage to speedup groupby or other operations; not sure if this is actually the case.)

RE 2, I'm not really seeing the benefit of a "default" arrange function over a generic one that can accept different orders as arguments. Using existing arrange to sort at construction sounds good.