`epi_df` structure - Githubissues

cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology

https://cmu-delphi.github.io/epiprocess/

Other

13 stars 9 forks source link

`epi_df` structure #482

Closed dajmcdon closed 2 weeks ago

dajmcdon commented 1 month ago

Possibly worth adding some sorting to as_epi_df. Some ideas,

Columns are always time_value, geo_value, additional keys, other columns
Rows always vary time_value fastest, then geo_value, etc.

For the second, perhaps add a dplyr::arrange.epi_df() method.

brookslogan commented 1 month ago

On point 1., I polled some preferences regarding data-fetching argument ordering which might be relevant.

On point 2., I'm a little confused. I think arrange already works properly on epi_dfs; is it not working for you, or are you saying it should be forbidden? Related: what do you think about these approaches?:

A. epi_dfs should always have rows sorted in this manner (so we should also make sure to maintain this order after any manipulations); arrange.epi_df() should output an explanatory error message.
B. epi_dfs should be sorted upon construction (maybe only by default, controlled by an arrange arg to as_epi_df), e.g., by calling arrange() at some point, but user should be able to later re-arrange() as they'd like
C. Users should be able to choose either behavior by using a constructor argument (e.g., keep_arranged).

dajmcdon commented 1 month ago

Point 2.

Hmmm, yes, arrange() works. I guess I was imagining a function (possibly different from arrange() that would rearrange to a standard ordering if requested. We don't even have this in our saved data, for example, compare epiprocess::jhu_csse_daily_subset to epipredict::case_death_rate_subset.

dshemetov commented 1 month ago

RE point 1, we looked into similar polling when deciding for argument orders for epidatr and came to: location > time > other. Sorting by geo value and then time value seems most performant to me too, since group_by(geo_value) is common for time series operations. Do we not already do this in the constructor? Seems like something we should do. (I guess I'm assuming tibbles maintain some metadata about how they were sorted last that they can leverage to speedup groupby or other operations; not sure if this is actually the case.)

RE 2, I'm not really seeing the benefit of a "default" arrange function over a generic one that can accept different orders as arguments. Using existing arrange to sort at construction sounds good.