twallema opened this issue 4 months ago
I'm a bit confused about the issue here. I'm not a Python expert so I can't comment on efficiency (and I thought we do use xarray, but @jcblemai knows more here), but here are a few comments on the file structure and on extending to more sub-populations and compartmental stratifications.
Note that this is the standard structure we have been using for the 50 US states, where there is complex stratification by vaccination, variant, and age. See:
For your question regarding the column names, see the brief description of the seir files in the documentation on model outputs. These columns are always named in the same way, and the subpops (note we use the term subpop rather than spatial patch) should be extracted from the geodata file defined in the config, so there shouldn't be any need to do anything 'hacky' to get the column names. The stratification is defined in the compartments section of the config, and these are written as separate columns in the screenshot above, e.g.
```yaml
compartments:
  infection_stage: ["S", "W", "E", "I1", "I2", "I3", "R"]
  vaccination_stage: ["unvaccinated", "1dose", "2dose", "waned"]
  variant_type: ["WILD", "ALPHA", "DELTA", "OMICRON"]
  age_strata: ["age0to17", "age18to64", "age65to100"]
```
These are then written as `mc_[]` columns for easy aggregation across strata of interest (i.e. if we just want to look at one specific age group). There is always the `mc_value_type` column, which defines prevalence or incidence, and also the additional `mc_name` column, which is the stitched-together name of each specific compartment used by gempyor.
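As a sketch of how those `mc_` columns enable aggregation (the column names follow the description above, but all values and subpop names below are made up for illustration), grouping on the metadata columns you want to keep sums out the rest:

```python
import pandas as pd

# Toy stand-in for a seir output file: mc_* metadata columns plus one
# value column per subpop (all names and numbers here are made up).
df = pd.DataFrame({
    "date": ["2021-01-01"] * 4,
    "mc_value_type": ["prevalence"] * 4,
    "mc_infection_stage": ["I1", "I1", "R", "R"],
    "mc_age_strata": ["age0to17", "age18to64"] * 2,
    "large_province": [10.0, 20.0, 5.0, 15.0],
    "small_province": [1.0, 2.0, 0.5, 1.5],
})

# Aggregate across age strata: group on every mc_ column except the one
# being summed out, then sum the per-subpop value columns.
keep = ["date", "mc_value_type", "mc_infection_stage"]
by_stage = df.groupby(keep, as_index=False)[["large_province", "small_province"]].sum()
```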
There is also some functionality to look at seir outputs in the config precheck notebook which might be helpful.
I'm also not sure I follow the "why not?" in your question about why these are saved as parquet files. These files are huge.
Thanks for opening this discussion @twallema @saraloo.
> The output of the model, as loaded from the .parquet (why?)
If the question is why parquet: for columnar data and dataframes it is much (much, much) faster, far more compact, and much safer than csv. When we adopted it, it was quite rare in the epi community, but it has since become standard as data needs increased.
Sara (thanks) provided an answer to how the meta-compartment would be named for several dimensions. The difference between a subpopulation and a meta-compartment is a bit subtle and more of a choice of practice (how you want your inference to be done, and your output to be structured).
I just want to highlight that, in your examples, `df.columns[3:]` would fail for other configs, so it is not generalizable (as you guessed correctly).
To do it without loops I usually use something like:
```python
df = gempyor.read_df(seir_filenames[0])
df = df.set_index("date")
geoids = [c for c in df.columns if "mc_" not in c]
```
and then I pivot or do my things.
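For instance, with a `geoids` list like the one above, summing prevalence over all subpops no longer requires hard-coding their names. A minimal sketch with a made-up frame standing in for a file read with `gempyor.read_df`:

```python
import pandas as pd

# Toy stand-in for one seir file (illustrative column names and values).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "mc_value_type": ["prevalence"] * 2,
    "mc_infection_stage": ["I1"] * 2,
    "large_province": [10.0, 12.0],
    "small_province": [1.0, 2.0],
}).set_index("date")

# Subpop columns are everything that is not an mc_ metadata column.
geoids = [c for c in df.columns if "mc_" not in c]

# Sum over all subpops without hard-coding their names.
total = df[geoids].sum(axis=1)
```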
For your proposed short-term fix, the problem with long form is that it is slow and takes up a lot of space (you can try it), especially with many subpopulations. We did this once and switched to the current format for speed. Moreover, seir is (or rather was) mainly used for internal communication (between seir and outcomes, because you can run several outcomes on the same seir file, again for speed at the time), so speed of writing and wrangling was important, and wide data frames also win that benchmark (vs the tidy data frame you are proposing).
For the long-term fix, this is actually the plan and the right way to do it, because a seir output is not a data frame but an N-D array, so it does not make sense to store it as a data frame. In fact, gempyor.seir already transmits the simulation results to other objects as an xarray:
```python
# in seir.py
# We return an xarray instead of a ndarray now
compartment_coords = {}
compartment_df = modinf.compartments.get_compartments_explicitDF()
# Iterate over columns of the DataFrame and populate the dictionary
for column in compartment_df.columns:
    compartment_coords[column] = ("compartment", compartment_df[column].tolist())
# compartment is a dimension with coordinates from each of the compartments
states = xr.Dataset(
    data_vars=dict(
        prevalence=(["date", "compartment", "subpop"], seir_sim[0]),
        incidence=(["date", "compartment", "subpop"], seir_sim[1]),
    ),
    coords=dict(
        date=pd.date_range(modinf.ti, modinf.tf, freq="D"),
        **compartment_coords,
        subpop=modinf.subpop_struct.subpop_names,
    ),
    attrs=dict(description="Dynamical simulation results", run_id=modinf.in_run_id),  # TODO add more information
)
```
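To illustrate what this structure buys you, here is a toy Dataset mirroring the shape built above (all coordinate values and numbers are made up): a "compartment" dimension carrying a coordinate variable, so selections and reductions are labeled rather than positional.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in for the `states` Dataset: a "compartment" dimension
# with a coordinate variable attached (illustrative names and values).
states = xr.Dataset(
    data_vars=dict(
        prevalence=(["date", "compartment", "subpop"],
                    np.arange(12, dtype=float).reshape(2, 3, 2)),
    ),
    coords=dict(
        date=pd.date_range("2021-01-01", periods=2, freq="D"),
        infection_stage=("compartment", ["S", "I", "R"]),
        subpop=["big", "small"],
    ),
)

# Labeled selection/reduction replaces column-name gymnastics:
infected = states["prevalence"].where(
    states["infection_stage"] == "I", drop=True
).sum(dim="subpop")
total = states["prevalence"].sum(dim=["compartment", "subpop"])
# states.to_netcdf("states.nc")  # one way to persist (needs a netCDF backend)
```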
If the gempyor flag `autowrite_seir` is False, then gempyor does not even write the seir file, which saves a lot of time. Now, I think to close this issue we should add the ability to save the (currently internal-only) xarray as a .hdf, and then update some postprocessing scripts. We are trying to move the internal representation of many of these objects from dataframes to arrays.
Let me know if something is not covered in @saraloo and my answers.
Hey both, thanks for the swift replies to my inquiries. My confusion indeed arose from the "special" place the spatial component and mobility matrix hold among the possible model stratifications. I think this is sensible, given that flepimop seems to revolve heavily around spatial patches with mobility between them. I'm guessing that if the model is 1D (as in, one stratification), you could probably interchange space + mobility matrix and age + social contact matrix in its current form?
Anyway, nice to hear you are already considering using xarray. As I'm exploring, I've made some changes to seir.py on a local branch to have it output a copy of the xarray output by default (see screenshot). Some additional changes I would like to make:

- Drop the `mc_` prefix before my `infection_stage`, so that the names of the compartments defined by the user in the config file are retained in the xarray output.
- Regarding `nslots`: we could regard this as adding another dimension to the N-D simulation output array. If the user wants 100 simulations, it would be nice to automatically add a new dimension to the xarray dataset, named `draws` or `samples`, that accommodates this. xarray would make it very easy to compute means and quantiles.

Haven't had time to push this to a remote repository yet. I'm hesitant to open a PR because I don't yet know how to check if I break stuff. I want to add age groups to my sample model and see if this still works if the model has two stratifications. Are there unit tests I can use to see if I break things?
Yes to this:
> I'm guessing if the model is 1D (as in one stratification), you may probably interchange space+mobility matrix and age+social contact matrix in its current form?
Regarding your proposed changes: don't change anything regarding file structure or column names in a PR. If you want to change things in your own version of anything you use, I have no issue with that, but changing these structures would cause a lot of downstream issues (not just for the core software code but for all of our projects). Also note that any changes regarding metadata, arrays, etc. would, in general, need to work in both R and Python, as that is where most of our inputs and outputs are processed right now.
For file size and efficiency it makes more sense to me to keep parameters separate, as SEIR files are already massive and we can just use the parameter files to regenerate SEIR files. We also use these parameters to resume our slots/chains, so we need them separate to do this quickly and effectively without having to keep the SEIR files (and seir files do not need to be saved anyway, as @jcblemai noted above).
Note that the points you mention can be done outside of the core code (post-processing), so I don't recommend changing any of the code at this moment. We could all have a discussion about file names or model output separately (e.g. relating also to #257), but this would be a larger structural conversation.
For the record, I'm switching to this code at the end of `steps_SEIR()` (in seir.py):
```python
# TODO: Simulation input arrives as a time x subpop x compartment_1 x compartment_2 x etc. N-D array
# Incidence is computed in postprocessing
# We return an xarray instead of a ndarray now
compartment_coords = {}
compartment_df = modinf.compartments.get_compartments_explicitDF()
# Iterate over columns of the DataFrame and populate the dictionary
for column in [col for col in compartment_df.columns if col != "mc_name"]:
    compartment_coords[column[3:]] = compartment_df[column].unique().tolist()  # drop the "mc_" prefix
# TODO: likely more orderly if dimensions are ordered: date, subpop, all remaining stratifications
# construct a list of dimension names
dimnames = ["date"]
for dim in compartment_coords.keys():
    dimnames.append(dim)
dimnames.append("subpop")
# construct the shape the actual N-D array coming from the integrator ideally should have
desired_shape = [seir_sim[0].shape[0]]  # time axis
desired_shape.extend([len(coords) for coords in compartment_coords.values()])  # compartments
desired_shape.append(seir_sim[0].shape[-1])  # subpops
# each compartment stratification is a dimension with coordinates taken from the compartments
states = xr.Dataset(
    data_vars=dict(
        prevalence=(dimnames, seir_sim[0].reshape(desired_shape)),
        incidence=(dimnames, seir_sim[1].reshape(desired_shape)),
    ),
    coords=dict(
        date=pd.date_range(modinf.ti, modinf.tf, freq="D"),
        **compartment_coords,
        subpop=modinf.subpop_struct.subpop_names,
    ),
    attrs=dict(description="Dynamical simulation results", run_id=modinf.in_run_id),  # TODO add parameters as metadata here? and omit having a separate folder for them?
)
return states
```
This outputs the results as an xarray with the following format (for an age- and space-structured SIR model):
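With that structure, aggregating out a stratification becomes a one-line labeled reduction. A toy sketch (dimension names mimic the format above; coordinate values and numbers are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative stand-in for an age- and space-stratified output.
states = xr.Dataset(
    data_vars=dict(
        prevalence=(["date", "infection_stage", "age_strata", "subpop"],
                    np.ones((2, 3, 2, 2))),
    ),
    coords=dict(
        date=pd.date_range("2021-01-01", periods=2, freq="D"),
        infection_stage=["S", "I", "R"],
        age_strata=["age0to17", "age18to64"],
        subpop=["large_province", "small_province"],
    ),
)

# Sum out the age dimension, or both age and space, by name:
no_age = states["prevalence"].sum(dim="age_strata")
national = states["prevalence"].sum(dim=["age_strata", "subpop"])
```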
Linking to a tryout PR #264
Problem description
To familiarize myself with flepi, I've simulated the model with two subpopulations that lives in ~/examples/tutorial_two_subpopulations. This model is a space-stratified SEIR model simulated for a 'large_province' and a 'small_province'. The output of the model, as loaded from the .parquet (why?) output folder using pandas, has the following format:

The peculiarity to me in this output is that the subpopulations 'large_province' and 'small_province' appear as columns with the numerical simulation output listed under them, rather than as an 'index' column. This massively complicates aggregations and is not portable to `n` subpopulations. Aggregating prevalence over the spatial component in this model can be done as follows:

Although this is simple for 2 subpopulations, imagine what you have to do when you go from 2 spatial units to 48 US states. You could look at `df.columns`, but the names of your spatial units are only a subset of the column names, so you wind up doing something hacky like `df.columns[3:]`.

The use of for loops to aggregate data is computationally inefficient compared to pandas groupby operations. Further, if the epidemiological model had two dimensions, say 'age' and 'space', this procedure would become even more complicated. In that case, how would the columns even be named? How would you extract the coordinates of your two dimensions from those column names? Plus, to aggregate over both dimensions you would wind up using a nested for loop. If I want to simulate 581 municipalities and 4 age groups, do I wind up having 2400 columns in the simulation output?
Proposed (short-term) fix
What seems most natural is that the columns 'incidence' and 'prevalence' are 'data variables', while 'date', 'space', and 'disease state' are index values (there is an 'incidence' value and a 'prevalence' value for every combination of 'date', 'space', and 'disease state'). Under that paradigm, the following output structure feels more natural:

This contains a column named after the spatial 'dimension' ('spatial_patch'), with the column values being the 'coordinates' of that dimension ('small_province', 'large_province'). Structuring the output in this way makes it more natural to slice out values for a given spatial unit or to perform aggregations. For instance, if we want to sum the prevalence over the spatial patches:
This is generalizable to n model stratifications.
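A sketch of that aggregation on a hypothetical long-form frame (column names follow the proposal above; the values are made up):

```python
import pandas as pd

# Hypothetical long-form output: one row per
# date x spatial_patch x disease_state combination.
df = pd.DataFrame({
    "date": ["2021-01-01"] * 4,
    "spatial_patch": ["large_province", "small_province"] * 2,
    "disease_state": ["S", "S", "I", "I"],
    "prevalence": [90.0, 9.0, 10.0, 1.0],
    "incidence": [0.0, 0.0, 2.0, 1.0],
})

# Aggregating over the spatial dimension is a single groupby,
# no matter how many subpopulations there are.
national = df.groupby(["date", "disease_state"], as_index=False)[
    ["prevalence", "incidence"]
].sum()
```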
Proposed (long-term) fix
Exploit xarray datasets to return simulation results; they use "labeled dimensions" to represent model states and make aggregations very easy (see the xarray documentation). For an example of why structuring simulation results using xarray is so powerful, see the quickstart tutorial of pySODM.