HopkinsIDD / flepiMoP

The Flexible Epidemic Modeling Pipeline
https://flepimop.org
GNU General Public License v3.0

Structure of simulation output is not generalizable to n-dimensional epi models (n subpopulations) #253

Open twallema opened 4 months ago

twallema commented 4 months ago

Problem description

To familiarize myself with flepi, I've simulated the two-subpopulation model that lives in ~/examples/tutorial_two_subpopulations. This model is a space-stratified SEIR model simulated for a 'large_province' and a 'small_province'. The output of the model, as loaded from the .parquet (why?) output folder using pandas, has the following format,

    mc_value_type  mc_infection_stage  large_province  small_province        date
0   incidence      S                         0.000000        0.000000  2020-02-01
1   incidence      E                         0.271162        0.000307  2020-02-01
2   incidence      I                         1.127310        0.000024  2020-02-01
3   incidence      R                         0.109133        0.000001  2020-02-01
4   incidence      S                         0.000000        0.000000  2020-02-02
...        ...   ...                              ...             ...         ...
963 prevalence     R                      8012.316100      876.539201  2020-05-30
964 prevalence     S                       971.013228      111.283609  2020-05-31
965 prevalence     E                         4.006324        2.945311  2020-05-31
966 prevalence     I                        10.459541        7.625265  2020-05-31
967 prevalence     R                      8014.520907      878.145815  2020-05-31

What strikes me as peculiar in this output is that the subpopulations 'large_province' and 'small_province' appear as columns with the numerical simulation output listed under them, rather than as values in an 'index' column. This massively complicates aggregations and does not scale to n subpopulations. Aggregating prevalence over the spatial component in this model can be done as follows,

df[df['mc_value_type'] == 'prevalence']['small_province'] + df[df['mc_value_type'] == 'prevalence']['large_province']

Although this is simple for 2 subpopulations, imagine what you have to do when you go from 2 spatial units to 48 US states:

  1. Extract the 48 names of the spatial units (the spatial 'coordinates'). You would do this with a call to df.columns, but the names of your spatial units are only a subset of the column names, so you wind up doing something hacky like df.columns[3:].
  2. You then set up a for loop over the 48 spatial units, slice the output, and sum.
  3. You need additional code if you don't want a numpy array as output.
# get column names
spatial_units = df.columns[3:]
# aggregate
data = 0
for su in spatial_units:
    data += df[df['mc_value_type'] == 'prevalence'][su].values

Using for loops to aggregate data is computationally inefficient compared to pandas groupby operations. Further, if the epidemiological model had two dimensions, say 'age' and 'space', this procedure would become even more complicated. In that case, how would the columns even be named? How would you extract the coordinates of your two dimensions from those column names? Plus, to aggregate over both dimensions you would wind up using a nested for loop. And if I want to simulate 581 municipalities and 4 age groups, do I wind up with 581 × 4 = 2324 columns in the simulation output?
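For comparison, here is a minimal sketch of the same aggregation once the frame is melted to long form (assuming df is the tutorial output loaded above and the stratification column is named mc_infection_stage, as in the documentation):

# melt the subpop columns into long form: one row per (value type, stage, subpop, date)
long_df = df.melt(
    id_vars=["mc_value_type", "mc_infection_stage", "date"],
    var_name="subpop",
    value_name="value",
)

# aggregate prevalence over all subpops with one vectorized groupby, no loops needed
prevalence = (
    long_df[long_df["mc_value_type"] == "prevalence"]
    .groupby(["date", "mc_infection_stage"])["value"]
    .sum()
)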

Proposed (short-term) fix

What seems most natural to me is that the columns 'incidence' and 'prevalence' are 'data variables', while 'date', 'space', and 'disease state' are index values (there is an 'incidence' value and a 'prevalence' value for every combination of 'date'-'space'-'disease state'). Under that paradigm, the following output structure feels more natural,

    date        spatial_patch   state  incidence  prevalence
0   2020-02-01  small_province  S      0.000000   1000.000000
1   2020-02-01  small_province  E      0.000307   0.000000
2   2020-02-01  small_province  I      0.000024   0.000000
3   2020-02-01  small_province  R      0.000001   0.000000
4   2020-02-01  large_province  S      0.000000   8995.000000
... ...         ...             ...    ...        ...
963 2020-05-31  small_province  R      1.448748   878.145815
964 2020-05-31  large_province  S      0.000000   971.013228
965 2020-05-31  large_province  E      0.552228   4.006324
966 2020-05-31  large_province  I      0.950878   10.459541
967 2020-05-31  large_province  R      1.986548   8014.520907

This structure contains a column named after the spatial 'dimension' ('spatial_patch'), whose values are the 'coordinates' of that dimension ('small_province', 'large_province'). Structuring the output this way makes it much more natural to slice out values for a given spatial unit, or to perform aggregations. For instance, if we want to sum the prevalence over the spatial patches,

df.groupby(by=['date', 'state']).agg({'prevalence': 'sum'})

This is generalizable to n model stratifications.

Proposed (long-term) fix

Exploit xarray datasets to return simulation results; these use "labeled dimensions" to represent model states and make aggregations trivial (see the xarray documentation).

For an example of why structuring simulation results using xarray is so powerful, see the quickstart tutorial of pySODM.
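To illustrate what labeled dimensions buy you, a small toy example (all names and values here are made up to mirror the tutorial model):

import numpy as np
import pandas as pd
import xarray as xr

# toy dataset with the three index dimensions discussed above
states = xr.Dataset(
    data_vars=dict(
        prevalence=(["date", "state", "subpop"], np.random.rand(121, 4, 2)),
    ),
    coords=dict(
        date=pd.date_range("2020-02-01", "2020-05-31", freq="D"),
        state=["S", "E", "I", "R"],
        subpop=["small_province", "large_province"],
    ),
)

# aggregating over any dimension is a one-liner, however many dimensions there are
total_prevalence = states["prevalence"].sum(dim="subpop")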

saraloo commented 4 months ago

I'm a bit confused about the issue here. I'm not a Python expert so can't comment on efficiency (and I thought we do use xarray, but @jcblemai knows more here), but just a few comments on the file structure and extending to more subpopulations and compartmental stratification.

Note that this is the standard structure we have been using for the 50 US states, where there is complex stratification by vaccination, variant, and age. See:

(Screenshot: example SEIR output file showing the mc_ column structure used for the 50 US states.)

For your question regarding the column names, see the brief description of the seir files in the documentation on model outputs. These columns are always named in the same way, and the subpops (note we use the term subpop rather than spatial patch) should be extracted from the geodata file defined in the config, so there shouldn't be any need to do anything 'hacky' to get the column names. The stratification is defined in the compartments section of the config, and these are written as the separate columns in the screenshot above, e.g.

compartments:
  infection_stage: ["S", "W", "E", "I1", "I2", "I3", "R"] 
  vaccination_stage: ["unvaccinated", "1dose", "2dose", "waned"] 
  variant_type: ["WILD", "ALPHA", "DELTA", "OMICRON"] 
  age_strata: ["age0to17", "age18to64", "age65to100"] 

These are then written as mc_[] columns for easy aggregation across strata of interest (i.e. if we just want to look at one specific age group). There is always the mc_value_type column, which defines prevalence or incidence, and also the additional mc_name column, which is the stitched-together name of each specific compartment that is used by gempyor.
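For instance, a sketch of aggregating across strata using these columns (the subpop codes here are made up; in practice they come from the geodata file):

# slice one age stratum and sum prevalence across the subpop columns
subpops = ["10001", "10002"]  # hypothetical subpop codes from the geodata file
mask = (df["mc_value_type"] == "prevalence") & (df["mc_age_strata"] == "age0to17")
age0to17_total = df[mask].groupby("date")[subpops].sum().sum(axis=1)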

There is also some functionality to look at seir outputs in the config precheck notebook which might be helpful.

I'm also not sure I follow your question about why these are saved as parquet files — these files are huge.

jcblemai commented 4 months ago

Thanks for opening this discussion @twallema @saraloo.

The output of the model, as loaded from the .parquet (why?)

If the question is why parquet: for columnar data and dataframes it is much (much, much) faster, more compact (by a lot), and far safer than csv. When we adopted it, it was quite rare in the epi community, but it has since become standard as data needs have increased.

Sara (thanks) provided an answer to how the meta-compartments would be named for several dimensions. The difference between a subpopulation and a meta-compartment is a bit subtle and more a choice of practice (how you want your inference to be done, and your output to be structured).

I just want to highlight that in your example, df.columns[3:] would fail for other configs, so it is not generalizable (as you guessed correctly).

To do it without loops I usually use something like:

df = gempyor.read_df(seir_filenames[0])
df = df.set_index("date")
geoids = [c for c in df.columns if "mc_" not in c]

and then I pivot or do my things.
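For example, one possible pivot from there (just a sketch, not a flepiMoP helper), continuing from the snippet above:

# melt the subpop columns to long form, keeping the mc_ metadata columns as identifiers
mc_cols = [c for c in df.columns if "mc_" in c]
long_df = df.reset_index().melt(
    id_vars=["date"] + mc_cols,
    value_vars=geoids,
    var_name="subpop",
    value_name="value",
)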

For your proposed short-term fix, the problem with long form is that it is slow and takes up a lot of space (you can try it), especially with many subpopulations. We did this once and switched to the current format for speed. Moreover, seir is (or actually was) mainly used for internal communication (between seir and outcomes, because you can run several outcomes on the same seir file, again for speed at the time), so speed of writing and wrangling was important, and wide data frames also win these benchmarks (vs the tidy data frame you are proposing).

For the long-term fix, this is actually the plan and the right way to do it, because a seir output is not a data frame but an ND-array, so it does not make sense to store it as a data frame. In fact, gempyor.seir already transmits the simulation results to other objects as an xarray:

# in seir.py
    # We return an xarray instead of an ndarray now
    compartment_coords = {}
    compartment_df = modinf.compartments.get_compartments_explicitDF()
    # Iterate over columns of the DataFrame and populate the dictionary
    for column in compartment_df.columns:
        compartment_coords[column] = ("compartment", compartment_df[column].tolist())

    # compartment is a dimension with coordinates from each of the compartments
    states = xr.Dataset(
        data_vars=dict(
            prevalence=(["date", "compartment", "subpop"], seir_sim[0]),
            incidence=(["date", "compartment", "subpop"], seir_sim[1]),
        ),
        coords=dict(
            date=pd.date_range(modinf.ti, modinf.tf, freq="D"),
            **compartment_coords,
            subpop=modinf.subpop_struct.subpop_names,
        ),
        attrs=dict(description="Dynamical simulation results", run_id=modinf.in_run_id),  # TODO add more information
    )

If the gempyor flag autowrite_seir is False, then gempyor does not even write the seir file, which saves a lot of time. Now, I think to close this issue we should add the ability to save the (currently internal-only) xarray as a .hdf, and then update some postprocessing scripts. We are trying to move the internal representation of many of these objects from dataframes to arrays.
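That save step could look something like this (a sketch only; the file path is hypothetical, and xarray's netCDF-4 output is HDF5 under the hood):

import xarray as xr

# write the internal xarray Dataset to disk (netCDF-4 files are HDF5-based)
states.to_netcdf("model_output/seir/000000001.seir.nc")

# read it back with all labeled dimensions intact
states = xr.open_dataset("model_output/seir/000000001.seir.nc")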

Let me know if something is not covered in @saraloo and my answers.

twallema commented 4 months ago

Hey both, thanks for the swift replies to my inquiries. My confusion indeed arose from the "special" place the spatial component and mobility matrix hold among the possible model stratifications. I think this is sensible given that flepimop seems to revolve heavily around spatial patches with mobility between them. I'm guessing that if the model is 1D (as in one stratification), you could probably interchange space + mobility matrix and age + social contact matrix in its current form?

Anyway, nice to hear you are already considering using xarray. As I'm exploring, I've made some changes to seir.py on a local branch to have it output a copy of the xarray output by default (see screenshot). Some additional changes I would like to make:

  1. I would still like to remove the mc_ prefix from my infection_stage, so that the names of the compartments defined by the user in the config file are retained in the xarray output.
  2. Perhaps the parameters and their values could be attached as metadata to the dataset? This would eliminate the need for the extra folder.
  3. When making multiple simulations with nslots, we could regard this as adding another dimension to the N-d simulation output array. If the user wants 100 simulations, it would be nice to automatically add a new dimension to the xarray dataset named draws or samples that accommodates this. xarray would make it very easy to compute means and quantiles (see the sketch below).
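A rough sketch of what point 3 could look like (run_simulation is a hypothetical helper standing in for one gempyor run; the dimension name draw is illustrative):

import xarray as xr

# stack the nslots runs along a new "draw" dimension
# (run_simulation is a hypothetical helper returning one xr.Dataset per slot)
runs = [run_simulation(seed=i) for i in range(100)]
ensemble = xr.concat(runs, dim="draw")

# means and uncertainty intervals then reduce over the new dimension in one call
mean_prevalence = ensemble["prevalence"].mean(dim="draw")
quantiles = ensemble["prevalence"].quantile([0.025, 0.5, 0.975], dim="draw")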

I haven't had time to push this to a remote repository. I'm hesitant to open a PR because I don't yet know how to check whether I break stuff. I want to add age groups to my sample model and see if this still works when the model has 2 stratifications. Are there unit tests I can use to see if I break things?

(Screenshot: xarray output produced by the modified seir.py.)

saraloo commented 4 months ago

Yes to this:

I'm guessing if the model is 1D (as in one stratification), you may probably interchange space+mobility matrix and age+social contact matrix in its current form?

Regarding your proposed changes: don't change anything regarding file structure or column names in a PR. If you want to change things in your own version of anything you use, I have no issue with that, but changing these structures will have a lot of downstream issues (not just for the core software code but for all of our projects). Also note that any changes regarding metadata or arrays etc. would, in general, need to work in both R and Python, as that is where most of our inputs and outputs are processed right now.

For file size and efficiency it makes more sense to me to keep parameters separate, as SEIR files are already massive and we can just use parameter files to regenerate SEIR files etc. We also use these parameters to resume our slots/chains, so we need them separate to do this quickly and effectively without having to keep SEIR files (and seir files do not need to be saved anyway, as @jcblemai noted above).

Note that the points you mention can be done outside of the core code (post-processing), so I don't recommend changing any of the code at this moment. We could all have a discussion about file names or model output separately (e.g. relating also to #257), but this would be a larger structural conversation.

twallema commented 4 months ago

For the record, I switched to this code at the end of steps_SEIR() (in seir.py).

    # TODO: Simulation input arrives as a time x subpop x compartment_1 x compartment_2 x etc. n-D array
    # Incidence is computed in postprocessing

    # We return an xarray instead of an ndarray now
    compartment_coords = {}
    compartment_df = modinf.compartments.get_compartments_explicitDF()

    # Iterate over the stratification columns of the DataFrame and populate the
    # dictionary, dropping the "mc_" prefix from each column name
    for column in [col for col in compartment_df.columns if col != 'mc_name']:
        compartment_coords[column[3:]] = compartment_df[column].unique().tolist()

    # TODO: likely more orderly if dimensions are ordered: date, subpop, all remaining stratifications
    # construct a list of dimension names 
    dimnames = ["date",]
    for dim in compartment_coords.keys():
        dimnames.append(dim)
    dimnames.append("subpop")

    # construct the shape the actual N-d array coming from the integrator ideally should have
    desired_shape = [seir_sim[0].shape[0]] # time axis
    desired_shape.extend([len(coords) for coords in compartment_coords.values()]) # compartments
    desired_shape.append(seir_sim[0].shape[-1]) # subpops

    # compartment is a dimension with coordinates from each of the compartments
    states = xr.Dataset(
        data_vars=dict(
            prevalence=(dimnames, seir_sim[0].reshape(desired_shape)),
            incidence=(dimnames, seir_sim[1].reshape(desired_shape)),
        ),
        coords=dict(
            date=pd.date_range(modinf.ti, modinf.tf, freq="D"),
            **compartment_coords,
            subpop=modinf.subpop_struct.subpop_names,
        ),
        attrs=dict(description="Dynamical simulation results", run_id=modinf.in_run_id),  # TODO add parameters as metadata here? and omit having a separate folder for them?
    )

    return states

This outputs the simulation results as an xarray dataset with the following format (for an age- and space-structured SIR model):

(Screenshot: resulting xarray dataset for the age- and space-structured SIR model.)
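With this shape, aggregating over either stratification becomes a single labeled reduction. For example (the dimension names subpop and age_strata are assumed from the config keys):

# spatial aggregation: sum prevalence over subpops, keeping the age structure
by_age = states["prevalence"].sum(dim="subpop")

# or collapse both stratifications at once for a total per disease state and date
national = states["prevalence"].sum(dim=["subpop", "age_strata"])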

twallema commented 4 months ago

Linking to a tryout PR #264