HopkinsIDD / flepiMoP

The Flexible Epidemic Modeling Pipeline
https://flepimop.org
GNU General Public License v3.0

Model Output File Management Class For `gempyor` #257

Open TimothyWillard opened 3 months ago

TimothyWillard commented 3 months ago

File IO related to model output is a bit scattered at the moment and difficult to test. Assumptions about the output directory structure are also spread throughout the package, which makes them hard to change, let alone unit test, because the logic does not live in one place.

A helpful abstraction would be a `ModelOutput` class where each instance corresponds to a single model output folder. The class would have methods for reading and writing files of a particular type (e.g. hosp, spar), would accept an arbitrary `FolderIODriver` that handles reading and writing files in a folder (e.g. `ParquetFolderIODriver`, `CsvFolderIODriver`), and could be constructed from an existing folder for ease of use in post-processing and analysis.
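A minimal sketch of how such a class could fit together, using only the standard library. The class names come from the proposal above; everything else is an assumption made for illustration: the method names, the `<root>/<file_type>/<index>.<ext>` layout, and the use of lists of dicts as tables. It is not the `gempyor` implementation.

```python
import csv
import tempfile
from pathlib import Path
from typing import Protocol


class FolderIODriver(Protocol):
    """Hypothetical interface: reads and writes one table in one file format."""

    extension: str

    def write(self, path: Path, rows: list[dict[str, str]]) -> None: ...

    def read(self, path: Path) -> list[dict[str, str]]: ...


class CsvFolderIODriver:
    """CSV driver; a Parquet or netCDF driver would expose the same methods."""

    extension = "csv"

    def write(self, path, rows):
        if not rows:
            raise ValueError("rows must not be empty")
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    def read(self, path):
        with path.open(newline="") as f:
            return list(csv.DictReader(f))


class ModelOutput:
    """One instance per model output folder; path logic lives only here."""

    def __init__(self, root, driver):
        self.root = Path(root)
        self.driver = driver

    @classmethod
    def from_folder(cls, root, driver):
        # Construct from an existing folder, e.g. for post-processing.
        if not Path(root).is_dir():
            raise FileNotFoundError(root)
        return cls(root, driver)

    def path_for(self, file_type, index):
        # Assumed layout, NOT the real flepiMoP directory structure.
        return self.root / file_type / f"{index:09d}.{self.driver.extension}"

    def write(self, file_type, index, rows):
        self.driver.write(self.path_for(file_type, index), rows)

    def read(self, file_type, index):
        return self.driver.read(self.path_for(file_type, index))


# Usage: write one "hosp" table and read it back from the same folder.
with tempfile.TemporaryDirectory() as tmp:
    out = ModelOutput(tmp, CsvFolderIODriver())
    out.write("hosp", 1, [{"date": "2024-01-01", "incidH": "3"}])
    roundtrip = ModelOutput.from_folder(tmp, CsvFolderIODriver()).read("hosp", 1)
```

Because `path_for` is the only place that knows the layout, a unit test can swap the driver or the root folder without touching the rest of the package.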

We will also need to document the output structure as part of this (relates to GH-229). There has already been some related discussion in GH-198.

I'll leave it to @jcblemai to comment on priority and fill in other details I have missed here, but I think this covers the main points.

twallema commented 3 months ago

Hi Timothy, could you read through #253 on using xarray as the primary simulation output and see if it's complementary to this issue?

TimothyWillard commented 3 months ago

Hi @twallema! Funnily enough, reading that issue while working on another one is what spurred this thought. When discussing this issue with @jcblemai I even mentioned the possibility of other IO formats (hence the folder driver idea, or whatever we end up calling it). This setup would make it easier to use CSV files for everything when working with sample/testing configs, or potentially to use netCDF for xarray objects in the future.

The first pass focus will be on centralizing the directory structure logic though.

pearsonca commented 1 month ago

I have run into what I think is a related problem: the outputs aren't just nested with their corresponding configuration files.

So currently can: get config + know infrastructure => find output files (assuming a bunch of defaults). Alternatively, get output file + know infrastructure => find config file.

For my particular use, seems possible to infer configuration entries etc from the data, but all of this is a bit painful / fragile long term. Basically, want something like a single entry point object (for users / tools), which then knows how to inflate the concepts of interest independent of the underlying representation - the tools should be able to easily discover which representation is present (csv vs arrow vs database vs ...) and abstract that for the user.
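As a rough illustration of the "discover which representation is present" step, the sketch below scans a run folder and reports which file formats it finds. The suffix-to-format mapping and the function name `detect_formats` are hypothetical and not part of flepiMoP.

```python
import tempfile
from pathlib import Path

# Assumed mapping from file suffix to representation name (illustrative only).
KNOWN_FORMATS = {".csv": "csv", ".parquet": "parquet", ".nc": "netcdf"}


def detect_formats(folder):
    """Return the set of known representations found anywhere under `folder`."""
    return {
        KNOWN_FORMATS[p.suffix.lower()]
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in KNOWN_FORMATS
    }


# Usage: a folder holding a single (empty) Parquet file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hosp").mkdir()
    (Path(tmp) / "hosp" / "000000001.parquet").touch()
    found = detect_formats(tmp)
```

A single entry point object could call something like this once and then pick the matching reader, so tools never branch on file extensions themselves.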

TimothyWillard commented 1 month ago

@pearsonca I do not understand your comment. Could you perhaps provide a concrete example of what you're describing? I don't think configuration files fall under this issue, so that might be better as a separate issue.

pearsonca commented 1 month ago

Sure: let's say I want to plot some outputs from a run.

I'd like to be able to do a somewhat-useful version of that just given the enclosing folder for that run. Given the known folder structure, perfectly fine to descend and grab the relevant results file(s).

But with the file(s) read in, still have to introspect out all the features (e.g. compartments, populations, etc). The alternative would be to grab those from the corresponding configuration file.

So: either have to also provide its location OR attempt to find it based on the output folder location (+some other introspection).

I think the same problem will arise for a hypothetical `ModelOutput` object: it's probably going to want to know about the configuration associated with the output to properly structure itself.

One easy way to solve this might be to write a snapshot of the config file to the output directory?
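A minimal sketch of that idea, assuming the config is a single YAML file with a `.yml` suffix and that the snapshot lives in a hypothetical `config_snapshot` subdirectory of the run folder. Neither the helper names nor the directory name exist in flepiMoP today.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_config(config_path, output_dir, subdir="config_snapshot"):
    """Copy the config next to the outputs and return the copy's path."""
    config_path = Path(config_path)
    target_dir = Path(output_dir) / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(config_path, target_dir / config_path.name))


def find_config(output_dir, subdir="config_snapshot"):
    """Locate the snapshot from the run folder alone, or None if absent."""
    candidates = sorted((Path(output_dir) / subdir).glob("*.yml"))
    return candidates[0] if candidates else None


# Usage: snapshot a toy config, then recover it knowing only the run folder.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "config_sample.yml"
    config.write_text("name: sample\n")
    run_dir = Path(tmp) / "model_output"
    run_dir.mkdir()
    snapshot_config(config, run_dir)
    recovered = find_config(run_dir).read_text()
```

With the snapshot in place, a plotting tool given only the run folder can read compartments and populations from the config instead of inferring them from the data.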

TimothyWillard commented 1 month ago

Ah, I see. That seems slightly larger in scope than what is currently described in this issue, and it involves changing the output structure slightly to add either a copy of the config or a parsed version of it. I'll defer to @jcblemai, but I think changing the output structure is challenging for legacy reasons? I suppose adding a new directory shouldn't be as bad, since it maintains backwards compatibility.