ScottishCovidResponse / SCRCIssueTracking

Central issue tracking repository for all repos in the consortium

Add full logging option to stochastic version of model #687

Closed magicicada closed 4 years ago

magicicada commented 4 years ago

Output from inference runs where multiple series are generated. Output of all runs should be saved to disk, with a view to being able to put these in the data pipeline. Changes to the data registry (i.e. a field that defines a "group" of runs) may be required. Another option would be to concatenate outputs with a run number or seed column.

github-actions[bot] commented 4 years ago

Heads up @magicicada @bobturneruk @aflag @WPettersson @alex-konovalov @may1066 @mrow84 - the "Simple Network Sim" label was applied to this issue.

mrow84 commented 4 years ago

I may have confused the issue in our standup. If you are doing what is effectively a "single run", from the perspective of the data pipeline, that generates multiple realisations of the model, then that should just work as normal, as long as you can work around the memory issues that you referred to.

My confusion was caused by thinking that you were going to track each realisation separately and then try to relate the runs afterwards - it is the latter that is not well modelled in the data registry.

FiodorG commented 4 years ago

> I may have confused the issue in our standup. If you are doing what is effectively a "single run", from the perspective of the data pipeline, that generates multiple realisations of the model, then that should just work as normal as long as you can work around the memory issues that you referred to.
>
> My confusion was caused by thinking that you were going to track each realisation separately, and then try to relate the runs subsequently - it is that that is not well-modelled in the data registry.

Yes, definitely the first case. Even with stochastic mode and multiprocessed runs, we should always be able to remain fully reproducible on the random numbers generated, so it will indeed be a single run from the perspective of the data pipeline. As @magicicada mentions, we can key the runs by child seed id or run id. But bundling all results together produces rather chunky dataframes (they could reach a few GB for large networks).
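A minimal sketch of the keying scheme described above, assuming a hypothetical `run_model` stand-in for one stochastic realisation (the real model's interface will differ): each realisation is tagged with its child seed before everything is concatenated into one dataframe.

```python
import numpy as np
import pandas as pd

def run_model(seed: int) -> pd.DataFrame:
    # Hypothetical stand-in for a single stochastic realisation of the model.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "time": range(3),
        "infected": rng.integers(0, 100, size=3),
    })

# Key each realisation by its child seed so the bundle stays reproducible,
# then concatenate into a single (potentially chunky) dataframe.
seeds = [1, 2, 3]
results = pd.concat(
    [run_model(seed).assign(run_id=seed) for seed in seeds],
    ignore_index=True,
)
```

Because each realisation carries its own `run_id`, the bundle can later be split back into individual runs with a simple `groupby`.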

aflag commented 4 years ago

What would be really nice would be to have all "runs" in the same output file but with different components.

mrow84 commented 4 years ago

> What would be really nice would be to have all "runs" in the same output file but with different components.

That is perfectly fine afaik; the only issue is managing your interaction with the API so that you do all the writes in the same session. We could potentially make a session span more than one process, but it would take some work, so I'd rather not unless strictly necessary.
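The one-file-many-components idea can be illustrated without the pipeline API itself (whose write calls may differ): here a NumPy `.npz` archive plays the role of the single output file, with one named entry per run, all written in one session.

```python
import io
import numpy as np

# Illustrative only: the actual data pipeline API is not shown here.
# One archive, one component per stochastic realisation, keyed by seed.
runs = {
    f"run_{seed}": np.random.default_rng(seed).integers(0, 100, size=5)
    for seed in (1, 2, 3)
}

buffer = io.BytesIO()          # stands in for the on-disk output file
np.savez(buffer, **runs)       # all writes happen in a single session

buffer.seek(0)
archive = np.load(buffer)
components = sorted(archive.files)
```

The same shape - one container, separately named components - is what keeps the bundle a "single run" from the registry's point of view while still letting consumers pull out individual realisations.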