SciCompMod / memilio

Modular spatio-temporal models for epidemic and pandemic simulations
https://scicompmod.github.io/memilio/
Apache License 2.0

Parallelisation of runs in parameter study fails for large number of runs/tmax #1056

Open HenrZu opened 6 days ago

HenrZu commented 6 days ago

Bug description

Parallelisation of the runs in the parameter study fails when the ensemble_results of the individual ranks are all sent at once, because the int value bytes_size overflows. This limit is reached quickly if the flows are saved as well. In my case, I use the Secir model with tmax = 250. The flows alone take 72 MB per run (400 [# counties] * 250 [# days] * 15 [# flows] * 6 [# age groups] * 8 bytes [size of double] = 72,000,000 bytes).
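The size estimate above can be checked with a few lines of C++. This is only a sketch: the helper names are hypothetical and not part of MEmilio, and it assumes a 32-bit int and an 8-byte double, as on common Linux platforms. The number of runs handled per rank is not stated in the issue, so the sketch only computes how many runs of flows would fit into one int-sized byte count.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not MEmilio API): bytes needed to store the flows of one run.
constexpr std::int64_t flow_bytes_per_run(std::int64_t counties, std::int64_t days,
                                          std::int64_t flows, std::int64_t age_groups)
{
    return counties * days * flows * age_groups * static_cast<std::int64_t>(sizeof(double));
}

// True if a payload of this many bytes cannot be described by an int count.
constexpr bool overflows_int_count(std::int64_t bytes)
{
    return bytes > std::numeric_limits<int>::max();
}

// Largest number of runs whose combined flows still fit into an int byte count.
constexpr std::int64_t max_runs_per_message(std::int64_t bytes_per_run)
{
    return std::numeric_limits<int>::max() / bytes_per_run;
}
```

With the dimensions from the issue, `flow_bytes_per_run(400, 250, 15, 6)` gives 72,000,000 bytes. Under the assumptions above, the flows of at most 29 runs fit into a single int-sized message, before the regular results are counted.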

Version

Linux

To reproduce

Save the flows and the results in the result-processing function, then do 150 runs with tmax = 250.

Relevant log output

[sc-030233l:880257] * An error occurred in MPI_Send
[sc-030233l:880257] * reported by process [1905983489,9]
[sc-030233l:880257] * on communicator MPI_COMM_WORLD
[sc-030233l:880257] * MPI_ERR_COUNT: invalid count argument
[sc-030233l:880257] * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sc-030233l:880257] *    and potentially your MPI job)
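
The MPI_ERR_COUNT error is consistent with a byte count that does not fit into the int count argument of MPI_Send. One possible workaround, not necessarily the fix the maintainers would choose, is to split the serialized buffer into slices that each fit into an int. The sketch below only plans the slices and does not call MPI itself; `Chunk` and `plan_chunks` are hypothetical names, not MEmilio or MPI API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper type: one slice of a larger byte buffer.
struct Chunk {
    std::size_t offset; // start of the slice within the buffer
    int count;          // number of bytes in the slice, small enough for an int count
};

// Split total_bytes into consecutive slices of at most max_chunk bytes each.
inline std::vector<Chunk> plan_chunks(std::size_t total_bytes,
                                      int max_chunk = std::numeric_limits<int>::max())
{
    assert(max_chunk > 0);
    std::vector<Chunk> chunks;
    std::size_t offset = 0;
    while (offset < total_bytes) {
        const std::size_t remaining = total_bytes - offset;
        // The result is bounded by max_chunk, so the cast to int is safe.
        const int count = static_cast<int>(
            std::min<std::size_t>(remaining, static_cast<std::size_t>(max_chunk)));
        chunks.push_back({offset, count});
        offset += static_cast<std::size_t>(count);
    }
    return chunks;
}
```

Each planned slice could then be sent with `MPI_Send(buffer + c.offset, c.count, MPI_BYTE, dest, tag, comm)`, with the receiver posting matching receives in the same order. Where an MPI 4.0 implementation is available, the large-count variant `MPI_Send_c`, which takes an `MPI_Count` instead of an int, may be an alternative.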

Add any relevant information, e.g. used compiler, screenshots.

No response

Checklist