ScottishCovidResponse / SCRCIssueTracking

Central issue tracking repository for all repos in the consortium

Fix abort with no error message during inference #646

Open ianhinder opened 4 years ago

ianhinder commented 4 years ago

Chris is having problems with the algorithm silently stopping during long runs. He has tried running it under gdb, so that a backtrace would be printed if it crashed:

nohup mpirun -n 20 gdb --batch --quiet -ex "run" -ex "bt" -ex "quit" --args  ./beepmbp inputfile="examples/infMSOA_noage.toml" nchain=20 nsamp=20000 outputdir="Output" > output.txt&

However, output.txt contains nothing that indicates why it stopped.

For example,

...
 Sample: 12469 / 20000
 Sample: 12470 / 20000
 Sample: 12471 / 20000
 Sample: 12472 / 20000
 Sample: 12473 / 20000
 Sample: 12474 / 20000
 Sample: 12475 / 20000
 Sample: 12476 / 20000
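One possibility is that the process received SIGKILL, for example from the kernel's out-of-memory killer. SIGKILL cannot be caught or handled, so gdb gets no chance to print a backtrace. A minimal sketch, assuming bash, for at least recording *how* the job ended: the hypothetical `report_exit` wrapper (not part of BEEPmbp) prints the exit status and decodes values above 128 as a signal number.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of BEEPmbp): run a command, then print how it
# ended. The shell reports "terminated by signal N" as exit status 128+N, so a
# SIGKILL (for example from the out-of-memory killer) appears as 137.
report_exit() {
  "$@"
  local status=$?
  if [ "$status" -gt 128 ]; then
    local sig=$((status - 128))
    # kill -l maps a signal number to its name, e.g. 9 -> KILL
    echo "terminated by signal $sig ($(kill -l "$sig"))"
  else
    echo "exited with status $status"
  fi
  return "$status"
}
```

For example, put the function and a line such as `report_exit mpirun -n 20 ./beepmbp inputfile="examples/infMSOA_noage.toml" nchain=20 nsamp=20000 outputdir="Output"` in a script (`run.sh` is a hypothetical name) and launch it with `nohup ./run.sh > output.txt 2>&1 &`, so that stderr also lands in output.txt. Whether mpirun passes a rank's signal through as 128+N depends on the MPI implementation, so treat the result as a hint rather than proof.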
github-actions[bot] commented 4 years ago

Heads up @chrispooley @ianhinder @rwj11 - the "BEEPmbp" label was applied to this issue.

ianhinder commented 4 years ago

@chrispooley, is it possible that the process is being killed by the operating system? Can you post the output of "ulimit -a"? That will show whether any runtime limits are imposed on processes. It would also be good to confirm what you said in the email, that it doesn't appear to be running out of memory; writing the memory usage to a file periodically might help with that.
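A minimal sketch of such periodic logging, assuming Linux and bash. The function name `log_memory`, its arguments and the log format are all hypothetical. It reads `/proc` directly, so it needs no extra tools: it saves the watched process's effective limits once (the same information as "ulimit -a", but for that process), then appends a timestamped snapshot of system-wide available memory and each rank's resident set size until the watched process exits.

```bash
#!/usr/bin/env bash
# Hypothetical memory monitor (Linux only; reads /proc, no extra tools needed).
# Usage: log_memory WATCH_PID PROCESS_NAME LOGFILE [INTERVAL_S]
# Records the resource limits of WATCH_PID once, then every INTERVAL_S seconds
# appends a timestamped snapshot until WATCH_PID exits:
#   - system-wide MemAvailable and SwapFree from /proc/meminfo
#   - the resident set size (VmRSS) of every process named PROCESS_NAME
log_memory() {
  local watch_pid=$1 name=$2 logfile=$3 interval=${4:-60}
  local dir comm
  # Effective limits of the watched process (children inherit these by default)
  cat "/proc/$watch_pid/limits" > "$logfile"
  while kill -0 "$watch_pid" 2>/dev/null; do
    {
      echo "=== $(date '+%Y-%m-%d %H:%M:%S')"
      grep -E '^(MemAvailable|SwapFree):' /proc/meminfo
      for dir in /proc/[0-9]*; do
        # /proc/PID/comm holds the executable name (truncated to 15 chars);
        # processes can vanish mid-scan, so read errors are ignored
        if { read -r comm < "$dir/comm"; } 2>/dev/null && [ "$comm" = "$name" ]; then
          echo "pid ${dir#/proc/} $(grep '^VmRSS:' "$dir/status" 2>/dev/null)"
        fi
      done
    } >> "$logfile"
    sleep "$interval"
  done
}
```

For example, after launching the job in the background from an interactive shell, run `log_memory $! beepmbp mem.log 60 &` from that same shell. If MemAvailable falls towards zero just before the samples stop, the out-of-memory killer is a likely suspect; `dmesg | grep -i 'out of memory'` (which may need root) would then usually show the kernel's record of the kill.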