JohannesBuchner / autoemcee

Run MCMC automatically to convergence

Final results of estimation for each process instead of having only one statistical result #3

Closed montechris1 closed 1 year ago

montechris1 commented 1 year ago

Description

During the execution, at the end, the FINAL STATS are printed once per process instead of a single, unique FINAL STATS.

This occurs, for example, with 8 processes under mpirun. Here is the log output, in which FINAL STATS is printed 8 times:

[autoemcee] finding starting points and running initial 100 MCMC steps
[... the same line printed by each of the 8 processes ...]
Initialising ensemble of 100 walkers...
[... the same line printed by each of the 8 processes ...]
Sampling progress : 100%|██████████| 100/100 [06:07<00:00, 3.67s/it]
global sampling for starting point ...
[... the same pair printed by each of the 8 processes, with per-process times between 06:07 and 07:12 ...]
Initialising ensemble of 100 walkers...
Sampling progress : 0%| | 0/100 [00:00<?, ?it/s]Initialising ensemble of 100 walkers...
Sampling progress : 1%| | 1/100 [00:02<04:44, 2.87s/it]Initialising ensemble of 100 walkers...
[... interleaved progress bars and "Initialising ensemble of 100 walkers..." lines from the 8 processes ...]
Sampling progress : 100%|██████████| 100/100 [07:46<00:00, 4.67s/it]
global sampling for starting point ...
[... further rounds of ensemble initialisation, interleaved progress bars and "global sampling for starting point ..." messages from all 8 processes ...]
Sampling progress : 100%|██████████| 100/100 [06:41<00:00, 4.01s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.00196707 1.00135042 1.0016244 1.00179508 1.00148762 1.00048871] (<1.010 is good)
rhat chain diagnostic: [1.00196707 1.00135042 1.0016244 1.00179508 1.00148762 1.00048871] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.183 +- 0.024
Omega_k             -0.0001 +- 0.0058
H0                  68.4 +- 1.2
Phi_0               79744160010 +- 40284130873
d_Phi_0             -10992155 +- 6352511
omega_BD            70515 +- 17406

(40000, 6)

FINAL STATS: Omega_m: 0.15639802639816397 Omega_k: 0.009976093154957207 Omega_r: 0.0001 Omega_BD: 0.8335258804468789 H0: 67.12178862560644 Psi_0: 1.778048376542223 F0: 0.05379493502536032 omega_BD: 77571.26985710289

Sampling progress : 100%|██████████| 100/100 [06:33<00:00, 3.94s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.00352015 1.00098207 1.00353306 1.00162517 1.00149577 1.00149548] (<1.010 is good)
rhat chain diagnostic: [1.00352015 1.00098207 1.00353306 1.00162517 1.00149577 1.00149548] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.184 +- 0.025
Omega_k             0.0001 +- 0.0057
H0                  68.3 +- 1.1
Phi_0               78518970836 +- 40266375457
d_Phi_0             -10870699 +- 6449349
omega_BD            70253 +- 17330

(40000, 6)

FINAL STATS: Omega_m: 0.15961888713916716 Omega_k: -0.0017439295660335412 Omega_r: 0.0001 Omega_BD: 0.8420250424268664 H0: 67.04154582194079 Psi_0: 1.6452424283907559 F0: 0.04673236230011184 omega_BD: 94912.94520157117

Sampling progress : 100%|██████████| 100/100 [06:11<00:00, 3.71s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.00231399 1.00132766 1.00357955 1.00323934 1.00119348 1.00196544] (<1.010 is good)
rhat chain diagnostic: [1.00231399 1.00132766 1.00357955 1.00323934 1.00119348 1.00196544] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.184 +- 0.025
Omega_k             -0.0001 +- 0.0058
H0                  68.3 +- 1.2
Phi_0               78731598046 +- 39945916434
d_Phi_0             -10889968 +- 6437142
omega_BD            69559 +- 17337

(40000, 6)

FINAL STATS: Omega_m: 0.15377912261720894 Omega_k: -0.008967959525578545 Omega_r: 0.0001 Omega_BD: 0.8550888369083696 H0: 67.09728819489996 Psi_0: 1.0252677746640286 F0: 0.050570090391859904 omega_BD: 91669.5186298668

Sampling progress : 100%|██████████| 100/100 [06:54<00:00, 4.14s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.00296294 1.00186939 1.00308325 1.00181591 1.00093337 1.00169235] (<1.010 is good)
rhat chain diagnostic: [1.00296294 1.00186939 1.00308325 1.00181591 1.00093337 1.00169235] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.184 +- 0.025
Omega_k             0.0001 +- 0.0058
H0                  68.3 +- 1.1
Phi_0               79962885094 +- 39894161698
d_Phi_0             -10974496 +- 6517140
omega_BD            70043 +- 17460

(40000, 6)

FINAL STATS: Omega_m: 0.1542922882734884 Omega_k: 0.009448010536412676 Omega_r: 0.0001 Omega_BD: 0.836159701190099 H0: 67.05075560438242 Psi_0: 0.7469678533268315 F0: 0.05318249713731035 omega_BD: 85476.51323950474

Sampling progress : 100%|██████████| 100/100 [06:43<00:00, 4.03s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.00197528 1.00071091 1.00088684 1.00135168 1.00113907 1.00221186] (<1.010 is good)
rhat chain diagnostic: [1.00197528 1.00071091 1.00088684 1.00135168 1.00113907 1.00221186] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.185 +- 0.026
Omega_k             -0.0001 +- 0.0057
H0                  68.3 +- 1.1
Phi_0               78952438875 +- 39977425967
d_Phi_0             -11138925 +- 6404007
omega_BD            69461 +- 17331

(40000, 6)

FINAL STATS: Omega_m: 0.16008718311732176 Omega_k: -0.004770922417308488 Omega_r: 0.0001 Omega_BD: 0.8445837392999868 H0: 67.03221065700268 Psi_0: 0.694703657234769 F0: 0.04613852509662548 omega_BD: 90203.55263342986

Sampling progress : 100%|██████████| 100/100 [06:38<00:00, 3.98s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.00490598 1.00064389 1.00204776 1.00377143 1.00195137 1.00121317] (<1.010 is good)
rhat chain diagnostic: [1.00490598 1.00064389 1.00204776 1.00377143 1.00195137 1.00121317] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.184 +- 0.025
Omega_k             0.0002 +- 0.0058
H0                  68.3 +- 1.1
Phi_0               79473155841 +- 40465133141
d_Phi_0             -11022139 +- 6448596
omega_BD            70293 +- 17352

(40000, 6)

FINAL STATS: Omega_m: 0.1556382353091075 Omega_k: -0.004391224813644421 Omega_r: 0.0001 Omega_BD: 0.8486529895045369 H0: 67.02538342993125 Psi_0: 1.2974283695196227 F0: 0.059922563630883094 omega_BD: 60031.74758036217

Sampling progress : 100%|██████████| 100/100 [06:22<00:00, 3.83s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.0023371 1.00304344 1.00268818 1.00164072 1.00077296 1.00134696] (<1.010 is good)
rhat chain diagnostic: [1.0023371 1.00304344 1.00268818 1.00164072 1.00077296 1.00134696] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.183 +- 0.025
Omega_k             -0.0000 +- 0.0058
H0                  68.4 +- 1.2
Phi_0               79208658197 +- 40233723348
d_Phi_0             -10861721 +- 6421897
omega_BD            69250 +- 17297

(40000, 6)

FINAL STATS: Omega_m: 0.15560991114289502 Omega_k: 0.00561150977046878 Omega_r: 0.0001 Omega_BD: 0.8386785790866362 H0: 67.02120194124274 Psi_0: 1.5676703673587582 F0: 0.050339341500386 omega_BD: 85869.17672688983

Sampling progress : 100%|██████████| 100/100 [06:21<00:00, 3.81s/it]
checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[... the acceptance-rate / autocorrelation pair printed 4 times ...]
[autoemcee] rhat chain diagnostic: [1.00215479 1.0014721 1.00254023 1.00172355 1.001385 1.00166689] (<1.010 is good)
rhat chain diagnostic: [1.00215479 1.0014721 1.00254023 1.00172355 1.001385 1.00166689] (<1.010 is good)
[autoemcee] converged!!!
converged!!!

Omega_m             0.184 +- 0.025
Omega_k             -0.0001 +- 0.0058
H0                  68.4 +- 1.2
Phi_0               79298896729 +- 40061578110
d_Phi_0             -10679283 +- 6398438
omega_BD            69714 +- 17356

(40000, 6)

FINAL STATS: Omega_m: 0.15477634205458007 Omega_k: 0.0034101599390712715 Omega_r: 0.0001 Omega_BD: 0.8417134980063488 H0: 67.05942282344456 Psi_0: 1.751412846488866 F0: 0.05372648650854242 omega_BD: 80885.69300612016

mpirun -np 8 18873.29s user 104.51s system 788% cpu 40:05.43 total

My command for running it is:

$ time mpirun -np 8 python3.9 BD_MCMC_autoemcee.py

I don't know how to get a single, unique FINAL STATS; it is a weird issue.

I am using this mpirun version:

$ mpirun --version
mpirun (Open MPI) 4.1.5

Report bugs to http://www.open-mpi.org/community/help/

Any help is welcome.

montechris1 commented 1 year ago

Finally fixed by updating Intel Python 3.9.

JohannesBuchner commented 1 year ago

If you have any MPI issues, first test your MPI in isolation by running this command:

mpiexec -np 4 python3 -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'

This should give something like:

3 4
1 4
0 4
2 4

The first column will appear in random order. If you get output like the above, your MPI is working. If the last column is 1, your cores are not communicating. If you get an error, fix that first.

montechris1 commented 1 year ago

@JohannesBuchner: my MPI-4.0 installation is correct; I get the expected result with your example above:

$ mpiexec -np 4 python3 -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'
1 4
2 4
3 4
0 4

I have two available MPI launchers:

1) The Intel one, which doesn't seem to support all MPI-4.0 functionality (or only partially):

$ which mpiexec
/opt/intel/oneapi/intelpython/latest/bin/mpiexec

$ which mpirun
/opt/intel/oneapi/intelpython/latest/bin/mpirun

2) And the official Open MPI launcher:

$ which mpirun.openmpi
/usr/bin/mpirun.openmpi

Where does the issue come from? Which of the two MPI launchers should I use?

Regards

JohannesBuchner commented 1 year ago

"FINAL STATS" is not part of autoemcee's code.

Be aware that MPI runs N processes independently, which can communicate. If you want a single output, maybe put an if that checks whether rank == 0.

In autoemcee you can find some such ifs; self.log is a shortcut for this, i.e., whether the process should log because it is the rank 0 process.
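
For illustration, here is a minimal sketch of that idea, following the usage pattern from the autoemcee README; the toy transform/likelihood and the final printout are placeholders for the user's own code, not autoemcee API:

import numpy as np
from mpi4py import MPI
from autoemcee import ReactiveAffineInvariantSampler

def transform(cube):
    # toy prior transform: map the unit cube to [-10, 10] in each parameter
    return cube * 20 - 10

def loglike(params):
    # toy Gaussian likelihood; stands in for the real model
    return -0.5 * np.sum(params**2)

sampler = ReactiveAffineInvariantSampler(["a", "b", "c"], loglike, transform=transform)
sampler.run()

# only one MPI rank prints the summary; the other ranks stay silent
if MPI.COMM_WORLD.Get_rank() == 0:   # equivalently: if sampler.log:
    sampler.print_results()
    # the user's own "FINAL STATS" printout would go here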

montechris1 commented 1 year ago

Hi Johannes,

Thanks for your quick reply. If you could indicate which modifications to make in autoemcee.py so that it waits for all chains to finish once convergence is reached and then displays only the final results gathered on rank 0, I would be grateful, since I do not have the skills to modify this part of the code.

Best regards

On Wed, Aug 30, 2023 at 15:32, Johannes Buchner @.***> wrote:

Closed #3 https://github.com/JohannesBuchner/autoemcee/issues/3 as completed.


JohannesBuchner commented 1 year ago

Put if sampler.log: before your code that outputs:

FINAL STATS:
Omega_m: 0.15477634205458007
Omega_k: 0.0034101599390712715
Omega_r: 0.0001
Omega_BD: 0.8417134980063488
H0: 67.05942282344456
Psi_0: 1.751412846488866
F0: 0.05372648650854242
omega_BD: 80885.69300612016
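
Applied to the user-side script, this would look something like the sketch below; final_stats stands in for whatever dictionary of parameter values the script assembles, and is not part of autoemcee:

# sampler is the ReactiveAffineInvariantSampler instance created earlier;
# sampler.log is True only on the MPI rank that is supposed to do the logging
if sampler.log:
    print("FINAL STATS:",
          " ".join("%s: %s" % (name, value) for name, value in final_stats.items()))
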
montechris1 commented 1 year ago

Hi,

Your modification doesn't solve my issue; see the capture attached:

[screenshot attached: issue_multiple_final]

You can see that the computation of the chains continues even though it indicates that "convergence" is reached.

I don't see where the problem comes from; any help would be great.

Regards

JohannesBuchner commented 1 year ago

this cannot be resolved without seeing your code

montechris1 commented 1 year ago

Hi Johannes,

Attached is the MCMC code that continues to run even after convergence is reached and that displays multiple final estimations. If you could take a look at it, that would be great, so that this issue can be fixed, only one final estimation is produced, and the code stops.

Best regards

On Sat, Sep 2, 2023 at 16:16, Johannes Buchner @.***> wrote:

this cannot be resolved without seeing your code


montechris1 commented 1 year ago

Hi Johannes,

Did you get a chance to look at the code I attached?

Best regards, chris

On Sat, Sep 2, 2023 at 16:16, Johannes Buchner @.***> wrote:

this cannot be resolved without seeing your code


JohannesBuchner commented 1 year ago

It did not come through, see https://github.com/JohannesBuchner/autoemcee/issues/3

dmikushin commented 1 year ago

Hi @montechris1, I came here from the page where you asked for help with MPI code.

Unfortunately, this issue description is very hard to follow. Please describe very clearly what your test case is, what result you get, and what bug/issue you think you are experiencing.

Ideally, you should even take out the domain-specific program for the time being, and discuss only the MPI communication patterns, such as send/recv or scatter/gather.
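
For example, a stripped-down test case could contain nothing but a scatter/gather round trip; the sketch below assumes mpi4py and is saved as, say, mpi_pattern_test.py (a hypothetical file name):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# rank 0 prepares one chunk of work per process
chunks = np.arange(size * 3, dtype=float).reshape(size, 3) if rank == 0 else None

my_chunk = comm.scatter(chunks, root=0)   # each rank receives one row
partial = float(my_chunk.sum())           # some local work
results = comm.gather(partial, root=0)    # collected on rank 0 only

if rank == 0:
    print("gathered partial sums:", results)   # should be printed exactly once

Run it with mpiexec -np 4 python3 mpi_pattern_test.py. If the last line appears more than once, the processes are not sharing a communicator (each one believes it is rank 0), which is the same symptom as the duplicated FINAL STATS above.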

JohannesBuchner commented 1 year ago

I can add that autoemcee only uses gather/bcast of float arrays, and sometimes single integers; see e.g. https://github.com/JohannesBuchner/autoemcee/blob/master/autoemcee.py#L353
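
For readers unfamiliar with that idiom, here is a minimal sketch of gathering per-rank float arrays and broadcasting the combined result back; it is my own illustration of the pattern, not a copy of the linked autoemcee code:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# each rank holds its own chunk of samples (random toy data here)
local_samples = np.random.normal(size=(100, 3))

# collect all chunks on rank 0, stack them, then broadcast the full array
recv = comm.gather(local_samples, root=0)
combined = np.concatenate(recv) if comm.Get_rank() == 0 else None
combined = comm.bcast(combined, root=0)   # every rank now holds the same array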