Closed montechris1 closed 1 year ago
Finally fixed by updating Intel python 3.9.
If you have any MPI issues, test your MPI first in isolation, by running this command:
mpiexec -np 4 python3 -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'
This should give something like:
3 4
1 4
0 4
2 4
The order of the first column is random. If the output looks like the above, your MPI is working. If the last column is 1, your processes are not communicating. If you get an error, fix it first.
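If it helps, the three outcomes described above can be checked mechanically. This is only a sketch: the function name diagnose_mpi_test and its return strings are made up for illustration, and it assumes you captured the "rank size" lines of the test into a string yourself.

```python
def diagnose_mpi_test(output, expected_size):
    """Classify the 'rank size' lines printed by the one-line mpi4py test.

    `output` is the captured text, `expected_size` the value given to -np.
    (Hypothetical helper; not part of mpi4py or autoemcee.)
    """
    pairs = [tuple(int(tok) for tok in line.split())
             for line in output.strip().splitlines() if line.strip()]
    ranks = sorted(rank for rank, _ in pairs)
    sizes = {size for _, size in pairs}
    # Working: every process sees the full size and each rank appears once.
    if sizes == {expected_size} and ranks == list(range(expected_size)):
        return "working"
    # Every process believes it is alone: the processes are not communicating.
    if sizes == {1}:
        return "not communicating"
    return "unexpected"


print(diagnose_mpi_test("3 4\n1 4\n0 4\n2 4\n", 4))
print(diagnose_mpi_test("0 1\n0 1\n0 1\n0 1\n", 4))
```

The second call mimics the failure case where mpiexec starts 4 processes but each one reports a size of 1.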
@JohannesBuchner, my MPI-4.0 installation is correct; I get the expected result with your example above:
mpiexec -np 4 python3 -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'
1 4
2 4
3 4
0 4
I have 2 MPI launchers available:
1) Intel's, which doesn't seem to support all MPI-4.0 functionality (or only partially):
$ which mpiexec
/opt/intel/oneapi/intelpython/latest/bin/mpiexec
$ which mpirun
/opt/intel/oneapi/intelpython/latest/bin/mpirun
2) And the official MPI-4 launcher:
$ which mpirun.openmpi
/usr/bin/mpirun.openmpi
Where does the issue come from? Which of the 2 MPI launchers should I use?
Regards
"FINAL STATS" is not part of autoemcee's code.
Be aware that MPI runs N independent processes, which can communicate with each other. If you want a single output, put in an if that checks that rank == 0.
In autoemcee you can find several such ifs; self.log is a shortcut for this check, i.e., it tells whether the process should log because it is the rank 0 process.
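To illustrate the rank == 0 check, here is a minimal sketch. The helper names format_final_stats and report_final_stats are hypothetical, and the loop only simulates what 4 processes would each do; in a real run under mpiexec, rank would come from MPI.COMM_WORLD.Get_rank() as in the test command above.

```python
def format_final_stats(stats):
    """Build the text block that the user script prints at the end."""
    lines = ["FINAL STATS:"]
    lines += [f"{name}: {value}" for name, value in stats.items()]
    return "\n".join(lines)


def report_final_stats(stats, rank):
    """Print the summary on rank 0 only; return True if it was printed."""
    if rank != 0:
        return False  # every other process stays silent
    print(format_final_stats(stats))
    return True


# Simulation of 4 processes reaching the end of the script.
# Under mpiexec, each process would call this once with its own rank.
stats = {"Omega_m": 0.15477634205458007, "H0": 67.05942282344456}
printed = [report_final_stats(stats, rank) for rank in range(4)]
```

Only the first call prints anything, so the summary appears once no matter how many processes are started.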
Hi Johannes,
Thanks for your quick reply. Could you please tell me which modifications to make in autoemcee.py so that it waits until all chains have finished and converged, and then displays only the final results gathered on rank 0? I would be grateful, since I do not have the skills to modify this part of the code.
Best regards
put if sampler.log:
before your code that outputs
FINAL STATS:
Omega_m: 0.15477634205458007
Omega_k: 0.0034101599390712715
Omega_r: 0.0001
Omega_BD: 0.8417134980063488
H0: 67.05942282344456
Psi_0: 1.751412846488866
F0: 0.05372648650854242
omega_BD: 80885.69300612016
Hi,
your modification doesn't solve my issue; see the attached capture.
You can see that the computation of the chains continues even though "convergence" is reported as reached.
I don't see where the problem comes from; any help would be great.
Regards
this cannot be resolved without seeing your code
Hi Johannes,
attached is the MCMC code that continues to run even after convergence is reached and that displays multiple final estimates. If you could take a look at it, that would be great: I would like to fix this issue so that I get only one final estimate and the code stops.
Best regards
Hi Johannes,
did you get a chance to look at the code I attached?
Best regards, chris
It did not come through, see https://github.com/JohannesBuchner/autoemcee/issues/3
Hi @montechris1 , I came here from the page where you ask for help with MPI code.
Unfortunately, this issue description is very hard to follow. Please try to describe very clearly what your test case is, what the result is, and what bug/issue you are experiencing.
Ideally, you should even take out the domain-specific program for the time being, and discuss only the MPI communication patterns, such as send/recv or scatter/gather.
I can add that autoemcee uses only gather/bcast of float arrays, and sometimes of single integers, e.g. https://github.com/JohannesBuchner/autoemcee/blob/master/autoemcee.py#L353
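For readers unfamiliar with the gather/bcast pattern, here is a rough sketch of its general shape. It is not autoemcee's code. The class SerialComm is a hypothetical one-process stand-in so that the example runs without MPI; in a real run you would pass MPI.COMM_WORLD from mpi4py, whose gather and bcast methods are called the same way.

```python
class SerialComm:
    """Hypothetical one-process stand-in mimicking the mpi4py calls used below."""

    def Get_rank(self):
        return 0

    def gather(self, sendobj, root=0):
        # With a single process, the root receives a list with one entry.
        return [sendobj]

    def bcast(self, obj, root=0):
        return obj


def pooled_mean(comm, local_values):
    """Gather per-process floats on rank 0, reduce there, share the result."""
    gathered = comm.gather(local_values, root=0)  # list on root, None elsewhere
    if comm.Get_rank() == 0:
        pooled = [v for chunk in gathered for v in chunk]
        result = sum(pooled) / len(pooled)
    else:
        result = None
    # After the broadcast, every process holds the same number.
    return comm.bcast(result, root=0)


comm = SerialComm()
mean = pooled_mean(comm, [1.0, 2.0, 3.0])
if comm.Get_rank() == 0:
    print("pooled mean:", mean)
```

Because every process receives the same broadcast value, each one takes the same decision afterwards, and only the final print needs the rank 0 guard.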
Description
At the end of the execution, FINAL STATS is printed once per process instead of only once.
This occurs, for example, with 8 processes under mpirun. Here is the log output, in which FINAL STATS is printed 8 times:
[autoemcee] finding starting points and running initial 100 MCMC steps
[autoemcee] finding starting points and running initial 100 MCMC steps
[autoemcee] finding starting points and running initial 100 MCMC steps
[autoemcee] finding starting points and running initial 100 MCMC steps
[autoemcee] finding starting points and running initial 100 MCMC steps
[autoemcee] finding starting points and running initial 100 MCMC steps
[autoemcee] finding starting points and running initial 100 MCMC steps
[autoemcee] finding starting points and running initial 100 MCMC steps
Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Sampling progress : 100%|██████████| 100/100 [06:07<00:00, 3.67s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:12<00:00, 3.72s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:13<00:00, 3.74s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:16<00:00, 3.77s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:18<00:00, 3.79s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:25<00:00, 3.86s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:57<00:00, 4.17s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [07:12<00:00, 4.32s/it] global sampling for starting point ...
Initialising ensemble of 100 walkers...
Sampling progress : 0%| | 0/100 [00:00<?, ?it/s]Initialising ensemble of 100 walkers...
Initialising ensemble of 100 walkers...
Sampling progress : 0%| | 0/100 [00:00<?, ?it/s]Initialising ensemble of 100 walkers...
Sampling progress : 1%| | 1/100 [00:02<04:44, 2.87s/it]Initialising ensemble of 100 walkers...
Sampling progress : 2%|▏ | 2/100 [00:07<06:07, 3.75s/it]Initialising ensemble of 100 walkers...
Sampling progress : 9%|▉ | 9/100 [00:37<06:37, 4.37s/it]Initialising ensemble of 100 walkers...
Sampling progress : 12%|█▏ | 12/100 [00:51<06:27, 4.40s/it]Initialising ensemble of 100 walkers...
Sampling progress : 100%|██████████| 100/100 [07:46<00:00, 4.67s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [07:48<00:00, 4.68s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [08:02<00:00, 4.83s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [07:59<00:00, 4.80s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [08:16<00:00, 4.97s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [08:29<00:00, 5.09s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [07:49<00:00, 4.69s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [08:01<00:00, 4.82s/it] global sampling for starting point ...
Initialising ensemble of 100 walkers...
Sampling progress : 0%| | 0/100 [00:00<?, ?it/s]Initialising ensemble of 100 walkers...
Sampling progress : 2%|▏ | 2/100 [00:06<05:30, 3.37s/it]Initialising ensemble of 100 walkers...
Sampling progress : 4%|▍ | 4/100 [00:14<05:52, 3.67s/it]Initialising ensemble of 100 walkers...
Sampling progress : 0%| | 0/100 [00:00<?, ?it/s]Initialising ensemble of 100 walkers...
Sampling progress : 4%|▍ | 4/100 [00:13<05:28, 3.42s/it]Initialising ensemble of 100 walkers...
Sampling progress : 7%|▋ | 7/100 [00:24<05:47, 3.73s/it]Initialising ensemble of 100 walkers...
Sampling progress : 12%|█▏ | 12/100 [00:43<05:23, 3.68s/it]Initialising ensemble of 100 walkers...
Sampling progress : 100%|██████████| 100/100 [06:20<00:00, 3.81s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:06<00:00, 3.67s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:20<00:00, 3.81s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:49<00:00, 4.10s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:22<00:00, 3.83s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:57<00:00, 4.17s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:32<00:00, 3.93s/it] global sampling for starting point ...
Sampling progress : 100%|██████████| 100/100 [06:37<00:00, 3.97s/it] global sampling for starting point ...
Initialising ensemble of 100 walkers...
Sampling progress : 0%| | 0/100 [00:00<?, ?it/s]Initialising ensemble of 100 walkers...
Sampling progress : 4%|▍ | 4/100 [00:15<06:25, 4.01s/it]Initialising ensemble of 100 walkers...
Sampling progress : 1%| | 1/100 [00:02<04:38, 2.82s/it]Initialising ensemble of 100 walkers...
Sampling progress : 2%|▏ | 2/100 [00:06<05:44, 3.51s/it]Initialising ensemble of 100 walkers...
Sampling progress : 6%|▌ | 6/100 [00:22<05:55, 3.79s/it]Initialising ensemble of 100 walkers...
Sampling progress : 7%|▋ | 7/100 [00:25<05:44, 3.71s/it]Initialising ensemble of 100 walkers...
Sampling progress : 9%|▉ | 9/100 [00:32<05:38, 3.72s/it]Initialising ensemble of 100 walkers...
Sampling progress : 100%|██████████| 100/100 [06:41<00:00, 4.01s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.00196707 1.00135042 1.0016244 1.00179508 1.00148762 1.00048871] (<1.010 is good)
rhat chain diagnostic: [1.00196707 1.00135042 1.0016244 1.00179508 1.00148762 1.00048871] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.15639802639816397
Omega_k: 0.009976093154957207
Omega_r: 0.0001
Omega_BD: 0.8335258804468789
H0: 67.12178862560644
Psi_0: 1.778048376542223
F0: 0.05379493502536032
omega_BD: 77571.26985710289
Sampling progress : 100%|██████████| 100/100 [06:33<00:00, 3.94s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.00352015 1.00098207 1.00353306 1.00162517 1.00149577 1.00149548] (<1.010 is good)
rhat chain diagnostic: [1.00352015 1.00098207 1.00353306 1.00162517 1.00149577 1.00149548] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.15961888713916716
Omega_k: -0.0017439295660335412
Omega_r: 0.0001
Omega_BD: 0.8420250424268664
H0: 67.04154582194079
Psi_0: 1.6452424283907559
F0: 0.04673236230011184
omega_BD: 94912.94520157117
Sampling progress : 100%|██████████| 100/100 [06:11<00:00, 3.71s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.00231399 1.00132766 1.00357955 1.00323934 1.00119348 1.00196544] (<1.010 is good)
rhat chain diagnostic: [1.00231399 1.00132766 1.00357955 1.00323934 1.00119348 1.00196544] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.15377912261720894
Omega_k: -0.008967959525578545
Omega_r: 0.0001
Omega_BD: 0.8550888369083696
H0: 67.09728819489996
Psi_0: 1.0252677746640286
F0: 0.050570090391859904
omega_BD: 91669.5186298668
Sampling progress : 100%|██████████| 100/100 [06:54<00:00, 4.14s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.00296294 1.00186939 1.00308325 1.00181591 1.00093337 1.00169235] (<1.010 is good)
rhat chain diagnostic: [1.00296294 1.00186939 1.00308325 1.00181591 1.00093337 1.00169235] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.1542922882734884
Omega_k: 0.009448010536412676
Omega_r: 0.0001
Omega_BD: 0.836159701190099
H0: 67.05075560438242
Psi_0: 0.7469678533268315
F0: 0.05318249713731035
omega_BD: 85476.51323950474
Sampling progress : 100%|██████████| 100/100 [06:43<00:00, 4.03s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.00197528 1.00071091 1.00088684 1.00135168 1.00113907 1.00221186] (<1.010 is good)
rhat chain diagnostic: [1.00197528 1.00071091 1.00088684 1.00135168 1.00113907 1.00221186] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.16008718311732176
Omega_k: -0.004770922417308488
Omega_r: 0.0001
Omega_BD: 0.8445837392999868
H0: 67.03221065700268
Psi_0: 0.694703657234769
F0: 0.04613852509662548
omega_BD: 90203.55263342986
Sampling progress : 100%|██████████| 100/100 [06:38<00:00, 3.98s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.00490598 1.00064389 1.00204776 1.00377143 1.00195137 1.00121317] (<1.010 is good)
rhat chain diagnostic: [1.00490598 1.00064389 1.00204776 1.00377143 1.00195137 1.00121317] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.1556382353091075
Omega_k: -0.004391224813644421
Omega_r: 0.0001
Omega_BD: 0.8486529895045369
H0: 67.02538342993125
Psi_0: 1.2974283695196227
F0: 0.059922563630883094
omega_BD: 60031.74758036217
Sampling progress : 100%|██████████| 100/100 [06:22<00:00, 3.83s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.0023371 1.00304344 1.00268818 1.00164072 1.00077296 1.00134696] (<1.010 is good)
rhat chain diagnostic: [1.0023371 1.00304344 1.00268818 1.00164072 1.00077296 1.00134696] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.15560991114289502
Omega_k: 0.00561150977046878
Omega_r: 0.0001
Omega_BD: 0.8386785790866362
H0: 67.02120194124274
Psi_0: 1.5676703673587582
F0: 0.050339341500386
omega_BD: 85869.17672688983
Sampling progress : 100%|██████████| 100/100 [06:21<00:00, 3.81s/it] checking convergence (iteration 1) ...
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
acceptance rates: [100 100 100 100 100 100 100 100]% (worst few)
autocorrelation length: tau=inf -> 0x lengths
[autoemcee] rhat chain diagnostic: [1.00215479 1.0014721 1.00254023 1.00172355 1.001385 1.00166689] (<1.010 is good)
rhat chain diagnostic: [1.00215479 1.0014721 1.00254023 1.00172355 1.001385 1.00166689] (<1.010 is good)
[autoemcee] converged!!!
converged!!!
(40000, 6)
FINAL STATS:
Omega_m: 0.15477634205458007
Omega_k: 0.0034101599390712715
Omega_r: 0.0001
Omega_BD: 0.8417134980063488
H0: 67.05942282344456
Psi_0: 1.751412846488866
F0: 0.05372648650854242
omega_BD: 80885.69300612016
mpirun -np 8 18873.29s user 104.51s system 788% cpu 40:05.43 total
The command I run is:
$ time mpirun -np 8 python3.9 BD_MCMC_autoemcee.py
I don't know how to get a single FINAL STATS; it is a strange issue.
I am using this mpirun version:
$ mpirun --version
mpirun (Open MPI) 4.1.5
Report bugs to http://www.open-mpi.org/community/help/
Any help is welcome.