brinckmann / montepython_public

Public repository for the Monte Python Code

Issue with restarting chains with MPI #304

Open andrejobuljen opened 1 year ago

andrejobuljen commented 1 year ago

Hi,

Sorry if I missed something, but I really tried going through all previous issues concerning restarting chains with MPI, and I couldn’t resolve the issue I’m having. I think the closest issues I noticed are this one and #189. I am using Monte Python v3.5.0 with CLASS v2.6.3.

I first do this: python montepython/MontePython.py run -p input/my.param -o chains/mpirun/ -f 0

Then I run an initial MPI run doing: mpirun -np 8 python montepython/MontePython.py -p input/my.param -o chains/mpirun/ -N 10

Then I try restarting doing: mpirun -np 8 python montepython/MontePython.py run -p input/my.param -o chains/mpirun/ -r chains/mpirun/2022-12-15_10__1.txt -N 50

What happens is that the new chains are created and they finish, however my job script fails at the end. I notice that the initial chain (… __1.txt) is removed. When I have a look at the restarted chains, it seems like they were not restarted from their original chains, but from some other value, and I'm not even sure what that is. I'm really confused about what is happening, so I was wondering if you could please help out?

I am attaching below the job output file.

Thanks and best, Andrej

mpirun_5780.txt

brinckmann commented 1 year ago

Hi Andrej,

There's definitely something odd going on here. We've had problems on some clusters where the file number iteration (the 1, 2, etc.) wasn't handled properly, and initially I thought some part of that was happening here, but now I'm fairly convinced it's just bugs in the code. It seems to be correctly creating files named chains/mpirun/2022-12-15_60__1.txt through chains/mpirun/2022-12-15_60__8.txt, as seen from e.g. `Creating chains/mpirun/2022-12-15_60__5.txt`

but then it claims to be restarting only from chains/mpirun/2022-12-15_10__1.txt, as seen from `/!\ Restarting from chains/mpirun/2022-12-15_10__1.txt. Using associated log.param`. However, I think this message is not actually correct: the code doesn't keep track of what it's really copying from and just parrots back what was passed with -r instead of printing the right information, and that would be wrong for MPI, where we only pass the first file [bug, note to self: line 1077 of montepython/parser_mp.py]. Could you check what's in the new files and whether it matches what's in the old files, just to make sure? E.g., do the initial lines of the new __1 file match the old __1 file, and the new __8 the old __8? The old file should have been copied to the start of the new one.
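If it helps, a quick check along these lines should do it (just a sketch; the date stamps and step-count prefixes are taken from your log and may need adjusting):

    import os

    # Sketch: verify that each new chain starts with a copy of the corresponding
    # old chain. The prefixes below follow the file names in the attached log.
    old_prefix = "chains/mpirun/2022-12-15_10__"
    new_prefix = "chains/mpirun/2022-12-15_60__"

    for i in range(1, 9):
        old_name = old_prefix + str(i) + ".txt"
        new_name = new_prefix + str(i) + ".txt"
        if not os.path.exists(old_name):
            # e.g. the old __1 file that has already been deleted
            print("chain %d: old file missing, skipping" % i)
            continue
        with open(old_name) as f:
            old_lines = f.readlines()
        with open(new_name) as f:
            new_lines = f.readlines()
        # The old chain should have been copied to the start of the new one
        match = new_lines[:len(old_lines)] == old_lines
        print("chain %d: initial lines %s" % (i, "match" if match else "do NOT match"))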

Finally, at the end it tries to delete chains/mpirun/2022-12-15_10__1.txt over and over again, which results in an error after the first time (since the file is then already gone), as seen from `deleting starting point of the chain chains/mpirun/2022-12-15_10__1.txt` and the subsequent errors `FileNotFoundError: [Errno 2] No such file or directory: 'chains/mpirun/2022-12-15_10__1.txt'`. This is another bug: line 878 of montepython/mcmc.py removes command_line.restart, but again, that is the original file input with __1 and not the new filename.

This last bug is easy to fix, after line 303 of montepython/io_mp.py you should add

        # This new filename needs to be passed back to the code, so the correct file is removed later
        command_line.restart = restartname
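To illustrate why that one line matters, the flow is roughly the following (a sketch only, not the actual MontePython code; apart from `command_line.restart` and `restartname`, the names are illustrative):

    import os
    import shutil

    def copy_old_chain(command_line, restartname, new_chain):
        # In io_mp.py (around line 303), the old chain belonging to this MPI
        # process, restartname, is copied to the start of the new chain file.
        shutil.copy(restartname, new_chain)
        # The fix: pass the per-process filename back to the code, so the
        # correct file is removed at the end of the run.
        command_line.restart = restartname

    def remove_starting_point(command_line):
        # At the end of the run (cf. the removal around line 878 of
        # montepython/mcmc.py), the file named by command_line.restart is
        # deleted. Without the fix above, every MPI process still holds the
        # __1 name passed to -r, so all but the first deletion fail with
        # FileNotFoundError.
        if command_line.restart is not None:
            print('deleting starting point of the chain', command_line.restart)
            os.remove(command_line.restart)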

I actually want to change the behavior so that it removes the old files once it's done copying them into the new files (instead of at the end of the run), but I never got around to implementing this, and usually I delete them by hand once I'm sure the copying is done, or after some hours once I get around to it. It's very rare that my runs hit the limit on -N; usually it's the wallclock limit stopping a run, or me aborting manually because I can see the chains are converged, and in both of those cases the old files are not removed.
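Roughly what I have in mind is this (an illustrative sketch, not implemented in the code):

    import os
    import shutil

    def copy_and_clean_up(old_chain, new_chain):
        # Proposed behavior: copy the old chain to the start of the new chain
        # file, then delete the old file right away instead of waiting for the
        # end of the run (which never comes if the run is killed by the
        # wallclock limit or aborted by hand).
        shutil.copyfile(old_chain, new_chain)
        os.remove(old_chain)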

The first bug doesn't change the functionality of the code, as it's just printing the wrong information, so we can safely ignore it for now; I'm not sure off the top of my head how to fix it.

Let me know if this seems to solve your problems. It should fix the errors, but you also mentioned it seems to be starting from the wrong point, so get back to me about that if it still seems to be working incorrectly. I'm sorry about the bugs!

Best, Thejs

andrejobuljen commented 1 year ago

Hi Thejs,

Thank you very much for your quick reply! I just had a look at the files, and the initial lines do match the original files (for chains 2-8; the first one is gone now).

I'll definitely try to implement the fix you suggested to resolve the errors.

Concerning the starting point of the restarted chains, what I find odd (perhaps wrongly) is that in general the first step after restarting has a drastically larger -lnLkl, i.e. it's as if it starts from some place very far away. For example, in one of the chains the initial run's -lnLkl went from ~1e4 to ~1e2 in a few steps, while after the restart it goes back up to ~1e3. Of course here we're only talking about ~10s of steps, but I noticed this behaviour when running longer chains (N=1e5) as well. Maybe I'm missing something here...

Thanks again and best, Andrej

brinckmann commented 1 year ago

Hi Andrej,

Okay, I tested this, and once the io_mp.py bugfix is applied I get consistent behavior where the starting point is the last entry of the previous chain, for all chains. I checked this at the step proposal level, to make sure it really started from the correct point. Without the bugfix, the starting point was the last point of the first chain, i.e. the __1 chain, which might have led you to see some unexpected behavior beyond what random steps alone could explain. Probably not significant enough to bias the results of an MCMC run, but a bug nevertheless.

If a bestfit file is passed, this behavior is overridden: in that case it starts from the bestfit. You're not doing that; I'm just leaving a comment about it in case it's relevant for someone else.

Best, Thejs

andrejobuljen commented 1 year ago

Hi Thejs,

Thanks, I applied your bug fix and my (test) script doesn't fail anymore. I'm attaching the job output file below. It still says `/!\ Restarting from chains/mpirun/2022-12-16_160__1.txt. Using associated log.param`, but I'm not sure whether that's correct, as you mentioned earlier. Also, I'm not completely sure this fixed the starting points relative to the previous chains, as I still see drastic lnLkl changes after the restart; perhaps I should check this with a larger number of steps. If you have a quick tip on how to check this more robustly (like you did at the proposal level), please let me know.

Best, Andrej

mpirun_r_6399.txt

brinckmann commented 1 year ago

Hi Andrej,

After line 115 of montepython/mcmc.py you can print vector (I printed each element of vector, to avoid getting rounded numbers) and compare it to the last entry of the chain you're restarting from. For the very first step these should be the same; this is the starting point for any jump. For fast sampling (the default), line 162 should hopefully assure you that you're taking a step from this point, based on the random numbers generated in the preceding part and the Cholesky decomposition of the covmat.

Lines 465-466 are where the function you were looking inside is first called, and where the Cholesky decomposition is passed in; it was computed on line 353 from the covmat, which was loaded on line 297 using a function you can find in montepython/sampler.py. Line 455 calls the function that reads the last step of the chains to define data.mcmc_parameters[elem]['last_accepted'], which is used on line 11 to define vector (I also checked inside that function, which can be found in montepython/sampler.py, to make sure it was loading correctly).
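To be concrete, the kind of temporary debug print I mean is something like this (just a sketch to paste in and remove afterwards; vector is the variable in mcmc.py, the formatting is up to you):

    # Temporary debug print: place after line 115 of montepython/mcmc.py, where
    # vector holds the starting point of the jump. Printing each element
    # separately at high precision avoids the rounding you get when printing
    # the whole array at once. Compare against the last line of the chain you
    # are restarting from.
    for i, elem in enumerate(vector):
        print("vector[%d] = %.12e" % (i, elem))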

I hope this helps.

Best, Thejs