baudren / montepython_public

Public repository for the Monte Python Code

Infinite number of empty chain files ... #104

Open mmarianav opened 6 years ago

mmarianav commented 6 years ago

I have been using montepython on a local cluster for running RSD analyses, writing my own likelihoods, etc. Recently I moved to a new cluster and, while trying to install everything again, I ran into a very weird problem: the code does not show any error, but it generates an infinite number of empty chain files. I am not sure where to look for the error; could you give me some advice?

bufeo commented 6 years ago

Did you try specifying the option --chain-number? I had a similar problem some time ago and that fixed it for me.

carlosggarcia commented 6 years ago

Same problem here. The cause is in montepython/io_mp.py in the following block:

    while trying:
        data.out = open(os.path.join(
            command_line.folder, outname_base)+str(suffix)+'.txt', 'w')
        try:
            lock(data.out, fcntl.LOCK_EX | fcntl.LOCK_NB)
            trying = False
        except LockError:
            suffix += 1

On my system, the LockError message is (38, 'Function not implemented'). File locking must be disabled on the cluster filesystem (it is mounted as type lustre (rw,lazystatfs)), so the lock call always fails, the suffix keeps increasing, and empty chain files are created forever. Using the --chain-number option seems to work.
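
As a quick check of whether a given filesystem is affected before launching a long run, the failing lock call can be reproduced directly. This is just a sketch (not part of MontePython), and 'chains' is a placeholder for your output directory:

    import fcntl
    import os
    import tempfile

    def flock_supported(directory):
        """Try to take an exclusive, non-blocking flock() on a temporary file."""
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except (IOError, OSError) as err:
            # errno 38 is ENOSYS ('Function not implemented'), which is what the
            # Lustre mount above returns and what sends io_mp.py into its loop
            print('flock failed: errno %s (%s)' % (err.errno, err.strerror))
            return False
        finally:
            os.close(fd)
            os.remove(path)

    print(flock_supported('chains'))  # point this at your chains directory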

bhorowitz commented 6 years ago

If one wants to run multiple chains at once with MPI, how can one do so when fixing this problem by specifying the chain number? I'm trying the following:

for n in {1..5}; do python montepython/MontePython.py run -o test2 -p example.param -N 5000 --chain-number=$n; done

But it seems to just run them sequentially? I'm on NERSC (Cori or Edison), in case anyone has experience getting this to work there...

edit: it seems that, at least on Edison, this can be circumvented by using lockf instead of flock.
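
For reference, at the fcntl level the lockf-based workaround from the edit above amounts to something like the sketch below. The actual lock() helper in montepython/io_mp.py wraps this in its own LockError handling, so this is only an illustration of the swapped call, not a patch:

    import fcntl

    def lock_with_lockf(file_handle, flags=fcntl.LOCK_EX | fcntl.LOCK_NB):
        # fcntl.lockf() uses POSIX record locking, which some Lustre/NFS mounts
        # support even when BSD-style flock() is not implemented
        fcntl.lockf(file_handle, flags)

    def unlock_with_lockf(file_handle):
        fcntl.lockf(file_handle, fcntl.LOCK_UN)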

brinckmann commented 6 years ago

Hi Ben,

To use MPI you should launch MontePython like you would launch other codes with MPI, e.g.:

mpirun -np 8 python montepython/MontePython.py run -o test2 -p example.param -N 5000

The code will then create a number of chains corresponding to the number of MPI processes.

The --chain-number flag is just used for custom numbering of the chains and is mostly useful, as in the fix above, for circumventing cases where the disk and the Python code aren't communicating properly, which results in an infinite number of chains being created.

However, as I suspect you're doing now, it is also possible to run parallel chains without MPI. For MontePython the difference is minimal and mostly amounts to a small boost in efficiency, in some cases, when automatically updating the covariance matrix with the --update flag.

Best, Thejs
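
As a concrete illustration of the non-MPI route mentioned above, something along these lines launches several independent chains in parallel, each with its own chain number (a rough sketch with placeholder paths and settings, not an official recipe):

    import subprocess

    processes = []
    for n in range(1, 5):  # four chains, numbered 1 to 4
        cmd = ['python', 'montepython/MontePython.py', 'run',
               '-o', 'chains/test', '-p', 'input/example.param',
               '-N', '5000', '--chain-number', str(n)]
        processes.append(subprocess.Popen(cmd))

    # wait for all chains to finish
    for proc in processes:
        proc.wait()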

noller commented 6 years ago

I also encountered the "infinite empty chain file" problem when running with(!) mpirun on a cluster. When setting things up for a new output folder and param file and starting an initial run with mpirun, the code would get stuck in a loop creating these empty chain files. The following setup fixed that problem for me, assuming one wants to run MontePython for an experiment/parameters as specified in some PARAM.param:

1) Choose a name FOLDERNAME that does not correspond to any folder in the chains subdirectory yet.

2) Run a minimal chain on the cluster from the terminal, without submitting to any cluster queue, e.g. by going into the montepython folder in your directory on the cluster and running:

python montepython/MontePython.py run -N 10 -p PARAM.param -o chains/FOLDERNAME

That process only has 10 steps, so it should finish quickly and set up the folder correctly. The number of steps is not important, just that it is small. This should run without problems and produce some output.

3) Now submit the proper run to the queue on the cluster, with the same output directory.

With that setup, everything went through fine for me. This may be down to using the log.param now instead of the original .param file, or to the fact that starting an mpirun with some chain(s) already present in the output directory somehow gets around the infinite empty chain generation. In any case, this solved the problem at least on my cluster, so hopefully it helps.
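
For what it is worth, the two steps above can also be scripted. The sketch below uses the same placeholder names (PARAM.param, chains/FOLDERNAME) plus a hypothetical sbatch job script, so adapt it to your own cluster and queueing system:

    import subprocess

    param = 'PARAM.param'         # your parameter file
    folder = 'chains/FOLDERNAME'  # a folder that does not exist yet

    # step 2: short interactive run that creates the folder and its log.param
    subprocess.check_call(['python', 'montepython/MontePython.py', 'run',
                           '-N', '10', '-p', param, '-o', folder])

    # step 3: submit the full run to the queue, pointing at the same folder
    # ('submit_run.sh' is a placeholder for your own job script)
    subprocess.check_call(['sbatch', 'submit_run.sh'])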

brinckmann commented 6 years ago

Thanks for pointing this out! I agree, it is always good practice to first create the folder on a cluster before launching a run (e.g. with -f 0, which doubles as a consistency check). Best, Thejs

mmarianav commented 6 years ago

Thanks for all the answers and sorry for taking longer to check.

mmarianav commented 6 years ago

@bhorowitz , @ardok-m, @bufeo:

In fact I am using NERSC (Edison and Cori) and I am running with MPI as @brinckmann mentioned:

mpirun -np 8 python montepython/MontePython.py run -o test2 -p example.param -N 5000

It is in this configuration that I found the problem. I assumed that the flag should be appended at the end of this command, but honestly I am not quite sure how to set it; could you be more specific?

mpirun -np 8 python montepython/MontePython.py run -o test2 -p example.param -N 5000 --chain-number=

brinckmann commented 6 years ago

To use --chain-number with MPI you pass it as e.g.

mpirun -np 4 python montepython/MontePython.py run -p input/example.param -o chains/test --chain-number 1

which will launch chains with suffixes 1, 2, 3, 4 (i.e. starting from 1 and adding 1 for each MPI process, up to -np, which in this case is 4).

Best, Thejs