PolyChord / PolyChordLite

Public version of PolyChord: See polychord.co.uk for PolyChordPro
https://polychord.io/

Bug (with potential fix) in pypolychord's output.py #20

Closed fkoehlin closed 5 years ago

fkoehlin commented 5 years ago

Hi all,

At the end of an mpirun of pypolychord, one of the slave runs aborted with a ValueError when executing line 94 of output.py (this may have been just the final write-out step and completely unrelated to the mpirun itself). The try-except statement didn't catch it because it checks only for a NameError:

self.avnlikeslice = float(line.split()[3])

This line seems to end up trying to convert an opening-bracket string ( into a float, which fails. Moving the index up by one seems to solve the issue (I assume the code is trying to read a number enclosed in brackets):

self.avnlikeslice = float(line.split()[4])
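
The failure described above can be reproduced in isolation. The following is a minimal sketch, not the code in output.py: the sample line is rebuilt from the token list reported further down in this thread (its exact spacing is an assumption), and read_field is a hypothetical helper introduced only for illustration. It shows that a handler for NameError does not intercept the ValueError raised by float("("), and how a broader handler would.

```python
# Sample line rebuilt from the tokens reported in this thread (spacing assumed).
LINE = "<nlike>: 0.00 0.00 ( 0.00 0.00 per slice )"


def read_field(line, index):
    """Hypothetical helper: convert one whitespace-separated field to float.

    Returns None when the field is missing or is not a number, instead of
    letting the exception propagate.
    """
    try:
        return float(line.split()[index])
    except (ValueError, IndexError):
        return None


# An 'except NameError' clause does not intercept the ValueError
# raised by float("("), so the error reaches the outer handler.
try:
    try:
        float(LINE.split()[3])
        caught = "no error"
    except NameError:
        caught = "NameError"
except ValueError:
    caught = "ValueError escaped the NameError handler"

print(caught)
print(read_field(LINE, 3))  # field 3 is "(", so this prints None
print(read_field(LINE, 4))  # field 4 is "0.00", so this prints 0.0
```

Returning None hides the problem rather than fixing the index, so a helper like this is only a guard against a crash at write-out, not a substitute for reading the right field.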

Cheers,

Fabian

williamjameshandley commented 5 years ago

Hi @fkoehlin. Many thanks for this.

Is this something that occurs on every run, or just occasionally? Does your suggested fix work in the normal cases, or just the broken ones? If the former, then please feel free to submit a small pull request with this change.

fkoehlin commented 5 years ago

Hi Will,

Thanks for your reply. Turns out that all 11 of the mpiruns I set up over the weekend 'failed' (I think the runs themselves are fine and it's just some hiccup at the final write-out?!) with the error message copied below, and it seems my fix didn't work for the 10 slightly slower mpiruns that finished today. (The error message was exactly the same for the first run, except that it quoted self.avnlikeslice = float(line.split()[3]).) The traceback down to the ValueError is repeated for each mpirun slave, so I've copied only the final repetition:

Traceback (most recent call last):
  File "/home/fkoehlin/soft/montepython_public_v3/montepython/MontePython.py", line 38, in <module>
    sys.exit(mpi_run())
  File "/.../montepython_public_v3/montepython/run.py", line 141, in mpi_run
    sampler.run(cosmo, data, command_line)
  File "/.../montepython_public_v3/montepython/sampler.py", line 51, in run
    pc.run(cosmo, data, command_line)
  File "/.../montepython_public_v3/montepython/PolyChord.py", line 376, in run
    polychord_run(loglike, nDims, nDerived, settings, prior)
  File "/home/user/.local/lib/python2.7/site-packages/pypolychord-1.16-py2.7-linux-x86_64.egg/pypolychord/__init__.py", line 227, in run_polychord
    return PolyChordOutput(settings.base_dir, settings.file_root)
  File "/home/user/.local/lib/python2.7/site-packages/pypolychord-1.16-py2.7-linux-x86_64.egg/pypolychord/output.py", line 94, in __init__
    self.avnlikeslice = float(line.split()[4])
ValueError: could not convert string to float: (
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39046,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Could this be related to an outdated package somewhere?! I find it peculiar to get the same error message despite 'fixing' the list index for these runs, a change that solved the issue for the first run...

UPDATE: Maybe my fix hadn't yet been applied to the subsequent mpiruns because of some local copies of the code and my modifying output.py during the run (?!), but just re-starting the runs (with output.py including my bugfix) solves the issue and writes everything out without any more error messages (as was the case for the first run). The output of the offending line.split() is in my case ['<nlike>:', '0.00', '0.00', '(', '0.00', '0.00', 'per', 'slice', ')'], so indeed line.split()[3] returns the string (, whereas line.split()[4] returns the string 0.00, which converts to a float (not sure if that's the value intended, though...).
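
Since the number of values on this line appears to vary between runs, a fixed index is fragile. The following is a rough sketch of a position-independent alternative, assuming the values before the opening bracket are totals and the values inside the brackets are per-slice averages. That reading of the line, the function names, and the single-value example line are assumptions for illustration, not PolyChord's documented format or the fix that was eventually applied.

```python
def _floats(text):
    """Return every whitespace-separated token in text that parses as a float."""
    values = []
    for token in text.split():
        try:
            values.append(float(token))
        except ValueError:
            # Skip labels such as "<nlike>:", "per", "slice" and ")".
            pass
    return values


def parse_nlike_line(line):
    """Hypothetical parser: split the line at "(" and collect floats on each side.

    Returns (values_before_bracket, values_inside_bracket).
    """
    before, _, inside = line.partition("(")
    return _floats(before), _floats(inside)


# Line rebuilt from the token list above (spacing assumed).
two_values = "<nlike>: 0.00 0.00 ( 0.00 0.00 per slice )"
# Assumed shape of the same line when only one value is reported.
one_value = "<nlike>: 1.50 ( 0.75 per slice )"

print(parse_nlike_line(two_values))
print(parse_nlike_line(one_value))
```

Because the split happens at the bracket rather than at a token position, the same function handles both line shapes without changing an index.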

Cheers,

Fabian

williamjameshandley commented 5 years ago

The primary questions are: a) Why is this MPI-dependent? b) Is this behaviour reproducible on a fresh run?

Could you provide the content of the .stats file for that run?

williamjameshandley commented 5 years ago

OK, I think that I've now fixed this. It seems like the issue was occurring when multiple speeds were in use. If you pull down the latest changes to master things should work.