Closed fkoehlin closed 5 years ago
Hi @fkoehlin. Many thanks for this.
Is this something that occurs on every run, or just occasionally? Does your suggested fix work in the normal cases, or just the broken ones? If the former, then please feel free to submit a small pull request with this change.
Hi Will,
Thanks for your reply.
Turns out that all 11 of the mpiruns I set up over the weekend 'failed' with the error message copied below (I think the runs themselves are fine, it's just some hiccup at the final write-out?!). It also seems like my fix didn't work for the 10 slightly slower mpiruns that just finished today... (the error message was exactly the same for the first run, except that it quoted self.avnlikeslice = float(line.split()[3])).
The traceback part down to the ValueError is repeated for each mpirun slave and I've just copied the final repetition:
Traceback (most recent call last):
  File "/home/fkoehlin/soft/montepython_public_v3/montepython/MontePython.py", line 38, in <module>
    sys.exit(mpi_run())
  File "/.../montepython_public_v3/montepython/run.py", line 141, in mpi_run
    sampler.run(cosmo, data, command_line)
  File "/.../montepython_public_v3/montepython/sampler.py", line 51, in run
    pc.run(cosmo, data, command_line)
  File "/.../montepython_public_v3/montepython/PolyChord.py", line 376, in run
    polychord_run(loglike, nDims, nDerived, settings, prior)
  File "/home/user/.local/lib/python2.7/site-packages/pypolychord-1.16-py2.7-linux-x86_64.egg/pypolychord/__init__.py", line 227, in run_polychord
    return PolyChordOutput(settings.base_dir, settings.file_root)
  File "/home/user/.local/lib/python2.7/site-packages/pypolychord-1.16-py2.7-linux-x86_64.egg/pypolychord/output.py", line 94, in __init__
    self.avnlikeslice = float(line.split()[4])
ValueError: could not convert string to float: (
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[39046,1],0]
Exit code: 1
--------------------------------------------------------------------------
Could this be related to an outdated package somewhere?! I find it peculiar to get the same error message for these runs despite 'fixing' the list index, which did solve the issue for the first run...
UPDATE:
Maybe my fix didn't apply yet to the subsequent mpiruns due to some local copies of the code and modifying output.py during the run (?!), but just re-starting the runs (with output.py including my bugfix) solves the issue and writes everything out without any more error messages (as was the case for the first run).
The output of the offending line.split() is in my case
['<nlike>:', '0.00', '0.00', '(', '0.00', '0.00', 'per', 'slice', ')'],
so indeed line.split()[3] will return the string (, whereas line.split()[4] returns the string 0.00, which does convert to a float (not sure if that's the one intended though...).
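For anyone wanting to check this without a full run, here is a minimal sketch using only the token list quoted above. The helper name try_float is hypothetical and not part of pypolychord; it just reports whether a token converts.

```python
# Tokens exactly as reported above for the offending line.
tokens = ['<nlike>:', '0.00', '0.00', '(', '0.00', '0.00', 'per', 'slice', ')']


def try_float(token):
    """Return float(token), or None if the token is not numeric.

    Hypothetical helper for illustration only.
    """
    try:
        return float(token)
    except ValueError:
        return None


index3 = try_float(tokens[3])  # tokens[3] is '(' and cannot be converted
index4 = try_float(tokens[4])  # tokens[4] is '0.00' and converts fine

print("tokens[3] -> %r, tokens[4] -> %r" % (index3, index4))
```

This only shows which index is numeric for this particular line; it does not tell us which of the numbers the original code meant to read.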
Cheers,
Fabian
The primary questions are: a) Why is this MPI-dependent? b) Is this behaviour reproducible on a fresh run?
Could you provide the content of the .stats file for that run?
OK, I think that I've now fixed this. It seems the issue occurred when multiple speeds were in use. If you pull down the latest changes to master, things should work.
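To illustrate why a fixed index could break with multiple speeds, here is a rough sketch that locates the brackets instead of counting tokens. It is not the actual change made on master. The function parse_nlike_line is hypothetical, the two-value line is rebuilt from the token list reported in this thread, and the single-value layout is purely an assumption about how the line might look with one speed.

```python
def parse_nlike_line(line):
    """Return (values before the bracket, numeric values inside the bracket).

    Hypothetical helper, not part of pypolychord.
    """
    tokens = line.split()
    start = tokens.index('(')
    end = tokens.index(')')
    # Everything between the '<nlike>:' label and '(' is assumed numeric.
    before = [float(t) for t in tokens[1:start]]
    inside = []
    for t in tokens[start + 1:end]:
        try:
            inside.append(float(t))
        except ValueError:
            pass  # skip words such as 'per' and 'slice'
    return before, inside


# Line rebuilt from the tokens reported above (two values on each side).
reported = ['<nlike>:', '0.00', '0.00', '(', '0.00', '0.00', 'per', 'slice', ')']
before_two, inside_two = parse_nlike_line(' '.join(reported))

# ASSUMED single-value layout, for comparison only.
before_one, inside_one = parse_nlike_line('<nlike>: 0.00 ( 0.00 per slice )')

print(before_two, inside_two)
print(before_one, inside_one)
```

Under that assumed single-value layout, index 3 would land on the number inside the brackets, which would fit with the error only appearing when more than one speed is used.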
Hi all,
At the end of an mpirun of pypolychord one of the slave runs aborted (maybe it was just the final write-out step and is completely unrelated to the mpirun...) due to a ValueError when trying to execute line 94 in output.py (and the try-except statement didn't catch it because it's checking only for a NameError):
self.avnlikeslice = float(line.split()[3])
This seemed to result in trying to convert an opening bracket string ( into a float and hence failed. Moving the index up by one seems to solve the issue (as I assumed it's trying to read a number embedded in brackets):
self.avnlikeslice = float(line.split()[4])
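A small standalone sketch of the try-except point, for reference. The function convert is hypothetical and does not mirror the real code in output.py; it only shows that a handler for NameError lets a ValueError through.

```python
def convert(token, caught):
    """Try float(token), handling only the exception types given in `caught`.

    Hypothetical helper for illustration only.
    """
    try:
        return float(token)
    except caught:
        return None


# A handler that only lists NameError does not stop the ValueError.
try:
    convert('(', NameError)
    escaped = False
except ValueError:
    escaped = True

# Listing ValueError as well keeps the failure contained.
handled = convert('(', (NameError, ValueError))

print("ValueError escaped: %s, handled result: %r" % (escaped, handled))
```

Catching the ValueError would only hide the symptom, of course; the index would still need to point at the right token.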
Cheers,
Fabian