BradGreig / Hybrid21CM


Multiprocessing Issues on Ubuntu #18

Closed steven-murray closed 5 years ago

steven-murray commented 5 years ago

Investigating issues that @catherinewatkinson has been having with multiprocessing.

Using commit 2070c212778d9f1ba441fc7861db1068e1ca64db, under Arch [Python=3.6, GCC=8.2, Linux Kernel=4.19] and reportedly macOS (@BradGreig can verify the exact parameters), the following script runs through without a hitch:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np

from py21cmmc.mcmc import analyse
from py21cmmc import mcmc

print('initialising core')
core = mcmc.CoreCoevalModule(
    redshift=[7, 8, 9],
    user_params=dict(HII_DIM=50, BOX_LEN=125.0),
    regenerate=False,
)

datafiles = ["data/simple_mcmc_data_%s.npz"%z for z in core.redshift]
print('initialising likelihood')

likelihood = mcmc.Likelihood1DPowerCoeval(
    datafile=datafiles, noisefile=None, logk=False,
    min_k=0.1, max_k=1.0, simulate=True,
)

model_name = "SimpleTest"
print('setting off mcmc chain')

chain = mcmc.run_mcmc(
    core, likelihood, datadir='data', model_name=model_name,
    params=dict(
        HII_EFF_FACTOR=[30.0, 10.0, 50.0, 3.0],
        ION_Tvir_MIN=[4.7, 2, 8, 0.1],
    ),
    walkersRatio=2, burninIterations=0, sampleIterations=10,
    threadCount=2, continue_sampling=False,
)

However, under Ubuntu [v16.04, Python=3.7.0, GCC=5.4.0, Linux Kernel=4.15], it prints the following:

initialising core
initialising likelihood
setting off mcmc chain
global_xH = 7.426495e-01
ave Tb = 1.766356e+01
global_xH = 5.333566e-01
ave Tb = 1.095911e+01
global_xH = 2.330818e-01
ave Tb = 3.908440e+00
INFO:cosmoHammer:Using CosmoHammer 0.6.1
INFO:cosmoHammer:Using emcee 2.2.1
INFO:cosmoHammer:all burnin iterations already completed
INFO:cosmoHammer:Sampler: <class 'py21cmmc.mcmc.cosmoHammer.CosmoHammerSampler.CosmoHammerSampler'>
configuration: 
  Params: [30.   4.7]
  Burnin iterations: 0
  Samples iterations: 10
  Walkers ratio: 2
  Reusing burn in: True
  init pos generator: SampleBallPositionGenerator
  stop criteria: IterationStopCriteriaStrategy
  storage util: <py21cmmc.mcmc.cosmoHammer.storage.HDFStorageUtil object at 0x7fca91694eb8>
likelihoodComputationChain: 
Core Modules: 
  CoreCoevalModule
Likelihood Modules: 
  Likelihood1DPowerCoeval

INFO:cosmoHammer:start sampling after burn in
global_xH = 7.769698e-01
ave Tb = 1.876334e+01
global_xH = 7.545230e-01
ave Tb = 1.804122e+01
global_xH = 5.782137e-01
ave Tb = 1.211104e+01
global_xH = 5.505636e-01
ave Tb = 1.139492e+01
global_xH = 2.715125e-01
ave Tb = 4.665063e+00
global_xH = 2.498615e-01
ave Tb = 4.228604e+00
global_xH = 7.035281e-01
ave Tb = 1.647911e+01
global_xH = 7.942268e-01
ave Tb = 1.931295e+01
global_xH = 4.767080e-01
ave Tb = 9.584994e+00
global_xH = 6.233207e-01
ave Tb = 1.323970e+01
global_xH = 1.820869e-01
ave Tb = 2.963232e+00
global_xH = 3.428975e-01
ave Tb = 6.030039e+00
global_xH = 7.808089e-01
ave Tb = 1.888300e+01
global_xH = 7.715399e-01
ave Tb = 1.857948e+01
global_xH = 5.901023e-01
ave Tb = 1.239779e+01
global_xH = 5.819135e-01
ave Tb = 1.217400e+01
global_xH = 2.889247e-01
ave Tb = 4.991927e+00
global_xH = 2.877289e-01
ave Tb = 4.948007e+00
global_xH = 6.930866e-01
ave Tb = 1.616706e+01
global_xH = 8.097851e-01
ave Tb = 1.983140e+01
global_xH = 4.619398e-01
ave Tb = 9.235161e+00
global_xH = 6.522067e-01
ave Tb = 1.400839e+01
global_xH = 1.697895e-01
ave Tb = 2.742611e+00
global_xH = 3.858452e-01
ave Tb = 6.897256e+00
global_xH = 7.753115e-01
ave Tb = 1.871265e+01
global_xH = 7.739930e-01
ave Tb = 1.865880e+01
global_xH = 5.731614e-01
ave Tb = 1.199083e+01
global_xH = 5.864297e-01
ave Tb = 1.228809e+01
global_xH = 2.641690e-01
ave Tb = 4.529539e+00
global_xH = 2.934669e-01
ave Tb = 5.057684e+00
global_xH = 7.199152e-01
ave Tb = 1.697748e+01
global_xH = 7.904674e-01
ave Tb = 1.919020e+01
global_xH = 4.969881e-01
ave Tb = 1.008020e+01
global_xH = 6.141037e-01
ave Tb = 1.300531e+01
global_xH = 1.969990e-01
ave Tb = 3.242450e+00
global_xH = 3.271320e-01
ave Tb = 5.722654e+00
global_xH = 8.351841e-01
ave Tb = 2.071251e+01
global_xH = 7.727503e-01
ave Tb = 1.861785e+01
global_xH = 6.636423e-01
ave Tb = 1.441798e+01
global_xH = 5.801713e-01
ave Tb = 1.213919e+01
global_xH = 3.651847e-01
ave Tb = 6.596194e+00
global_xH = 2.829795e-01
ave Tb = 4.863751e+00
global_xH = 6.915270e-01
ave Tb = 1.611861e+01
global_xH = 7.832422e-01
ave Tb = 1.896149e+01
global_xH = 4.628306e-01
ave Tb = 9.247712e+00
global_xH = 5.961384e-01
ave Tb = 1.255191e+01
global_xH = 1.727361e-01
ave Tb = 2.790330e+00
global_xH = 2.994680e-01
ave Tb = 5.189877e+00
global_xH = 7.786353e-01
ave Tb = 1.881429e+01
global_xH = 7.483967e-01
ave Tb = 1.784517e+01
global_xH = 5.833143e-01
ave Tb = 1.223257e+01
global_xH = 5.407947e-01
ave Tb = 1.114985e+01
global_xH = 2.782767e-01
ave Tb = 4.791835e+00
global_xH = 2.396872e-01
ave Tb = 4.035534e+00
global_xH = 7.155115e-01
ave Tb = 1.684228e+01
global_xH = 7.679721e-01
ave Tb = 1.846377e+01
global_xH = 4.900319e-01
ave Tb = 9.913624e+00
global_xH = 5.711830e-01
ave Tb = 1.191339e+01
global_xH = 1.907794e-01
ave Tb = 3.128807e+00
global_xH = 2.716604e-01
ave Tb = 4.646045e+00
global_xH = 7.663427e-01
ave Tb = 1.842255e+01
global_xH = 7.493701e-01
ave Tb = 1.787595e+01
global_xH = 5.597976e-01
ave Tb = 1.164534e+01
global_xH = 5.423385e-01
ave Tb = 1.118895e+01
global_xH = 2.506189e-01
ave Tb = 4.264934e+00
global_xH = 2.411210e-01
ave Tb = 4.062866e+00
global_xH = 7.074982e-01
ave Tb = 1.660130e+01
global_xH = 7.667012e-01
ave Tb = 1.842872e+01
global_xH = 4.771960e-01
ave Tb = 9.608149e+00
global_xH = 5.661420e-01
ave Tb = 1.179611e+01
global_xH = 1.794857e-01
ave Tb = 2.924613e+00
global_xH = 2.638571e-01
ave Tb = 4.502406e+00
global_xH = 7.512004e-01
global_xH = 7.418423e-01
ave Tb = 1.763850e+01
ave Tb = 1.793591e+01
global_xH = 5.368532e-01
global_xH = 5.298493e-01
ave Tb = 1.087753e+01
ave Tb = 1.107044e+01
global_xH = 2.282684e-01
global_xH = 2.301930e-01
ave Tb = 3.868554e+00
ave Tb = 3.820558e+00
global_xH = 7.269898e-01
ave Tb = 1.718739e+01
global_xH = 7.576179e-01
ave Tb = 1.813685e+01
global_xH = 5.072139e-01
ave Tb = 1.032613e+01
global_xH = 5.526000e-01
ave Tb = 1.145173e+01
global_xH = 2.067074e-01
ave Tb = 3.420273e+00
global_xH = 2.506416e-01
ave Tb = 4.245888e+00

and then hangs indefinitely. The process does not seem to be consuming resources when viewed in top. The log file output is:

2018-12-06 10:31:36,821 INFO:Using CosmoHammer 0.6.1
2018-12-06 10:31:36,821 INFO:Using emcee 2.2.1
2018-12-06 10:31:36,867 INFO:all burnin iterations already completed
2018-12-06 10:31:36,868 INFO:Sampler: <class 'py21cmmc.mcmc.cosmoHammer.CosmoHammerSampler.CosmoHammerSampler'>
configuration: 
  Params: [30.   4.7]
  Burnin iterations: 0
  Samples iterations: 10
  Walkers ratio: 2
  Reusing burn in: True
  init pos generator: SampleBallPositionGenerator
  stop criteria: IterationStopCriteriaStrategy
  storage util: <py21cmmc.mcmc.cosmoHammer.storage.HDFStorageUtil object at 0x7fca91694eb8>
likelihoodComputationChain: 
Core Modules: 
  CoreCoevalModule
Likelihood Modules: 
  Likelihood1DPowerCoeval

2018-12-06 10:31:36,870 INFO:start sampling after burn in

Ubuntu is a clean install (virtual machine), with FFTW installed via sudo apt install libfftw3-3 libfftw3-dev libfftw3-single3 and GSL via sudo apt install gsl-bin libgsl0-dev. Python comes from the latest Anaconda (3.7), with a new clean environment using python 3.7. Numpy, scipy, matplotlib and astropy are installed with conda, and all other dependencies with pip. 21CMMC is installed from the source directory with pip install -e .

Strangely, installing with pip install . seems to not install correctly.

steven-murray commented 5 years ago

Running with threadCount=1 runs through fine.

steven-murray commented 5 years ago

Oddly, using exactly the same setup but Python=3.6.6 resolved the issue.

A conda list in each environment shows that each has the same version of every package installed (except for the version of python for which they are compiled). Thus it seems to be a bug in python 3.7.

I guess the next thing to do would be to look at exactly where it's hanging in the code.

BradGreig commented 5 years ago

Given it appears to be a Python 3.7 issue, did you try creating a conda environment with Python 3.7 on your machine (where you usually work)? That would help confirm that it is a Python 3.7 issue at the very least.

steven-murray commented 5 years ago

Excellent idea, will do.

steven-murray commented 5 years ago

Of course this would happen... clean py3.7 env on Arch works perfectly fine :unamused:

BradGreig commented 5 years ago

Bugger!

ghost commented 5 years ago

So to add to the fun: pip install . works fine for me (although I have to upgrade pip first). Interestingly, I have been experiencing the hanging problem on python 3.6.5, not 3.7; my problems are also alleviated by running with python 3.6.6.

steven-murray commented 5 years ago

Hmm. The behaviour actually looks exactly like what happens when Python (multi)processes exit without returning (i.e. a segfault or similar). I've been having a look around and there doesn't seem to be any good way to counteract this from Python. That is, Python doesn't seem able to "catch" a segfault and continue doing stuff, at least not when the segfault happens in a sub-process.

That means that we'll have to write absolutely "correct" C code, in the sense that it should never segfault. I will attempt to run the code as it stands through valgrind to see if such a segfault is occurring. I'll also create another issue devoted to creating a standard way to raise errors in C code that can be caught in Python.
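To illustrate the point (a standalone sketch, not 21CMMC code, and POSIX-only since it delivers a real SIGSEGV): a child process killed by a signal never returns a result, and the parent only finds out after the fact via a negative exit code.

import multiprocessing as mp
import os
import signal


def crash():
    # Simulate a segfaulting worker by delivering SIGSEGV to itself.
    os.kill(os.getpid(), signal.SIGSEGV)


if __name__ == "__main__":
    p = mp.Process(target=crash)
    p.start()
    p.join()
    # The child died from a signal, so no Python exception was ever raised;
    # all the parent can see afterwards is a negative exitcode (-11 here).
    print("child exitcode:", p.exitcode)

A multiprocessing pool that is waiting on a result from a worker that died this way never receives one, so the parent can simply wait forever -- consistent with the hang described above.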

steven-murray commented 5 years ago

I was able to reproduce what looks like this issue (it may not be exactly the same) by artificially adding the following as the first clause inside ComputeIonizedBox:

    if (astro_params->HII_EFF_FACTOR > 30){
        *(int*)0 = 0;  // deliberate NULL-pointer write to force a segfault
    }

This almost certainly forces a segfault. Doing so means that the initial data will not segfault (because it has HII_EFF_FACTOR=30), but somewhere along the line one of the walkers will get a segfault.

The result for me was that the whole program hung. I was still able to cancel successfully with CTRL+C.

BradGreig commented 5 years ago

@caw11 can you try running your example with ION_Tvir_MIN = [4.7, 4, 6, 0.1] instead of what you currently have? The range [4,6] is the range we have used in previous publications and thus should not throw up any segfaults. This should verify whether it is a segfault causing the hang or something else.

@steven-murray if the correct parameter ranges are passed to the mcmc sampler (i.e. the flat priors used in any previous publication) this should result in it being free of segfaults. Segfaults are likely only to occur when larger parameter ranges are chosen.

steven-murray commented 5 years ago

In commit ff16ee20969fdaed83f116e94cdfe4db38223fd5, the above will cause the program to crash rather than hang, which I think is much more desirable. It cannot report why it crashed, but it is at least a little less mystifying to the user, who can then open an issue from the error.

@BradGreig, that would be my fault for setting those ranges in the doc example (will fix now). Nevertheless, we need to be able to catch cases where bad parameter ranges/combinations are used and either return meaningful info to the user or silently continue the MCMC with lnl of -inf. It is kind of hard to know as a user that a range of (4,6) will work, but (3,7) won't.
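To make the second option concrete, here is a rough sketch of the kind of guard being described (the function and argument names are illustrative, not the actual 21CMMC likelihood API):

import numpy as np

def safe_log_likelihood(params, run_simulation, data):
    """Return -inf instead of propagating a failure from the simulation."""
    try:
        model = run_simulation(params)   # the fragile call down into the C code
    except Exception:
        return -np.inf                   # drop this proposal; keep the chain alive
    # Placeholder Gaussian term; the real likelihood compares 21cm power spectra.
    return -0.5 * float(np.sum((model - data) ** 2))

Of course, this only helps once the C code raises a catchable Python exception rather than segfaulting, which is exactly what the separate error-raising issue mentioned above is for.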

BradGreig commented 5 years ago

No problems, it has resulted in interesting behaviour at the very least, which is the main thing. I ran it on my end for (2,8) and it still worked perfectly fine. But maybe it's only right on the edges, which weren't sampled in my tests.

I agree that there should be some documentation discussing sensible/viable parameter choices.

catherinewatkinson commented 5 years ago

@BradGreig it is still hanging when I run with the reduced range you suggest. Given that it is pretty easy to amble into parameter space that is not meaningful, I think the behaviour should be made to proceed as if the sample has moved outside the prior range. I assume that this is connected to an unphysical regime somehow? Could you possibly clarify what you mean by "bad parameter ranges/combinations"?

steven-murray commented 5 years ago

@caw11 Oh that's weird. When parameters outside the range are chosen, the behaviour is to immediately return -inf, without ever calling the C code, so I doubt it is bad parameter combinations that are causing it, if you are restricting the range. Can you run again with the latest commit on steven-develop? Also, if possible, can you run it under valgrind? To do this, look at https://github.com/BradGreig/Hybrid21CM/blob/develop-steven/docs/notes_for_developers.rst.

Let me know if you can't, or don't have time to, do the latter. I can do it, I just have to use a different OS so I have to plan for it :-).
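For reference, the immediate -inf rejection mentioned above amounts to something like the following (hypothetical names; the real logic lives inside the cosmoHammer machinery):

import numpy as np

# Flat prior bounds, matching the [start, low, high, width] spec passed to run_mcmc.
PRIOR_RANGES = {
    "HII_EFF_FACTOR": (10.0, 50.0),
    "ION_Tvir_MIN": (4.0, 6.0),
}

def log_posterior(proposal, log_likelihood):
    """Reject out-of-range proposals before any C code is invoked."""
    for name, (lo, hi) in PRIOR_RANGES.items():
        if not (lo <= proposal[name] <= hi):
            return -np.inf   # immediate rejection; the simulation never runs
    return log_likelihood(proposal)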

catherinewatkinson commented 5 years ago

@steven-murray the latest commit is slower (but that might be connected to what other people are running on the workstation) and is still hanging. I'm afraid I don't really have time for mucking about with Valgrind etc right now, happy to keep running different versions on my system though, so just let me know when you need this.

steven-murray commented 5 years ago

@catherinewatkinson interesting that you are getting a slow-down. I am also getting a slow down, and I've noticed that our tests have increased in total time to run by a factor of almost 2 in the last few commits. I'll have a look into that.

More pertinently, I just ran the latest commit on my Ubuntu installation (described in opening post above), and got no hangs or crashes. I tried it with 2 threads and 3 threads. I tried it on py3.7 and py3.6.5, run twice on each version.

One thing that I may have gotten wrong on the first run above was that I constructed my virtual machine with only a single core, then tried running it with more than 1 process. This time I created the VM with 4 cores. I don't think this is the problem though.

I will revert to a previous commit and re-run the example to see if I can get the hang again.

steven-murray commented 5 years ago

This is getting really frustrating. I haven't been able to get anything to either hang or crash in all my tests today. I've tried combinations of all of the following:

Python: 3.7, 3.6.5, 3.6.6
VM: 1 core w/ 2GB RAM, 1 core w/ 8GB RAM, 4 cores w/ 8GB RAM
Threads: 1, 2, 4, 8
py21cmmc versions: 2070c212778d9f1ba441fc7861db1068e1ca64db (original version from first post), e713c5124f89aa478aee5b3d83dcb1034b1f20f2, e42d5b4bdef6c7b5788a3b538b9f3c0b8a681c9c (currently latest)

Besides significant slow-downs with the latest commit, none of them behave differently. No crashes or hangs. Oh I've also tried running on two different host systems. For most combinations, I've run two attempts, just to increase the chances of getting a hang if it's a random thing.

It's frustrating because I'd like to be able to work through this myself, but I can't really do much if I can't get it to reproduce the problem.

@catherinewatkinson is it hanging every time for you (with python=3.6.5)? The only two things I can think of are to print out the parameters on every iteration (before diving into C), and to run under valgrind. The latter you don't want to do unless it's hanging every time, because it will take a while. On the other hand, I know you don't want to muck about with it, but if it is hanging every time, I'd be very grateful if you could just do the following, hopefully not taking too much of your own time:

$ sudo apt install valgrind
$ cd <Hybrid21CM directory>
$ pip uninstall py21cmmc
$ rm -rf build
$ DEBUG=1 pip install -e .
$ cd <directory with your example in it>
$ valgrind --tool=memcheck --track-origins=yes --suppressions=<Hybrid21CM>/devel/valgrind-suppress-all-but-c.supp python <your example script> > valgrind.output

Then send the valgrind.output file through!

steven-murray commented 5 years ago

Belay that. Don't do the valgrind thing. I'm running it now and I think it can be honed a bit.

steven-murray commented 5 years ago

BTW, I have now fixed the slowdown issue (so definitely use the newest commit if you can).

catherinewatkinson commented 5 years ago

@steven-murray happy to do that for you, very little overhead for me (I just didn't want to have to muck around trying to work out how to install and use Valgrind etc). Let me know what you want me to run and I will do it :)

steven-murray commented 5 years ago

OK, so if it crashes every time for you (on a given version), please try the above set of commands, except instead of the last one, do this:

PYTHONMALLOC=malloc valgrind --tool=memcheck --track-origins=yes --suppressions=<Hybrid21CM>/devel/valgrind-suppress-all-but-c.supp python <your example script> > valgrind.output

I am not sure that this will give us an answer (a lot of the info from valgrind comes from when the program exits for some reason, which yours is not doing...), but it's worth trying. It will run really slow, beware :-).

steven-murray commented 5 years ago

If that doesn't work, I'll try making a super-duper-debug branch for you which just prints everything as it goes, to see if we can't home in on where things are going wrong.

catherinewatkinson commented 5 years ago

Many apologies, I did this on the day you messaged, but didn't commit the comment.

valgrind_to_screen_interupted_after_extended_hang.txt

I added a couple of flags to stop it suppressing errors and to trace memory leaks... See the attached. valgrind.output simply contained:

initialising core
initialising likelihood
setting off mcmc chain
global_xH = 7.438030e-01
ave Tb = 1.767293e+01
global_xH = 3.818097e-01
ave Tb = 1.011786e+01
global_xH = 0.000000e+00
ave Tb = 0.000000e+00

It doesn't output the chain's global_xH values.

When I kill it after reaching the hanging stage it prints out some info about the memory heap. There is too much of it to copy over, but the summary is:

==12452==
==12452== HEAP SUMMARY:
==12452==     in use at exit: 16,789,121 bytes in 97,901 blocks
==12452==   total heap usage: 2,272,189 allocs, 2,174,288 frees, 672,789,177 bytes allocated
==12452==

steven-murray commented 5 years ago

Thanks @catherinewatkinson that's really really helpful. Brad and I will have a look through and try to identify the various warnings/errors it gives. To help us out with exact line numbers, can you point us to the exact git hash of the version you're using? Or the commit tag of the code you downloaded?

BradGreig commented 5 years ago

The vast majority of the C-based errors seem to be arising from printing ave_Tb and global_xH to screen. I made this a simple option that can be turned on and off in commit e34d73c7775b87ad0f63ff6e6bd1548b8c3a4715.

This is basically an uninitialised-variable issue, despite the fact that these variables are initialised. The other C errors are of a similar type, where a variable is reported as not initialised.

Not sure why these could cause the hang, but I'll look into cleaning them up.

BradGreig commented 5 years ago

As in, you can turn them off with OUTPUT_AVE=False in the flag_options struct.

steven-murray commented 5 years ago

Cool, yeah it looked a bit like that to me. It is only causing hangs when multi-threading, so perhaps it's some kind of race condition when trying to write to screen. Usually it just garbles it, so it's a bit weird. @catherinewatkinson if you use the very latest commit, and set flag_options = {"OUTPUT_AVE":False} in the core, do you still get the hang? If so, can you run exactly the same valgrind run with the latest version?
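Applied to the original script at the top of this issue, that would look roughly like this (assuming flag_options is passed straight through to the C flag_options struct as described above):

from py21cmmc import mcmc

core = mcmc.CoreCoevalModule(
    redshift=[7, 8, 9],
    user_params=dict(HII_DIM=50, BOX_LEN=125.0),
    flag_options={"OUTPUT_AVE": False},   # silence the ave_Tb / global_xH prints
    regenerate=False,
)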

catherinewatkinson commented 5 years ago

Still hangs I'm afraid, on both systems. Running the following script:

two_params_dot_py.txt

Valgrind returns the following output: valgrind_Hermes_OUTPUT_AVE_False_19oct2018.txt

steven-murray commented 5 years ago

Hey @catherinewatkinson , thanks for that. Did you do "rm -rf build" in the main directory, and then "pip install -e ." again? It seems like the valgrind output is very very similar, if not exactly the same, as the previous one, so I just want to be sure it's actually using the newest C code.

BradGreig commented 5 years ago

It looks like the new one to me. It has fewer "uninitialised value" errors as the printf statement isn't called (which means it's the new version).

I think my latest commit (58e83f6bb7bef74339dc933494ba93dc9837e17a) removes the remainder of those issues. That being said, I do not suspect that'll actually fix the issue.

Nevertheless, could you try that latest commit through valgrind?

I have been running valgrind on my machine (laptop) and I am unable to replicate this problem either.

catherinewatkinson commented 5 years ago

valgrind_Hermes_Brad_branch_20oct2018.txt

Still hanging

steven-murray commented 5 years ago

Goodness this is frustrating. There are no warnings or errors from 21CMMC in that output! The only warnings come from other libraries, and they don't seem to occur around the place that the code is hanging anyway. Furthermore, we know none of the threads is crashing, otherwise Python would pick it up and exit. I don't know any other way of testing what's going on here... :-(.

Maybe the best bet is to go back and put in print statements to try to locate where the issue is. If I could run it on my system and get the hang this process would be a lot easier, but how about we try it anyway. I'll send you some code soon.

catherinewatkinson commented 5 years ago

Sure, not a problem at all, would be good to get to the bottom of what is going on.

steven-murray commented 5 years ago

OK, there's now a branch called "super-debug". Grab that, recompile as follows:

$ rm -rf build
$ LOG_LEVEL=3 pip install .

and run your script (not under valgrind). It'll probably print out a lot of stuff, so pipe it into a file :-). Hopefully this'll catch it.

catherinewatkinson commented 5 years ago

Apologies for the delay. Here is the output. I piped it to a file with > super_debug_out_03_01_2019.txt; some still printed to screen, which I've copied and pasted into super_debug_out_to_screen_03_01_2019.txt.

super_debug_out_03_01_2019.txt super_debug_out_to_screen_03_01_2019.txt

steven-murray commented 5 years ago

Thanks for that! It seems that the ionization routine is finishing fine for each process. Just to make sure: this run still hung, right?

I've pushed a new version to that same branch which adds a few more debugging print statements. To get them to print, you'll have to add

import logging
logging.getLogger("21CMMC").setLevel(logging.DEBUG)

to your script. I'm assuming that the hang is happening to one of the processes in the hand-off from C to Python, so hopefully this will help us narrow that down.

catherinewatkinson commented 5 years ago

Still hanging on python 3.6.5 yes. New output from the updated version: super_debug_out_07_01_2019.txt super_debug_out_to_screen_07_01_2019.txt

steven-murray commented 5 years ago

So it is returning from the Python wrapper function in all four processes, which means the error is happening somewhere between where the low-level wrapper finishes and the next iteration starts -- potentially somewhere inside emcee. I've pushed a few more changes with more debugging in it. Could you try that?

catherinewatkinson commented 5 years ago

That didn't get very far:

(py36) caw11@ph-jpritcha-1:~$ python drive_two_params.py > super_debug_out_08_01_2019
INFO:21CMMC:Initializing init and perturb boxes for the entire chain.
INFO:21CMMC:Existing init_boxes found and read in (seed=68456558803).
INFO:21CMMC:Existing z=7 perturb_field boxes found and read in (seed=68456558803).
INFO:21CMMC:Existing z=8 perturb_field boxes found and read in (seed=68456558803).
INFO:21CMMC:Existing z=9 perturb_field boxes found and read in (seed=68456558803).
INFO:21CMMC:Initialization done.
Traceback (most recent call last):
  File "drive_two_params.py", line 24, in <module>
    chain = mcmc.run_mcmc(core, likelihood, datadir='data', model_name=model_name, params=dict( HII_EFF_FACTOR = [30.0, 10.0, 50.0, 3.0], ION_Tvir_MIN = [4.7, 4, 6, 0.1],), walkersRatio=2, burninIterations=0, sampleIterations=10, threadCount=4, continue_sampling=False )
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/mcmc.py", line 95, in run_mcmc
    chain = build_computation_chain(core_modules, likelihood_modules, params)
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/mcmc.py", line 39, in build_computation_chain
    if setup: chain.setup()
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/LikelihoodComputationChain.py", line 142, in setup
    cModule.setup()
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/likelihood.py", line 291, in setup
    super().setup()
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/likelihood.py", line 123, in setup
    simctx = self.chain.simulate_mock()
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/LikelihoodComputationChain.py", line 71, in simulate_mock
    core.simulate_mock(ctx)
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/core.py", line 153, in simulate_mock
    self.build_model_data(ctx)
  File "/home/caw11/anaconda2/envs/py36/lib/python3.6/site-packages/py21cmmc/mcmc/core.py", line 336, in build_model_data
    logger.debug(f"PID={os.getpid()} Updating parameters: {ctx.getParams()}")
NameError: name 'os' is not defined

steven-murray commented 5 years ago

Damnit, should have run a quick test myself. Sorry about that. Use latest.
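For reference, the NameError above comes from the new debug line in core.py using os.getpid() without the module importing os, so the fix is presumably just the missing import. A minimal illustration (the function name here is hypothetical):

import logging
import os   # <- the import that was missing, per the traceback above

logger = logging.getLogger("21CMMC")

def log_current_params(params):
    # Equivalent of the failing debug line in build_model_data.
    logger.debug(f"PID={os.getpid()} Updating parameters: {params}")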


catherinewatkinson commented 5 years ago

I'm assuming it is enough to have just the end section where it hangs from here on? Let me know if not and I will send you everything as before...

DEBUG:21CMMC:PID=106956 RETURNING FROM COEVAL
DEBUG:21CMMC:PID=106956 Adding ['brightness_temp', 'xHI'] to context data
DEBUG:21CMMC:PID=106956 Reducing Data
2019-01-08 18:07:58 | DEBUG | IonisationBox.c | ComputeIonizedBox:627 [pid=106960] | checking in
2019-01-08 18:07:58 | DEBUG | IonisationBox.c | ComputeIonizedBox:734 [pid=106960] | checking in
2019-01-08 18:07:58 | DEBUG | IonisationBox.c | ComputeIonizedBox:776 [pid=106960] | checking in
2019-01-08 18:07:58 | DEBUG | IonisationBox.c | ComputeIonizedBox:780 [pid=106960] | checking in
2019-01-08 18:07:58 | DEBUG | IonisationBox.c | ComputeIonizedBox:801 [pid=106960] | checking in
2019-01-08 18:07:58 | DEBUG | IonisationBox.c | ComputeIonizedBox:815 [pid=106960] | finished!
DEBUG:21CMMC:PID=106960 doing brightness temp for z=7
DEBUG:21CMMC:PID=106960 RETURNING FROM COEVAL
DEBUG:21CMMC:PID=106960 Adding ['brightness_temp', 'xHI'] to context data
DEBUG:21CMMC:PID=106960 Reducing Data

steven-murray commented 5 years ago

Hey, thanks. Can you send the whole output? This looks promising, but I want to check some stuff further up the line.

catherinewatkinson commented 5 years ago

super_debug_out_to_screen_08_01_2019.txt super_debug_out_08_01_2019.txt

steven-murray commented 5 years ago

Okay I think we're getting closer :-) All of the processes are falling down when computing the power spectrum. Have you got pyFFTW installed? If so, try uninstalling it and running again. If not, let me know and I'll see what I can dig through.
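A quick, standalone way to check whether pyFFTW is importable in the active environment (before and after uninstalling):

try:
    import pyfftw  # noqa: F401
    print("pyFFTW is installed; powerbox can pick it up for its FFTs")
except ImportError:
    print("pyFFTW is not installed; numpy FFTs will be used instead")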

ghost commented 5 years ago

I will try this first thing in the morning. But coincidentally my students messaged saying that when they installed pyfftw on one of their accounts it stopped working again. When they uninstalled it, it started working again. So I am expecting to find the same.

steven-murray commented 5 years ago

Awesome. Yay! So this is probably an issue with either powerbox or pyfftw, not 21CMMC. Let me know. Since I own powerbox, I'll look into whether that's causing the issue.

catherinewatkinson commented 5 years ago

So I can confirm that uninstalling pyfftw stops super-debug from stalling under python 3.6.5. However there remains a bug of some description, as it crashes out complaining from likelihood.py that 'NameError: name 'lnL' is not defined'. See the below output summary for info.

super_debug_out_10_01_2019.txt

Running the stable code from develop-steven under python 3.6.6 (installed just prior to this comment: https://github.com/BradGreig/Hybrid21CM/issues/18#issuecomment-445166748) exhibits the hanging behaviour we saw in python 3.6.5 when I install pyfftw (returning to stable behaviour once I uninstall pyfftw).

Hope that is helpful :)

steven-murray commented 5 years ago

Awesome. That is hugely helpful, thanks! So it looks like pyfftw/powerbox is definitely the problem. For now, continue without pyfftw (all it does is use the FFTW package for the FFTs, instead of the numpy version). Using it should be faster (if it didn't hang), BUT in this case most of the time is spent running 21cmFAST, not doing the PS, so it shouldn't be a problem. I'll look into it on the powerbox/pyfftw side.

For reference: moving issue to https://github.com/steven-murray/powerbox/issues/3