BradGreig / Hybrid21CM

run_coeval segfaulting #1

Closed steven-murray closed 6 years ago

steven-murray commented 6 years ago

The command

$ 21CMMC coeval 7 --do-spin -z 1.2

fails with a segfault. The output from valgrind is

==13876== Invalid read of size 8
==13876==    at 0x217F04D2: kappa_10 (heating_helper_progs.c:559)
==13876==    by 0x217FD86E: ComputeTsBox (SpinTemperatureBox.c:696)
==13876==    by 0x218082FC: _cffi_f_ComputeTsBox (py21cmmc._21cmfast._21cmfast.c:1303)
==13876==    by 0x217690: _PyCFunction_FastCallDict (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A6ACB: call_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2C94B9: _PyEval_EvalFrameDefault (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x29FF05: _PyEval_EvalCodeWithName (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A0F4E: fast_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A6BA4: call_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2CA278: _PyEval_EvalFrameDefault (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x29FF05: _PyEval_EvalCodeWithName (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A0F4E: fast_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==  Address 0xfffffffc21a6dc40 is not stack'd, malloc'd or (recently) free'd
==13876== 
==13876== 
==13876== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==13876==  Access not within mapped region at address 0xFFFFFFFC21A6DC40
==13876==    at 0x217F04D2: kappa_10 (heating_helper_progs.c:559)
==13876==    by 0x217FD86E: ComputeTsBox (SpinTemperatureBox.c:696)
==13876==    by 0x218082FC: _cffi_f_ComputeTsBox (py21cmmc._21cmfast._21cmfast.c:1303)
==13876==    by 0x217690: _PyCFunction_FastCallDict (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A6ACB: call_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2C94B9: _PyEval_EvalFrameDefault (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x29FF05: _PyEval_EvalCodeWithName (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A0F4E: fast_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A6BA4: call_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2CA278: _PyEval_EvalFrameDefault (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x29FF05: _PyEval_EvalCodeWithName (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==    by 0x2A0F4E: fast_function (in /home/steven/miniconda3/envs/21CMMC/bin/python3.6)
==13876==  If you believe this happened as a result of a stack
==13876==  overflow in your program's main thread (unlikely but
==13876==  possible), you can try to increase the size of the
==13876==  main thread stack using the --main-stacksize= flag.
==13876==  The main thread stack size used in this run was 8388608.
==13876== 
==13876== HEAP SUMMARY:
==13876==     in use at exit: 136,722,649 bytes in 110,437 blocks
==13876==   total heap usage: 514,914 allocs, 404,477 frees, 2,834,357,772 bytes allocated
==13876== 
==13876== LEAK SUMMARY:
==13876==    definitely lost: 405,270 bytes in 158 blocks
==13876==    indirectly lost: 2,938,900 bytes in 4,680 blocks
==13876==      possibly lost: 1,386,398 bytes in 1,257 blocks
==13876==    still reachable: 131,992,081 bytes in 104,342 blocks
==13876==         suppressed: 0 bytes in 0 blocks
==13876== Rerun with --leak-check=full to see details of leaked memory
==13876== 
==13876== For counts of detected and suppressed errors, rerun with: -v
==13876== ERROR SUMMARY: 531 errors from 73 contexts (suppressed: 38271 from 745)

steven-murray commented 6 years ago

Interestingly, just running $ 21CMMC spin 7 -z 1.2 works fine.

steven-murray commented 6 years ago

From a brief look at the C code, it seems that Tk_spline isn't filled before it is used... ideas?

BradGreig commented 6 years ago

Running the same as you, I did not receive a segfault. Instead I received the following:

Traceback (most recent call last):
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/bin/21CMMC", line 11, in <module>
    load_entry_point('py21cmmc', 'console_scripts', '21CMMC')()
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/bradleygreig/Documents/21cmMC/Hybrid21CMMC/GitVersion/Hybrid21CM/src/py21cmmc/cli.py", line 318, in coeval
    regenerate=regen, write=True, direc=direc, match_seed=match_seed
  File "/Users/bradleygreig/Documents/21cmMC/Hybrid21CMMC/GitVersion/Hybrid21CM/src/py21cmmc/_21cmfast/wrapper.py", line 1239, in run_coeval
    st = copy.deepcopy(st2)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
  File "/Users/bradleygreig/Documents/21cmMC/Hybrid21CMMC/GitVersion/Hybrid21CM/src/py21cmmc/_21cmfast/_utils.py", line 495, in __getstate__
    return {k:v for k,v in self.__dict__.items() if not isinstance(k, self.ffi.CData)}
  File "/Users/bradleygreig/Documents/21cmMC/Hybrid21CMMC/GitVersion/Hybrid21CM/src/py21cmmc/_21cmfast/_utils.py", line 495, in <dictcomp>
    return {k:v for k,v in self.__dict__.items() if not isinstance(k, self.ffi.CData)}
AttributeError: 'TsBox' object has no attribute 'ffi'

BradGreig commented 6 years ago

I note, however, that the average values I am outputting (for debugging) come out as "nan", which I suspect is the same issue in a different form. That is, the code built with my compiler might be reading from different (but existing) memory.

I think you might be right regarding the non-initialisation of the interpolation.
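
As a rough sanity check that the segfault and the "nan" values could share a cause, the faulting address in the valgrind output can be decomposed. This is only a sketch under assumptions that nothing in this thread confirms: that kappa_10 indexes a table of 8-byte doubles, and that the index comes from casting a NaN to int, which on x86-64 typically yields INT_MIN (the cast is undefined behaviour in C, so this is not guaranteed).

```python
# Faulting address reported by valgrind for the read in kappa_10.
FAULT_ADDR = 0xFFFFFFFC21A6DC40

ELEM_SIZE = 8      # "Invalid read of size 8"; assumed to be one double
INT_MIN = -2**31   # assumed index: what x86-64 typically produces for (int)NaN


def implied_base(fault_addr, index, elem_size):
    """Return the array base implied by fault_addr = base + index * elem_size,
    using 64-bit wrap-around arithmetic."""
    return (fault_addr - index * elem_size) % 2**64


base = implied_base(FAULT_ADDR, INT_MIN, ELEM_SIZE)
print(hex(base))  # 0x21a6dc40
```

The implied base, 0x21a6dc40, lies a few MB above the library's code addresses in the same trace (0x217F04D2, 0x218082FC), which is where static data could plausibly live. That is consistent with, but does not prove, a NaN temperature reaching the table lookup.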

Line 179 of SpinTemperatureBox.c is the conditional that decides whether or not to initialise the X-ray heating tables. This condition is failing when $ 21CMMC coeval 7 --do-spin -z 1.2 is called.

This condition needs to be passed for the code to function properly.

BradGreig commented 6 years ago

I just tried removing that condition at line 179, but that didn't fix the problem. Most likely because that same condition is used in several other places.

Actually, I forced first_box to be true (changing it within C) and the nan values were gone. So, I think this is indeed the problem.

Hmm, nope, this can't be the issue. I'll dig deeper...

Regardless, that Python error is consistent: AttributeError: 'TsBox' object has no attribute 'ffi'. Not sure what is going on there.
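
For reference, the pattern that the failing __getstate__ appears to be aiming for (dropping C handles from the state before copy.deepcopy or pickling) can be sketched as below. This is a minimal stand-alone sketch, not the py21cmmc code: FakeCData, Box, _handle_type and _cstruct are hypothetical names, and FakeCData merely stands in for a cffi cdata object. It differs from the line in the traceback in two ways: the handle type is a class attribute, so the lookup cannot fail with a missing-attribute error, and the filter tests the values v rather than the keys k.

```python
import copy


class FakeCData:
    """Hypothetical stand-in for a cffi cdata handle; refuses to be copied."""

    def __deepcopy__(self, memo):
        raise TypeError("cannot copy a C handle")


class Box:
    # Class-level reference to the handle type, so it is always resolvable.
    _handle_type = FakeCData

    def __init__(self, redshift):
        self.redshift = redshift
        self._cstruct = FakeCData()  # would be allocated through cffi in real code

    def __getstate__(self):
        # Keep only plain Python attributes; filter on the values, not the keys.
        return {
            k: v
            for k, v in self.__dict__.items()
            if not isinstance(v, self._handle_type)
        }

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._cstruct = FakeCData()  # re-create the handle for the copy


box = Box(7.0)
clone = copy.deepcopy(box)  # succeeds because the handle never enters the state
print(clone.redshift, clone._cstruct is not box._cstruct)
```

Whether the copy should get a fresh handle, as here, or share the original's is a design choice that depends on who owns the underlying C memory.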

steven-murray commented 6 years ago

Yeah, I eventually got that error too. It is fixed now (in my branch).

steven-murray commented 6 years ago

When I run, it does the first spin box fine (at z=36 or so), then fails on the next spin box (at z=30 or so). It should be doing the init_heat() on the first box, right? So then it shouldn't fail on the second box given it's in the same session.

steven-murray commented 6 years ago

in heating_helper_progs... is the tkin_spline variable local?

BradGreig commented 6 years ago

I have fixed the "nan" issue, but I am not sure if that solves the overall issue. I now get sensible numbers up until I receive the Python error.

Have you pushed your fix?

steven-murray commented 6 years ago

yes.

BradGreig commented 6 years ago

Ok, I pulled your fix, and ran it with my fix and it "mostly" worked. I have pushed my fix.

Expect the following error though:

Traceback (most recent call last):
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/bin/21CMMC", line 11, in <module>
    load_entry_point('py21cmmc', 'console_scripts', '21CMMC')()
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/bradleygreig/anaconda3/envs/hybrid21CMMC-develop-brad/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/bradleygreig/Documents/21cmMC/Hybrid21CMMC/GitVersion/Hybrid21CM/src/py21cmmc/cli.py", line 318, in coeval
    regenerate=regen, write=True, direc=direc, match_seed=match_seed
  File "/Users/bradleygreig/Documents/21cmMC/Hybrid21CMMC/GitVersion/Hybrid21CM/src/py21cmmc/_21cmfast/wrapper.py", line 1252, in run_coeval
    bt += [brightness_temperature(ib, perturb[minarg], st if do_spin_temp else None)]
  File "/Users/bradleygreig/Documents/21cmMC/Hybrid21CMMC/GitVersion/Hybrid21CM/src/py21cmmc/_21cmfast/wrapper.py", line 1114, in brightness_temperature
    spin_temp(), ionized_box(), perturb_field(), box())
TypeError: cdata 'struct TsBox *' is not callable
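
The TypeError says that a raw cdata 'struct TsBox *' reached brightness_temperature, which calls each of its arguments, presumably to get at the underlying struct. One defensive pattern is to normalise inputs so that both a wrapper object and an already-unwrapped struct are accepted. The sketch below uses hypothetical names only (RawStruct, Wrapper and as_cstruct are not part of py21cmmc), and how the real wrapper classes behave is assumed from the traceback.

```python
class RawStruct:
    """Hypothetical stand-in for a raw C struct pointer: deliberately not callable."""

    def __init__(self, name):
        self.name = name


class Wrapper:
    """Hypothetical Python-side box object; calling it yields the raw struct."""

    def __init__(self, name):
        self._cstruct = RawStruct(name)

    def __call__(self):
        return self._cstruct


def as_cstruct(obj):
    """Return the raw struct whether obj is a wrapper, a raw struct, or None."""
    if obj is None:
        return None
    return obj() if callable(obj) else obj


wrapped = Wrapper("TsBox")
raw = wrapped()

# Both forms resolve to the same underlying struct.
print(as_cstruct(wrapped) is raw, as_cstruct(raw) is raw)
```

The alternative is to keep the call sites strict and make sure only wrapper objects are ever passed in, which surfaces mistakes like this one earlier.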

BradGreig commented 6 years ago

And I am going home! :)

steven-murray commented 6 years ago

okay, awesome, thanks! I think I can fix that :-)

steven-murray commented 6 years ago

this seems to all be fixed as of f4336230c030105ed51f01d2d0966addf393843d