ArtPoon / kamphir

Phylogenetic inference using a tree-shape kernel in an Approximate Bayesian Computation framework
BSD 3-Clause "New" or "Revised" License
6 stars 2 forks source link

Repeating serialization error #33

Closed rmcclosk closed 9 years ago

rmcclosk commented 9 years ago

I ran the command from issue #29 overnight. In the morning, the log indicated that 6196 steps had completed. The below message was printing to the console about every 2 seconds.

Error in serialize(data, node$con, xdr = FALSE) : 
  error writing to connection

Some googling indicates this is probably an R problem, which I had thought was resolved before in #29.

Python had crashed at some point. Here are some relevant lines from the crash report. I saved the rest of the report.

ProblemType: Crash
ExecutablePath: /usr/bin/python2.7
ExecutableTimestamp: 1395532628
ProcCmdline: python kamphir.py DiffRisk examples/settings.rcolgem-DiffRisk1.json examples/rcolgem_c1-2.0_n-300_rho-0.9.nwk diffrisk-c1-2.log -kdecay 0.3 -tol0 0.005 -mintol 0.0025 -ncores 4 -nthreads 4 -nreps 20 -treenum 0 -seed 0
ProcCwd: /home/rmcclosk/Documents/kamphir

I killed the program with Control+C. Here's the traceback to show where we were.

Traceback (most recent call last):
  File "kamphir.py", line 688, in <module>

  File "kamphir.py", line 446, in abc_mcmc
    print 'ERROR: next_score (', next_score, ') outside interval [0,1], dumping proposal and EXIT'
  File "kamphir.py", line 377, in evaluate
    if len(trees) == 0:
  File "kamphir.py", line 269, in simulate_internal
    for newick in newicks:
  File "/home/rmcclosk/Documents/kamphir/rcolgem.py", line 252, in simulate_DiffRisk_trees
    robjects.r("tfgy <- make.fgy( t0, maxSampleTime, births, deaths, nonDemeDynamics, x0, migrations=migrations, "
  File "/usr/local/lib/python2.7/dist-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/robjects/__init__.py", line 269, in __call__
    res = self.eval(p)
  File "/usr/local/lib/python2.7/dist-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 170, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 100, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
KeyboardInterrupt

So, it appears that next_score was outside the allowed interval for some reason, we tried to dump the proposal and quit, but it failed.

One possibility is that sys.exit() didn't cleanly shut down all the threads and the R instance (instances?) they were attached to. So we still had R trying to pass stuff back to Python when it was shut down.

rmcclosk commented 9 years ago

Something possibly related from the sys.exit documentation: "Since exit() ultimately “only” raises an exception, it will only exit the process when called from the main thread, and the exception is not intercepted."

It also occurs to me that we are using separate multi-threading tools in Python and R. It's possible there are some unexpected side effects from doing that.

ArtPoon commented 9 years ago

Oof. Regarding use of separate multi-threading tools, I suppose that we could have each Python thread spawn its own instance of R, and then distribute jobs respectively. A kernel score outside the prescribed bounds could conceivably be due to numerical overflow (see large diagonal problem in kernel matrices).

ArtPoon commented 9 years ago

I just ran into this problem on my workstation. Delving into my system logs, I found this report:

Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00007fa80c66f948
[...]
Application Specific Information:
crashed on child side of fork pre-exec

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   rcolgem.so                      0x0000000110e45050 dQAL + 320
1   deSolve.so                      0x0000000110d1e3d8 derivs + 136
2   deSolve.so                      0x0000000110d1cbec rk_fixed + 1244
3   deSolve.so                      0x0000000110cc492d call_rkFixed + 3293
4   libR.dylib                      0x000000010e26f0fc do_dotcall + 1916 (dotcode.c:652)
5   libR.dylib                      0x000000010e2a036b Rf_eval + 1355 (eval.c:657)

I suspect there is some exception being thrown at the level of deSolve that is not being handled correctly by rcolgem/Python. Will try to find the offending line in dQAL.

ArtPoon commented 9 years ago

This might be fixed by rcolgem svn 126, integrated these changes as of commit 27ec24cdb4786d6e238e5e62bdd2d79bfbb78fb5

ArtPoon commented 9 years ago

I think this one's finally squished by commit 9fe8f34d011077ecc72714866497dbaaa9285e7f

ArtPoon commented 9 years ago

ARGH. Still no good. To reproduce, do a fresh pull, recompile and install rcolgem, and use this invocation:

python kamphir.py PANGEA settings/pangea.json /Users/art/git/pangea/data/February2015/Regional/hyphy/root2tip/150129_PANGEAsim_Regional_FirstObj_scA_SIMULATED_all.nwk scA.log -delimiter _ -ncores 5 -nthreads 5 -nreps 5 -tscale 0.142857
rmcclosk commented 9 years ago

I believe this is fixed with 95097956a17d31b5a83c8194042f02893d56560c. The problem was negative node heights, which would lead to an attempt to access a negative array index in dQAL (before the first time point when the population sizes are calculated). I put in a check for this so it will use the values from time zero if the time is negative. 30 steps so far no problems.

This is not a coherent fix but more of a stopgap just to get it to run. If the node heights are negative, that means the nodes are around before time zero. Since first infected individual appears at time zero, this scenario is impossible, and probably those trees should be thrown away instead.