gchen98 / macs

Automatically exported from code.google.com/p/macs

Nondeterminism in macs when python bindings are used #19

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Pick any binding method that lets you call C++ code from Python -- I've 
tried SWIG, Cython, and Boost.Python (the last being by far the easiest).
2. Run a simple macs command (e.g., macs 10 1 -t 10 -s 3) and the output will 
differ between the Python version of the code and the "pure" C++ version. 
Namely, it differs not in the number of segregating sites, but in the 
patterns of diversity seen (e.g., pi will be different).
3. In the code, when an Edge object is made, the top and bottom node heights 
will differ when you run macs natively compared to when you run macs via 
Python (as seen by simply printing out the node heights and comparing the output).

What is the expected output? What do you see instead?
I would expect to see the same output, with the seed fixed, whether you run via 
Boost.Python or via the native C++ code. Since I see the same nondeterminism 
across three different ways of linking Python to C++, I gather that the problem 
is not in the language bindings but in the macs code itself. Printing out the 
results of the random number generator shows the *same* results between the two 
different methods of invoking the code.
I have used Python bindings with Hudson's ms, and it works great.

What version of the product are you using? On what operating system?
The most recent (checked out from SVN just two weeks ago). RHEL6 (on two different systems).

Please provide any additional information below.
It isn't really fair to call this a bug, but if you want to be able to do a lot 
of whole-genome simulations, having a way to invoke macs as a function rather 
than as a program is pretty essential; e.g., having an API would be awesome.
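For concreteness, here is a minimal Boost.Python sketch of the kind of API I 
mean. The module name (pymacs), the run_macs() entry point, and the idea of 
forwarding an ms-style argument string to the existing command-line driver are 
all assumptions for illustration; none of this exists in macs today.

    #include <boost/python.hpp>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical wrapper: split the argument string into tokens and hand
    // them to macs' existing simulation driver (represented here by a stub).
    int run_macs(const std::string &args) {
        std::vector<std::string> argv;
        std::istringstream iss(args);
        for (std::string tok; iss >> tok; ) argv.push_back(tok);
        // ... forward argv to the existing command-line driver here ...
        return static_cast<int>(argv.size());
    }

    BOOST_PYTHON_MODULE(pymacs) {
        boost::python::def("run_macs", &run_macs,
                           "Run one macs simulation from an ms-style argument string.");
    }

From Python this would then just be: import pymacs; pymacs.run_macs("10 1 -t 10 -s 3").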

I am also unsure whether what I'm seeing is really a bug -- I can't see why 
you'd get nondeterministic output when the RNG is giving the same output under 
either approach, but I can see how floating-point imprecision might introduce a 
little noise in this scenario. That said, when you run the above simulation you 
get different Newick trees out; the shape is exactly the same, but the node 
ordering is different... and I have a hard time chalking that up to 
floating-point issues. Honestly, the tree differences won't matter for the 
panmictic simulations I'm doing, but I still need to check whether they matter 
once we start simulating structure.

To be clear, to get the Python bindings to work you do need to tweak the C++ 
code, but the tweaking is very minimal (and each of the three approaches above 
involved quite different tweaks, all leading to the same error). I've tried 
running under valgrind and it detects no memory errors (though I don't know how 
well it works with C++; I'm more of a C programmer...).

Again, I'm not sure what a realistic answer to this post is, unless it's "of 
course there can be nondeterminism even when you fix the random number seed".

-August

Original issue reported on code.google.com by augu...@email.arizona.edu on 3 Oct 2013 at 8:14

GoogleCodeExporter commented 9 years ago
Well, I thought I found the bug... and it still may be one:

    double RandNumGenerator::expRV(double dLambda)

can (and does) return inf.

macs 10 1 -t 10 -s 3 will generate this behavior.
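To spell out what "can return inf" means here: the sketch below assumes a 
standard inverse-transform exponential draw (it is not copied from the macs 
source) and just shows how a rate of zero turns the waiting time into infinity.

    #include <cmath>
    #include <cstdlib>
    #include <iostream>

    // Sketch only, assuming an inverse-transform draw; not the macs source.
    double expRV(double dLambda) {
        double u = (std::rand() + 1.0) / (RAND_MAX + 2.0);  // uniform in (0,1)
        return -std::log(u) / dLambda;                      // dLambda == 0 -> +inf
    }

    int main() {
        std::srand(3);
        std::cout << expRV(1.0) << '\n';   // finite waiting time
        std::cout << expRV(0.0) << '\n';   // prints "inf"
        return 0;
    }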

Original comment by augu...@email.arizona.edu on 3 Oct 2013 at 8:44

GoogleCodeExporter commented 9 years ago
Hi August,

Regarding the expRV call returning infinity: in your case above, I think it is 
behaving as expected, in that because you have a zero recombination rate, the 
next position along the chromosome should be infinity.  I checked the other 
calls to expRV in algorithm.cpp, and I don't think they should be passing a 
zero to expRV (e.g. the coalescing rate), as there is always an if statement 
checking for positivity.
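Roughly, the call-site pattern I mean looks like the sketch below. The variable 
names are placeholders and this is not copied from algorithm.cpp; it just shows 
why a zero coalescing rate never reaches expRV, while a zero recombination rate 
legitimately maps to an infinite next position.

    #include <cmath>
    #include <iostream>
    #include <limits>

    // Placeholder sketch, not the actual algorithm.cpp code.
    double expRV(double dLambda) { return -std::log(0.5) / dLambda; }

    int main() {
        double dCoalescingRate = 0.0;   // would be > 0 whenever lineages can coalesce
        double dRecombRate     = 0.0;   // e.g. no recombination requested

        double dCoalWait = std::numeric_limits<double>::infinity();
        if (dCoalescingRate > 0.0) {               // guard keeps zero away from expRV
            dCoalWait = expRV(dCoalescingRate);
        }

        double dNextPos = expRV(dRecombRate);      // no guard: infinity is the intended answer
        std::cout << dCoalWait << ' ' << dNextPos << '\n';   // prints "inf inf"
        return 0;
    }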

I'm not sure I can address the Python binding issue.  I can try, but I can't 
promise anything. Can you show me what tweaks you had to make?

Original comment by gche...@gmail.com on 4 Oct 2013 at 4:40

GoogleCodeExporter commented 9 years ago
Well, I think I may have found something that will actually help diagnose this. 
In the beginSimulation function, if you change:
    GraphBuilder graphBuilder = GraphBuilder(pConfig,rg);
    graphBuilder.build();
    graphBuilder.printHaplotypes();

to:

    GraphBuilder *graphBuilder = new GraphBuilder(pConfig,rg);
    graphBuilder->build();
    graphBuilder->printHaplotypes();
    delete graphBuilder;

you get differences in the output of macs when you run just one simulation. 
For example:

    ./macs 100 5000 -T -t 0.001 -s 10 -r 0.0004 -h 1e5 -I 2 50 50 0.5 2> /dev/null | fgrep TOTAL_SITES

returns 42 in the original code and 44 in the code using 'new'.
I'm not well enough versed in the code to say whether or not this is expected 
behavior, but the code I had used a new'd copy of the GraphBuilder object on 
which the simulations were run, so this behavior comes as a bit of a surprise 
to me :)
I can tweak the code so that I don't need the GraphBuilder to be kept on the 
heap, but I don't know whether this is symptomatic of some other kind of memory 
error...?
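Just to illustrate the kind of thing I'm worried about (a purely hypothetical 
example, not anything I've found in macs): an uninitialized member is one way a 
stack allocation and a heap allocation can legitimately produce different 
values, because the object inherits whatever bytes happen to sit at that 
address.

    #include <iostream>

    // Purely hypothetical; not taken from the macs source.
    struct Builder {
        double dHeight;        // never set by the constructor
        Builder() {}
    };

    int main() {
        Builder onStack;                     // picks up leftover stack bytes
        Builder *onHeap = new Builder();     // picks up leftover heap bytes
        // Reading dHeight here is undefined behavior; the two objects can
        // print different garbage, and the result can depend on how and
        // where the object was allocated.
        std::cout << onStack.dHeight << ' ' << onHeap->dHeight << std::endl;
        delete onHeap;
        return 0;
    }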

Thanks for your willingness to help on this issue!
-August

Original comment by augu...@email.arizona.edu on 22 Oct 2013 at 9:58

GoogleCodeExporter commented 9 years ago
I realized that I misspoke (or mistyped). I was keeping the GraphBuilder object 
because I wrote the whole-genome simulations to some arrays stored in that 
object. Sorry for the confusion!
-August

Original comment by augu...@email.arizona.edu on 22 Oct 2013 at 10:25

GoogleCodeExporter commented 9 years ago
The memory management behind macs is very complex.  It includes a smart-pointer 
reference-counting mechanism, courtesy of the Boost libraries.  I don't know if 
Boost is to blame, but I think digging into this would be monumental, to say 
the least.
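For context, by reference counting I mean the boost::shared_ptr mechanism, 
roughly as in this toy example (illustration only, not macs code):

    #include <boost/shared_ptr.hpp>
    #include <iostream>

    // Toy illustration of Boost reference counting, not macs code.
    struct Node { double dHeight; };

    int main() {
        boost::shared_ptr<Node> a(new Node());
        a->dHeight = 1.5;
        {
            boost::shared_ptr<Node> b = a;          // reference count becomes 2
            std::cout << a.use_count() << '\n';     // prints 2
        }                                           // b destroyed, count back to 1
        std::cout << a.use_count() << '\n';         // prints 1
        return 0;
    }                                               // count hits 0, Node is deleted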

Now, if you were to run macs over many iterations using the heap and non-heap 
versions and compute summary statistics over each, would you get different 
answers?

Original comment by gche...@gmail.com on 22 Oct 2013 at 10:50

GoogleCodeExporter commented 9 years ago

Original comment by gche...@gmail.com on 4 May 2014 at 8:57