ashander / ftprime

Forward-time simulation of the msprime data structure (for development)
2 stars 1 forks source link

Add tree tests to ftprime #54

Closed ashander closed 6 years ago

ashander commented 6 years ago

just for now testing the number of trees produced vs supposedly equivalent msprime simulation

fixing this or getting the scaling right should resolve (some) recombination issues in https://github.com/petrelharp/ftprime_ms/pull/11

ashander commented 6 years ago

This is failing with my test (see diff) to see if number of trees are within 3 standard deviations of the sample mean number of trees produced in msprime. Where we fail, we have a larger number of trees than msprime.

expected mean (sample from 100 msprime runs) - one run with ftprime:


        std = math.sqrt(1 / (len(tree_counts) - 1) * sse)

        assert ts.sample_size == mspts.sample_size
        # TODO fix  expected error
>       assert abs(mean - ts.num_trees) < 3 * std
E       assert 113.47999999999999 < (3 * 25.60953105404472)
E        +  where 113.47999999999999 = abs((134.52 - 248))
E        +    where 248 = <msprime.trees.TreeSequence object at 0x7f77f2ad0fd0>.num_trees
--
        std = math.sqrt(1 / (len(tree_counts) - 1) * sse)

        assert ts.sample_size == mspts.sample_size
        # TODO fix  expected error
>       assert abs(mean - ts.num_trees) < 3 * std
E       assert 156.26 < (3 * 39.79584770989928)
E        +  where 156.26 = abs((265.74 - 422))
E        +    where 422 = <msprime.trees.TreeSequence object at 0x7f77f2bdec50>.num_trees
ashander commented 6 years ago

@petrelharp thanks for earlier feedback . we still have this issue (see two examples above) where our number of trees exceeds that from the supposedly equivalent msprime sims. this is the opposite to what we'd expect if simuPOP were skipping recombinations

can you

petrelharp commented 6 years ago

Since recombs are produced by simuPOP; I'm not sure this should be a ftprime issue. If there is an issue here, either (a) simuPOP is not producing the number of recombs we expect, or (b) we are not properly recording them. I think we adequately test (b) elsewhere, so a test of this would just involve simuPOP. It could be done by running simuPOP with N diploid individuals for T generations, writing out recombinations to a file; then the number of recombinations in that file (which something like cut -f 4- recombs.txt | wc -w) should be N * T * r * L plus/minus 3 * sqrt(N * T * r * L), where r is the recomb rate per locus and L is the number of loci.

petrelharp commented 6 years ago

oh, right: ALSO, I am not sure that number of trees is a good check, because I forget if simplify merges adjacent identical edges.

ashander commented 6 years ago

Oh! I should check that before I use it any further

On Thu, Nov 9, 2017 at 9:15 AM, Peter Ralph notifications@github.com wrote:

oh, right: ALSO, I am not sure that number of trees is a good check, because I forget if simplify merges adjacent identical edges.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/pull/54#issuecomment-343225288, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOGSsvCY3ndMRVHfgOkQbUg0akjyIks5s0zMkgaJpZM4QXDqG .

ashander commented 6 years ago

and good point about this not being inherent to ftprime.recomb_collector. though I'd argue it's nice to have examples where the resulting treeseq matches our input recombination rate! (as you point out, num_trees may not be a good test of this).

It could be done by running simuPOP with N diploid individuals for T generations, writing out recombinations to a file; then the number of recombinations in that file (which something like cut -f 4- recombs.txt | wc -w) should be N T r L plus/minus 3 sqrt(N T r * L), where r is the recomb rate per locus and L is the number of loci.

but I'll do this