ashander / ftprime

Forward-time simulation of the msprime data structure (for development)
2 stars 1 forks source link

Fix/upgrade edgesets edges #44

Closed ashander closed 6 years ago

ashander commented 6 years ago

To work with current msprime master.

Todo, maybe

codecov[bot] commented 6 years ago

Codecov Report

Merging #44 into master will decrease coverage by 0.81%. The diff coverage is 78.94%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #44      +/-   ##
=========================================
- Coverage   82.12%   81.3%   -0.82%     
=========================================
  Files           3       3              
  Lines         207     214       +7     
=========================================
+ Hits          170     174       +4     
- Misses         37      40       +3
Impacted Files Coverage Δ
ftprime/argrecorder.py 75.79% <78.94%> (-0.88%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4088274...31f8879. Read the comment docs.

ashander commented 6 years ago

This looks like it passes but there is one skipped test which is to see if we get the same sequence from tree_sequence and simplify followed by tree_sequence.

We aren't getting the same due to mismatching sequence length --- when using only ArgRecorder.tree_sequence the sequence length is not pruned to the actual length spanned by the edges. As can be seen in the code the method that does not prune runs only msprime.TreeSequence.simplify() while the one that does uses msprime.simplify_tables() This is possibly an msprime bug, or we should change our expectations

Details

In this case we simplify this (using the old notation) tree sequence and simplify choosing samples 3,4. (Note: something seems wrong about the simplify here -- it should include 3 on (0.5, 1.0) but ignore that for now as the msprime issue linked below shows that TreeSequence.simplify is not reducing sequence_length)

|| Nodes:                                                                                                                                                                                
|| id   flags   population      time                                                                                                                                                     
|| 0    0       -1              1.00000000000000                                                                                                                                         
|| 1    1       -1              0.20000000000000                                                                                                                                         
|| 2    1       -1              0.00000000000000                                                                                                                                         
|| 3    1       2               2.00000000000000                                                                                                                                         
|| 4    1       2               2.00000000000000                                                                                                                                         
|| Edgesets:                                                                                                                                                                             
|| id   left            right           parent  children                                                                                                                                 
|| 0    0.00000000      1.00000000      0       1,2                                                                                                                                      
|| 1    0.00000000      0.50000000      1       3,4                                                                                                                                      
|| 2    0.50000000      1.00000000      1       3                                                                                                                                        

to the following two sequences

sequence a

(produced directly from ArgRecorder.tree_sequence)

|| ---------------- length sequence a -----------                                                                                                                                        || 0.5                                                                                                                                                                                   
|| id   flags   population      time                                                                                                                                                     
|| 0    1       2               0.00000000000000                                                                                                                                         
|| 1    1       2               0.00000000000000                                                                                                                                         
|| 2    0       -1              2.20000000000000                                                                                                                                         
|| id   left            right           parent  children                                                                                                                                 
|| 0    0.00000000      0.50000000      2       0,1                                                                                                                                      

sequence b

produced from first ArgRecorder.simplify then ArgRecorder.tree_sequence

|| id   flags   population      time                                                                                                                                                     
|| 0    1       2               0.00000000000000                                                                                                                                         
|| 1    1       2               0.00000000000000                                                                                                                                         
|| 2    0       -1              2.20000000000000                                                                                                                                         
|| id   left            right           parent  children                                                                                                                                 
|| 0    0.00000000      0.50000000      2       0,1                                                                                                                                      

issue

using earlier msprime code (jeromekelleher/msprime.git@63107256347f6a365309cadd6687640d11bb99db ) the sequence lenghts were both changed to 0.5.

I think it's now failing because recent changes (in https://github.com/jeromekelleher/msprime/commit/ff58e4658f725d19981c26a940a5ebb0f16f1c46 ) to the way the TreeSequence.sequence_length attribute is treated during simplify but have not parsed the code through simplify_tables vs TreeSequence.simplify() to understand the difference.

petrelharp commented 6 years ago

This is correct behavior, because if the sequence_length is not passed in, then it is determined to be the max right coordinate of the edges. For the first tree sequence:

 tsa = records.tree_sequence([4, 5])

what happens is that the entire tree sequence gets simplified; there are edges with right coordinate 1.0, so this is the sequence_length. For the second:

        records.simplify([4, 5])
        tsb = records.tree_sequence([4, 5])

the first round of simplify leaves only the single edge on [0.0, 0.5), so the second simplify that happens in .tree_sequence([4,5]) reduces the sequence length to this.

I think that the ARGrecorder should have a sequence_length attribute: it can be optional, and determined from input if not provided; then this can be passed to simplify_tables. That would eliminate this.

petrelharp commented 6 years ago

OK, I added it in, possibly correctly (but can't test on this computer...)

ashander commented 6 years ago

OK. I was missing this

if the sequence_length is not passed in, then it is determined to be the max right coordinate of the edges.

which happens somewhere in msprime code other than where I was looking.

But I still find this inconsistent seq length under simplify_tables vs TreeSequence.simplify (shown in https://github.com/jeromekelleher/msprime/issues/283 ) confusing.

ashander commented 6 years ago

Abstractly, your fix makes sense. The implementation has some issues: after fixing some obvious bugs, it's not working in 5dac744 But I may be missing something else, as one of the bugs I fixed was passing sequence_length to TreeSequence.simplify(), which doesn't work in current msprime. Let me know and/or I'll stare at it more tomorrow

ashander commented 6 years ago

Still confused that the entire tree seq simplified in

For the first tree sequence:

tsa = records.tree_sequence([4, 5])

what happens is that the entire tree sequence gets simplified; there are edges with right coordinate 1.0, so this is the sequence_length.

when we explicitly pass samples. However, this relates to general confusion about TreeSequence.simplify. See https://github.com/jeromekelleher/msprime/issues/283#issuecomment-337989900

ashander commented 6 years ago

OK! Made it work. Had to set it so providing ts or edges overrides provided seq len. Thoughts?

petrelharp commented 6 years ago

I think that sequence_length if passed in explicitly should override ts.sequence_length. And, I don't see why this fixed the problem - do you? By my reading, the only difference is that I had

if sequence_length is not None:
    self.sequence_length = sequence_length
elif ts is not None:
    self.sequence_length = ts.sequence_length

while you have

if ts is not None:
    self.sequence_length = ts.sequence_length
elif sequence_length is not None:
    self.sequence_length = sequence_length

and where we use this it should be called with ts not None and sequence_length=None, so these should be identical?

When I make that change (deleting the extra and (ts is not None) phrase), it still passes tests on my computer...

petrelharp commented 6 years ago

Also, sorry for the non-functional fixes...

petrelharp commented 6 years ago

Re: your residual confusion above

Still confused that the entire tree seq simplified in ... what I meant is that for tsa, the set of tables passed to simplify is everything specified (init_ts plus those edges added in this test), whereas to create tsb, since we first called records.simplify([4,5]), much of this has been dropped from the internal tables, including those edges going out to 1.0.

ashander commented 6 years ago

Re the residual confusion, I got it, mostly. It was mostly due to confusion/surprise at how msprime behaves. See https://github.com/jeromekelleher/msprime/issues/283#issuecomment-337994675

The key fix I made was adding sequence_length to load_tables and removing it from ts.simplify().

Various versions of the flow control stuff at the start should work, depending on how we want things to behave. I opened #45 to discuss exactly what we want to happen