ashander / ftprime

Forward-time simulation of the msprime data structure (for development)
2 stars 1 forks source link

do simplify() in steps #24

Closed petrelharp closed 6 years ago

petrelharp commented 7 years ago

Currently we record all of history and then simplify() it all at once. This ends up taking rather a lot of memory. As discussed with @jeromekelleher, we ought to be able to do this stepwise. To do this, we'll need to make sure there's room enough to sample the entire population, and then:

  1. add_samples(ids) where ids is the current generation, and remember that ids -> range(len(ids))
  2. simplify()
  3. restart the argrecorder and remember that range(len(ids)) -> initial_gen
  4. run for a while longer
  5. append results to tree sequence and return to 1

To append more things to a tree sequence we need to:

  1. shift time in the NodeTable
  2. move the samples to the end as new IDs
  3. append new nodes and edgesets while remapping the IDs by adding a constant to all IDs and remapping initial_gen to whatever they should be
jeromekelleher commented 7 years ago

How much simpler would it be if IDs didn't have to be 0..n - 1? I can probably remove this requirement fairly quickly if you're happy with possible bugs that may pop up. Simplify will probably still map the samples back to 0... n -1 though --- changing this wouldn't be impossible, just a bit more work as we'd have to change the interface a bit.

petrelharp commented 7 years ago

It's a small but nontrivial annoyance to always be keeping space at the start of ID space for the samples. It's easy to work around, so noo big deal, but also easy to forget - was yhe source of a bug by @ashander just te other day.

On Mon, Apr 24, 2017 at 7:53 AM, Jerome Kelleher notifications@github.com wrote:

How much simpler would it be if IDs didn't have to be 0..n - 1? I can probably remove this requirement fairly quickly if you're happy with possible bugs that may pop up. Simplify will probably still map the samples back to 0... n -1 though --- changing this wouldn't be impossible, just a bit more work as we'd have to change the interface a bit.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/issues/24#issuecomment-296693884, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26QPS2XuLz31E9xqo47FfjurrkhM-ks5rzLd1gaJpZM4NGQVd .

jeromekelleher commented 7 years ago

OK, I'll have a look this evening to see if it can be done reasonably easily. I'll let you know either way.

ashander commented 7 years ago

This change could improve performance a lot, hopefully. Results in #25 suggest a most time is spent merging records, which scales with the number of nodes as ___ (TBD) and number of edgesets as (TBD).

Also, as @petrelharp and I were discussing today, this change is equivalent to:

  1. running a simulation on a population p from time t0 to t1 and using our alg to record the ARG
  2. simplifying the recorded ARG to make a tree sequence T1
  3. running the simulation further from t1 to t2
  4. simplifying the second recorded ARG to make a tree sequence T2 5 adding T1 and T2 together --- if there is a well defined way to "add" tree sequences.

we're thinking that we can define a clear way to "add" two tree sequences T1 and T2, given: a mapping of node numbers in T2 to those in T1, and an offset relating times between the tree sequences. (note: I'm writing "add" because the operator would be non-commutative and assume that T1 is earlier in time than T2.)

petrelharp commented 7 years ago

Hm, I'm not sure:

This change would reduce memory usage (which is being a problem), but that's not what's leading to the slowdown we see over there.

Or am I missing something?

ashander commented 7 years ago

:grimacing: yes the results in #25 aren't a clear case for this. (thought it's still probably good to do.) my thought was that as dict size increases the operations on the dict slow down, possibly to a crawl. but need to do more profiling across a range of pop sizes to see this. I'll do that comment more over in #25

petrelharp commented 7 years ago

Accessing dicts should scale as log(n), I would think. Hm, but we are inserting new things.

On Wed, Apr 26, 2017 at 9:15 AM, ashander notifications@github.com wrote:

😬 yes the results in #25 https://github.com/ashander/ftprime/pull/25 aren't a clear case for this. (thought it's still probably good to do.) my thought was that as dict size increases the operations on the dict slow down, possibly to a crawl. but need to do more profiling across a range of pop sizes to see this. I'll do that comment more over in #25 https://github.com/ashander/ftprime/pull/25

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/issues/24#issuecomment-297463123, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26aYrkim82wni60LDo8NaCyhkYbxKks5rz22dgaJpZM4NGQVd .

jeromekelleher commented 7 years ago

Just butting in here:

  1. simplifying the second recorded ARG to make a tree sequence T2
  2. adding T1 and T2 together --- if there is a well defined way to "add" tree sequences.

Is adding T1 and T2 together the right way to do this? I had imagined that the same tree sequence would be incrementally updated at each step. So, it would be something like:

  1. T = empty tree sequence
  2. Run the simulation for delta_t, adding records into T. t += delta_t
  3. T = T.simplify()
  4. if t < t_max, goto 2.

This seems a bit simpler to me. I don't think there will be much overhead in re-simplifying the older bits of T repeatedly, and it avoids the complexity of figuring out how to add two tree sequences together.

petrelharp commented 7 years ago

I agree with Jerome, which is also what I wrote initially. I hadn't caught that Jaime suggested something different.

ashander commented 7 years ago

Peter's original suggestion and Jerome's elaboration (no worries @jeromekelleher your input is always more than welcome!) is a good way to go for the change suggested in this issue.

My comment left some context out and could have been clearer. I should've mentioned that, for other reasons, it might be useful to define a way to "add" two tree sequences. For example if we wanted to run coalescent sims in reverse time, then run many replicate forward sims from that starting point. But implementing "add" is clearly not necessary to do simplify in steps.

ashander commented 6 years ago

But implementing "add" is clearly not necessary to do simplify in steps.

though it turned out to be useful :)