Closed petrelharp closed 6 years ago
How much simpler would it be if IDs didn't have to be 0..n - 1? I can probably remove this requirement fairly quickly if you're happy with possible bugs that may pop up. Simplify will probably still map the samples back to 0... n -1 though --- changing this wouldn't be impossible, just a bit more work as we'd have to change the interface a bit.
It's a small but nontrivial annoyance to always be keeping space at the start of ID space for the samples. It's easy to work around, so noo big deal, but also easy to forget - was yhe source of a bug by @ashander just te other day.
On Mon, Apr 24, 2017 at 7:53 AM, Jerome Kelleher notifications@github.com wrote:
How much simpler would it be if IDs didn't have to be 0..n - 1? I can probably remove this requirement fairly quickly if you're happy with possible bugs that may pop up. Simplify will probably still map the samples back to 0... n -1 though --- changing this wouldn't be impossible, just a bit more work as we'd have to change the interface a bit.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/issues/24#issuecomment-296693884, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26QPS2XuLz31E9xqo47FfjurrkhM-ks5rzLd1gaJpZM4NGQVd .
OK, I'll have a look this evening to see if it can be done reasonably easily. I'll let you know either way.
This change could improve performance a lot, hopefully. Results in #25 suggest a most time is spent merging records, which scales with the number of nodes as ___ (TBD) and number of edgesets as (TBD).
Also, as @petrelharp and I were discussing today, this change is equivalent to:
p
from time t0
to t1
and using our alg to record the ARGT1
t1
to t2
T2
5 adding T1
and T2
together --- if there is a well defined way to "add" tree sequences.we're thinking that we can define a clear way to "add" two tree sequences T1
and T2
, given: a mapping of node numbers in T2
to those in T1
, and an offset relating times between the tree sequences. (note: I'm writing "add" because the operator would be non-commutative and assume that T1
is earlier in time than T2
.)
Hm, I'm not sure:
if I read them right, those results suggest that only about 30% of the time is in add_record
: an improvement, but not the order of magnitude we're looking for there.
I don't think merging in steps would reduce the work of add_record
, because once an individual dies their entry in the argrecorder's dict does not change; so having more history around does not make this task more complex. The dict does grow in size, so if that causes python to go slower that would be an issue, but I don't think it is a serious one?
This change would reduce memory usage (which is being a problem), but that's not what's leading to the slowdown we see over there.
Or am I missing something?
:grimacing: yes the results in #25 aren't a clear case for this. (thought it's still probably good to do.) my thought was that as dict size increases the operations on the dict slow down, possibly to a crawl. but need to do more profiling across a range of pop sizes to see this. I'll do that comment more over in #25
Accessing dicts should scale as log(n), I would think. Hm, but we are inserting new things.
On Wed, Apr 26, 2017 at 9:15 AM, ashander notifications@github.com wrote:
😬 yes the results in #25 https://github.com/ashander/ftprime/pull/25 aren't a clear case for this. (thought it's still probably good to do.) my thought was that as dict size increases the operations on the dict slow down, possibly to a crawl. but need to do more profiling across a range of pop sizes to see this. I'll do that comment more over in #25 https://github.com/ashander/ftprime/pull/25
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/issues/24#issuecomment-297463123, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26aYrkim82wni60LDo8NaCyhkYbxKks5rz22dgaJpZM4NGQVd .
Just butting in here:
- simplifying the second recorded ARG to make a tree sequence T2
- adding T1 and T2 together --- if there is a well defined way to "add" tree sequences.
Is adding T1 and T2 together the right way to do this? I had imagined that the same tree sequence would be incrementally updated at each step. So, it would be something like:
This seems a bit simpler to me. I don't think there will be much overhead in re-simplifying the older bits of T repeatedly, and it avoids the complexity of figuring out how to add two tree sequences together.
I agree with Jerome, which is also what I wrote initially. I hadn't caught that Jaime suggested something different.
Peter's original suggestion and Jerome's elaboration (no worries @jeromekelleher your input is always more than welcome!) is a good way to go for the change suggested in this issue.
My comment left some context out and could have been clearer. I should've mentioned that, for other reasons, it might be useful to define a way to "add" two tree sequences. For example if we wanted to run coalescent sims in reverse time, then run many replicate forward sims from that starting point. But implementing "add" is clearly not necessary to do simplify in steps.
But implementing "add" is clearly not necessary to do simplify in steps.
though it turned out to be useful :)
Currently we record all of history and then
simplify()
it all at once. This ends up taking rather a lot of memory. As discussed with @jeromekelleher, we ought to be able to do this stepwise. To do this, we'll need to make sure there's room enough to sample the entire population, and then:add_samples(ids)
whereids
is the current generation, and remember thatids -> range(len(ids))
simplify()
argrecorder
and remember thatrange(len(ids)) -> initial_gen
To append more things to a tree sequence we need to:
NodeTable
initial_gen
to whatever they should be