e-koch closed this issue 10 years ago
"I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. "
My approach in the past has been something like:
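The code block itself did not survive the copy here. As a stand-in, here is a minimal, self-contained sketch of one such iterative leaf-merging loop; the `Node` class and `height` attribute are toy constructions for illustration, not the astrodendro API:

```python
# Toy sketch of iterative pruning: repeatedly find a leaf that fails the
# criterion, merge it into its parent, collapse any resulting single-child
# branch, and repeat until every remaining leaf passes.
# (In a real dendrogram the leaf's pixels would be merged, not discarded.)

class Node:
    def __init__(self, name, height, children=None):
        self.name = name
        self.height = height          # toy stand-in for the leaf's delta
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(root):
    """Return all leaf nodes reachable from root."""
    out, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.children:
            stack.extend(n.children)
        else:
            out.append(n)
    return out

def prune(root, min_delta):
    """Iteratively remove failing leaves, keeping the tree connected."""
    changed = True
    while changed:
        changed = False
        for leaf in leaves(root):
            if leaf.height >= min_delta or leaf.parent is None:
                continue
            parent = leaf.parent
            parent.children.remove(leaf)      # merge failing leaf into parent
            if len(parent.children) == 1:     # collapse single-child branch
                only = parent.children[0]
                parent.children = only.children
                for c in parent.children:
                    c.parent = parent
                parent.height = max(parent.height, only.height)
            changed = True
            break                              # restart: leaf set has changed
    return root
```

Note that a branch is never pruned directly: removing a failing leaf may turn its parent branch into a leaf, which is then reconsidered on the next pass.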
This always keeps the tree connected (and looks similar to what you've done here). Can you elaborate about what particular problem you're running into?
Ah ok. I've been merging the child into the parent instead of the sibling. If a leaf has no siblings, would it then be merged into its parent branch?
The issue I thought I had was redefining the values where the branches split. After looking at it again, I don't think this is a problem. I think this pruning just isn't doing the same as the pruning in compute, which is leading to the discrepancies.
Also, the pruning needs to be iterated to remove branches that are turned into leaves in the first passes. This gives something that looks like this:
It's similar to the recomputed one, but it is missing portions of the branches.
This is the scheme I'm currently using:
For the leaves:
For the branches:
@ChrisBeaumont , do you see any glaring omissions?
There are some parts of your description that I either don't understand or think should be different:
> if leaf fails, merge into sibling. If there are no siblings, merge into parent

> if the branch has only one child, merge into its parent

A branch should never have only one child (though I can see how your scheme as it stands can create some).
More generally, I don't think you should ever prune a branch -- I think you should iteratively prune a leaf (which often turns a branch into a leaf, that might be pruned in a future iteration), until there are no more leaves to prune. Does that make sense?
Once the basic approach stabilizes, we will also need to add some tests. Those tests will be straightforward -- I think it's a reasonable requirement that any pruning parameters passed to post_prune will have the same effect as if those parameters were passed during initial dendrogram construction. Thus, the tests will build two dendrograms, one pruned at compute time and one post_pruned, and then assert that they have the same structure.
Thanks for tackling this, by the way. I think this will be a useful addition.
I've made the changes and things are looking fairly good. I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning. It would make it easier to keep track of individual structures. Let me know if you think they should be re-labelled though.
I think there's an issue in how compute does the pruning though. There are leaves in the dendrogram computed with the pruning conditions enabled (I'm using min_delta=1.0) that don't satisfy the conditions. So,
```python
d = Dendrogram.compute(array)
d.post_pruning(min_delta=1.0)
```

does not give the same output as

```python
d = Dendrogram.compute(array, min_delta=1.0)
```

BUT, then running

```python
d.post_pruning(min_delta=1.0)
```

does.
> I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning
Yeah, this is an interesting question. On the one hand, it's appealing to retain index numbers for comparing pre- and post-pruned dendrograms. OTOH, it's tempting for client code to do things like allocate an array with length equal to the number of structures and use a structure's index number to address locations in that array; that breaks when index numbers aren't a tightly packed sequence, and could lead to subtle bugs. Any thoughts, @astrofrog / @keflavich / others?
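A toy illustration of the pitfall described above (the idx values are hypothetical, not taken from a real dendrogram):

```python
import numpy as np

# Suppose pruning removed some structures but the survivors kept their
# original idx values, so the idx set is no longer a packed 0..N-1 range:
surviving_idx = [0, 2, 5]

# Client code that sizes an array by the number of structures and then
# addresses it with struct.idx will raise, or silently hit the wrong slot:
values = np.zeros(len(surviving_idx))   # length 3
# values[5]  ->  IndexError: idx 5 is outside the packed range 0..2
```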
We should also track down why compute and post_pruning aren't giving the same answer. Is it easy for you to add a (failing) test that exposes the inconsistency? Then I can try to dig into that.
Would it be overly complicated to include both a "permanent" ID and a "tight" index ID, so that users can choose their use case? All pruning processes would then alter the "tight" ID but leave the "permanent" ID intact.
When I previously ran into similar problems with the connection between catalog indexing and _idx IDs, one workaround I had was a piece of code that looked like this:
```python
idx_index_map = {}
for i, struct in enumerate(structures):
    # assumes catalog rows and the list of structures share the same ordering
    idx_index_map[struct.idx] = i

for struct in structures:
    child_index = idx_index_map[struct.idx]
    if struct.parent is not None:  # trunk structures have no parent
        parent_index = idx_index_map[struct.parent.idx]
        # do something like plot lines between child & parent catalog properties...
```
I'll give writing the test a crack. This appears to only be an issue with min_delta.
I added the test using the testing data set in the package, but I'm having trouble reproducing the min_delta problem I'm encountering. The testing data passes the pruning tests, while the data I've been using doesn't. I've tried this with simulated data (a PPV cube and an integrated-intensity image).
For the comparison, I'm checking that the dendrograms have the same number of features and that the features have the same vmax and vmin (sorted by vmax). Without redefining the idxs, direct comparison of the trees is a bit cumbersome.
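A self-contained sketch of that comparison, with `(vmin, vmax)` tuples standing in for astrodendro structures (the function name is illustrative):

```python
# Two pruned trees are considered equivalent under the criteria above if
# they have the same number of features and matching (vmin, vmax) pairs
# once both lists are sorted by vmax.

def same_features(features_a, features_b):
    if len(features_a) != len(features_b):
        return False
    by_vmax = lambda f: f[1]
    return sorted(features_a, key=by_vmax) == sorted(features_b, key=by_vmax)
```

As noted, this is weaker than comparing the trees structure-by-structure: two different topologies could in principle share the same sorted value pairs.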
hmm, would it be easy for you to email or post the data and script you are using that triggers the issue? (cbeaumont [at] cfa [dot] harvard [dot] edu).
Thanks, I got your email.
This was an issue with how min_delta was computed. I'll push a fix.
This is looking good to me. I've made a bunch of style suggestions, to split up the post_pruning method into simpler pieces. If you could make one more pass at addressing some of these, I'm happy to merge.
One other thought -- should `post_pruning` be renamed `prune`?
I've made your suggested style changes and changed the name to `prune`. It sounds cleaner. @ChrisBeaumont thanks for all the help getting this together!
Two more thoughts:

> min_value? I think it would need its own "pruning" method, separate from other parameter changes.

I wouldn't bother with min_value, since it's not so much a pruning parameter as a preprocessing step.
Adding a warnings.warn() message if the pruning parameters are less restrictive than the original inputs sounds like a good idea.
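A hedged sketch of what such a check might look like; the function name and the dict-based parameter handling are assumptions for illustration, not the merged implementation:

```python
import warnings

def warn_if_less_restrictive(new_params, original_params):
    """Warn when a post-pruning parameter is weaker than the value used at
    compute time: pruning can only remove structures, never restore them.

    Parameters equal to 0 are treated as unspecified and skipped
    (hypothetical convention, mirroring the discussion in this thread).
    """
    for name, new_val in new_params.items():
        old_val = original_params.get(name, 0)
        if new_val != 0 and new_val < old_val:
            warnings.warn("%s=%s is less restrictive than the original "
                          "value %s; removed structures cannot be restored"
                          % (name, new_val, old_val))
```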
I made those small changes and added warnings for the pruning parameters. The parameters also now get updated in self.params.
I made a couple of small changes to avoid the warnings if a parameter isn't specified (i.e. equal to 0). @ChrisBeaumont, does anything else need to be addressed?
Looks great to me! Assuming the Travis tests pass, I'm happy to merge this.
Great! Thanks for all the help!
Thanks again!
Hi all, I have the need to run large numbers of dendrograms, varying min_delta for each (Figure 4 in Burkhart et al. 2013; http://iopscience.iop.org/0004-637X/770/2/141/pdf/apj_770_2_141.pdf). It would be nice to be able to prune out features without having to recompute the whole dendrogram each time.
I've attempted to code this up, re-using the end stages of the compute function. Right now, it prunes off leaves and merges features together. I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. Can this be done without re-computing?
Here's a comparison of recomputing (top) to the output of post_pruning (bottom).