dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/

Pruning computed dendrograms. #111

Closed: e-koch closed this 10 years ago

e-koch commented 10 years ago

Hi all, I need to run large numbers of dendrograms, varying min_delta for each (Figure 4 in Burkhart et al. 2013; http://iopscience.iop.org/0004-637X/770/2/141/pdf/apj_770_2_141.pdf). It would be nice to be able to prune out features without having to recompute the whole dendrogram each time.
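For concreteness, here's roughly the workflow I have in mind (a sketch only: post_pruning is the method I'm proposing, cube is just a stand-in dataset, and I'm assuming the pruning happens in place):

import copy
from astrodendro import Dendrogram

d_full = Dendrogram.compute(cube)        # compute once, with no pruning

pruned = {}
for min_delta in (0.5, 1.0, 1.5, 2.0):
    d = copy.deepcopy(d_full)            # keep the unpruned dendrogram intact
    d.post_pruning(min_delta=min_delta)  # proposed method; prunes the copy in place
    pruned[min_delta] = d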

I've attempted to code this up, re-using the end stages of the compute function. Right now, it prunes off leaves and merges features together. I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. Can this be done without re-computing?

Here's a comparison of recomputing (top) to the output of post_pruning (bottom).

[Image: hd22 13co, recomputed dendrogram]

[Image: hd22 13co, post_pruning output]

ChrisBeaumont commented 10 years ago

"I'm running into issues with how to ensure merged features meet the pruning requirement, while keeping the tree connected. "

My approach in the past has been something like this: iteratively merge each leaf that fails the pruning criterion into its sibling, and repeat until no leaf fails.

This always keeps the tree connected (and looks similar to what you've done here). Can you elaborate about what particular problem you're running into?

e-koch commented 10 years ago

Ah ok. I've been merging the child into the parent instead of the sibling. If a leaf has no siblings, would it then be merged into its parent branch?

The issue I thought I had was redefining the values where the branches split. After looking at it again, I don't think this is a problem. I think this pruning just isn't doing the same thing as the pruning in compute, which is what's leading to the discrepancies.

Also, the pruning needs to be iterated to remove branches that are turned into leaves in the first passes. This gives something that looks like this:

[Image: hd22 13co, post-pruned with iteration]

It's similar to the recomputed one, but missing portions of the branches.

e-koch commented 10 years ago

This is the scheme I'm currently using:

For the leaves: if a leaf fails, merge into sibling. If there are no siblings, merge into parent.

For the branches: if the branch has only one child, merge into its parent.

@ChrisBeaumont , do you see any glaring omissions?

ChrisBeaumont commented 10 years ago

There are some parts of your description that I either don't understand or think should be different:

"if leaf fails, merge into sibling. If there are no siblings, merge into parent"

"if the branch has only one child, merge into its parent"

A branch should never have only one child (though I can see how your scheme as it stands can create some).

More generally, I don't think you should ever prune a branch -- I think you should iteratively prune a leaf (which often turns a branch into a leaf that might be pruned in a future iteration) until there are no more leaves to prune. Does that make sense?
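As a minimal, self-contained sketch of that loop (a toy tree here, not the actual astrodendro structures, and fails() stands in for whatever pruning criterion is being enforced):

class Node:
    def __init__(self, values, children=()):
        self.values = list(values)       # pixel values belonging to this structure
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def prune(root, fails):
    changed = True
    while changed:
        changed = False
        for leaf in leaves(root):
            if leaf is root or not fails(leaf):
                continue
            parent = leaf.parent
            sibling = next(s for s in parent.children if s is not leaf)
            sibling.values += leaf.values        # merge the failing leaf into its sibling
            parent.children.remove(leaf)
            if len(parent.children) == 1:        # the parent is left with one child:
                sibling.values += parent.values  # splice it out, so a branch can become
                sibling.parent = parent.parent   # a leaf that gets checked on a later pass
                if parent.parent is None:
                    root = sibling
                else:
                    kids = parent.parent.children
                    kids[kids.index(parent)] = sibling
            changed = True
    return root

In astrodendro the merge would also need to update the index map and the vmin/vmax bookkeeping, of course.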

Once the basic approach stabilizes, we will also need to add some tests. Those tests will be straightforward -- I think it's a reasonable requirement that any pruning parameters passed to post_prune will have the same effect as if those parameters were passed during initial dendrogram construction. Thus, the tests will build two dendrograms, one pruned at compute time and one post_pruned, and then assert that they have the same structure.
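In outline, something like this (a sketch only; load_test_data and assert_same_structure are placeholders for the package's test dataset and whatever comparison we settle on):

from astrodendro import Dendrogram

def test_post_prune_matches_compute():
    data = load_test_data()                  # placeholder: the dataset shipped with the tests
    expected = Dendrogram.compute(data, min_delta=1.0)

    d = Dendrogram.compute(data)             # no pruning at compute time
    d.post_pruning(min_delta=1.0)

    assert_same_structure(expected, d)       # placeholder comparison helper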

Thanks for tackling this, by the way. I think this will be a useful addition.

e-koch commented 10 years ago

I've made the changes and things are looking fairly good. I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning. It would make it easier to keep track of individual structures. Let me know if you think they should be re-labelled though.

I think there's an issue in how compute does the pruning though. There are leaves in the dendrogram computed with the pruning conditions enabled (I'm using min_delta=1.0) that don't satisfy the conditions. So,

d = Dendrogram.compute(array)
d.post_pruning(min_delta=1.0)

does not give the same output as

d = Dendrogram.compute(array, min_delta=1.0)

BUT, then running

d.post_pruning(min_delta=1.0)

does.

ChrisBeaumont commented 10 years ago

"I removed re-assigning the idx of the structures as this was leading to inconsistencies when comparing to the index map. I kind of like the idea of keeping the same idx for structures after pruning"

Yeah, this is an interesting question. On the one hand, it's appealing to retain index numbers for comparing pre- and post-pruned dendrograms. On the other hand, it's tempting for client code to use arrays with length equal to the number of structures and address them by structure index number, which breaks once the index numbers aren't a tightly-packed sequence and could lead to subtle bugs. Any thoughts, @astrofrog / @keflavich / others?
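For example, hypothetical client code like this only works while the idx values form a tightly-packed 0..N-1 sequence:

import numpy as np

# structures: a flat list of the dendrogram's structures
# one slot per structure, addressed by structure idx
total_value = np.zeros(len(structures))
for struct in structures:
    total_value[struct.idx] = struct.values().sum()

# once pruning leaves gaps in the numbering, this either raises an IndexError
# (idx >= len(structures)) or silently leaves unused zero entries.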

We should also track down why compute and post_pruning aren't giving the same answer. Is it easy for you to add a (failing) test that exposes the inconsistency? Then I can try to dig into that.

tomr-stargazer commented 10 years ago

Would it be overly complicated to include both a "permanent" ID and a "tight/index" ID, so that users can choose their use case? All pruning processes would then alter the "tight" ID but leave the "permanent" ID intact.

When I previously ran into similar problems with the connection between catalog indexing and _idx IDs, one workaround I had was a piece of code that looked like this:

idx_index_map = {}
for i, struct in enumerate(structures):
    # assumes the catalog rows and the list of structures share the same ordering
    idx_index_map[struct.idx] = i

for struct in structures:
    if struct.parent is None:
        continue  # trunk structures have no parent to connect to
    child_index = idx_index_map[struct.idx]
    parent_index = idx_index_map[struct.parent.idx]
    # do something like plot lines between child & parent catalog properties...

e-koch commented 10 years ago

I'll give writing the test a crack. This appears to only be an issue with min_delta.

e-koch commented 10 years ago

I added the test using the testing data set in the package, but I'm having trouble reproducing the min_delta problem I'm encountering. The testing data passes the pruning tests, while the data I've been using doesn't. I've tried this with simulated data (a PPV cube and an integrated intensity image).

For the comparison, I'm checking that the dendrograms have the same number of features and that the features have the same vmax and vmin (sorted by max). Without redefining the idx's, direct comparison of the trees is a bit cumbersome.
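In code, the comparison is roughly this (a sketch; structures1/structures2 are flat lists of each dendrogram's structures):

def same_features(structures1, structures2):
    # sort by peak value so the comparison doesn't depend on idx numbering
    s1 = sorted(structures1, key=lambda s: s.vmax, reverse=True)
    s2 = sorted(structures2, key=lambda s: s.vmax, reverse=True)
    return (len(s1) == len(s2) and
            all(a.vmax == b.vmax and a.vmin == b.vmin for a, b in zip(s1, s2)))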

ChrisBeaumont commented 10 years ago

Hmm, would it be easy for you to email or post the data and script you're using that trigger the issue? (cbeaumont [at] cfa [dot] harvard [dot] edu).

ChrisBeaumont commented 10 years ago

Thanks, I got your email.

This was an issue with how min_delta was computed. I'll push a fix.

ChrisBeaumont commented 10 years ago

This is looking good to me. I've made a bunch of style suggestions to split up the post_pruning method into simpler pieces. If you could make one more pass to address some of these, I'm happy to merge.

ChrisBeaumont commented 10 years ago

One other thought -- should post_pruning be renamed prune?

e-koch commented 10 years ago

I've made your suggested style changes and changed the name to prune. It sounds cleaner. @ChrisBeaumont thanks for all the help getting this together!

Two more thoughts: should prune also accept min_value, and should there be a warning if the pruning parameters passed are less restrictive than the original inputs?

ChrisBeaumont commented 10 years ago

I wouldn't bother with min_value, since it's not so much a pruning parameter as a preprocessing step.

Adding a warnings.warn() message if the pruning parameters are less restrictive than the original inputs sounds like a good idea.
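Roughly like this (a sketch, not the exact code; the compute-time values would come from self.params):

import warnings

def warn_if_less_restrictive(name, new_value, original_value):
    # a looser threshold cannot remove any structure that already survived compute
    if new_value < original_value:
        warnings.warn("%s=%g is less restrictive than the value used to compute "
                      "the dendrogram (%g); it will have no effect." %
                      (name, new_value, original_value))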

e-koch commented 10 years ago

I made those small changes and added warnings for the pruning parameters. The parameters also now get updated in self.params.

e-koch commented 10 years ago

I made a couple of small changes to avoid the warnings when a parameter isn't specified (i.e. equal to 0). @ChrisBeaumont, does anything else need to be addressed?

ChrisBeaumont commented 10 years ago

Looks great to me! Assuming the Travis tests pass, I'm happy to merge this.

e-koch commented 10 years ago

Great! Thanks for all the help!

ChrisBeaumont commented 10 years ago

Thanks again!