dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/
Other
37 stars 38 forks source link

Make final dendrograms immutable #25

Open astrofrog opened 11 years ago

astrofrog commented 11 years ago

As discussed in #20, part of the infrastructure in the Structure class (e.g. the tree index) relies on the dendrogram being immutable after the computation. One of the suggestions we mentioned there was to split Structure into two classes - one representing mutable structures that are used for the initial computation, and one that is used for the final structure, which is efficient but immutable.

One question that I'm wondering now is that if we do this, the tree can no longer be pruned by the user after the fact, so is that what we want?

If we decide we want an alternative that leaves the possibility of having a flexible tree after the computation, we can use the hash of the index map to check whether the tree has changed - Numpy arrays can be hashed, so this is trivial. If the hash of the index map changes, all cached variables (including the tree index) would get reset.

low-sky commented 11 years ago

Ooops. wrong button. I think a post-computation prune of the tree will often be desired. It's probably more efficient to decimate a tree than to recompute it using new parameters. While this probably isn't needed at high priority right now, it could be a medium-term goal to offer functionality that moves one immutable tree and associated index map through a set of pruning options and produces a different tree.

astrofrog commented 11 years ago

@low-sky - that's a good point - if we provide a function to wrap the pruning, then we can control the mutability/immutability.

ChrisBeaumont commented 11 years ago

Ah, the pruning issue is a good point.

For speed reasons, I'm not sure we want to rely on a hash of the index map -- this potentially involves a scan of the entire dataset every time a Structure property is accessed.

Likewise, some profiling will determine whether it's faster to prune by creating a new set of structures, or by modifying the old ones in place. It isn't obvious to me which one will be faster, so we shouldn't rule out either option yet

astrofrog commented 11 years ago

@ChrisBeaumont - so for now,do you suggest that we should not go ahead and split the structures, or go ahead and then investigate how best to prune the tree?

ChrisBeaumont commented 11 years ago

The pruning question is more important, for sure.

@astrofrog, do you have time to put together some pseudocode of what pruning might look like? I suspect it will be conceptually easier to generate a new tree, rather than traversing the current one and modifying parent/child references throughout. I could be wrong about this.

That code will help me understand if creating a new tree from a prune is dramatically inefficient or complicated. Unless it is, I think splitting the sturctures into mutable/immutable classes is fine.

If we decide to split the structure classes, we can put off actually implementing pruning until that refactoring is done.

astrofrog commented 11 years ago

@ChrisBeaumont - here's a simple example, but I think it's not yet correct, because it would need to make sure that merge_level is set to the right value, and I'm not sure how you can figure that out because it's not clear whether if 3 leaves end up being merged into a structure, which two would have merged first, without recomputing the dendrogram. Or am I missing something?

import numpy as np

import matplotlib.pyplot as plt
from astrodendro import Dendrogram
from astropy.io import fits

d = Dendrogram.compute(fits.getdata("MSX_E.fits"), verbose=True, min_data_value=1.e-5, min_delta=1.e-5)

while True:
    m = 0
    for structure in d.leaves:
        if structure.get_npix() < 20:
            m += 1
            if structure.parent is not None:
                structure.parent.children.remove(structure)
                structure.parent._merge(structure)
                d.index_map[d.index_map == structure.idx] = structure.parent.idx
            else:
                d.index_map[d.index_map == structure.idx] = 0
            d.nodes_dict.pop(structure.idx)
    print("Merged {0} leaves".format(m))
    if m == 0:
        break

I'm actually not sure I quite get the point of merge_level - to me it should actually be a property of the child structure, not the parent structure - i.e. if 3 leaves merge to form a branch, then different leaves merged at different points - isn't it more interesting to find out which leaf merged when rather than which two structures merged first in the parent? I think I'm generally confused about merge_level and why that should be used to find the height of a leaf, especially if that leaf did not merge at that level.

Of course, we could just go ahead with the structure split, make the tree immutable and then if we find that pruning is needed and we can figure it out, then we can transform the tree back to a mutable tree, prune it, and make it immutable again?

I guess I also don't get why one would prune rather than just compute the tree with the right settings to start with since we can prune-as-you-go (:tm:)?

ChrisBeaumont commented 11 years ago

Do we even need merge_level? It's currently only used within Structure.height.

In fact, I don't understand what height describes -- it isn't useful for drawing purposes, since it's sensitive to noise spikes. For drawing purposes, I think a better definition for height is min(c.vmin for c in self.children) if self.children else self.vmax.

Likewise, a useful definition for the bottom of a structure for drawing purposes is self.parent.height if self.parent is not None else self.vmin

I'm fine if we want to eliminate merge_level

astrofrog commented 11 years ago

Sounds good to me! I was equally confused about the need for merge_level, so I think we can just remove it.

Do you think we should go ahead and try and split the Structure classes?

ChrisBeaumont commented 11 years ago

I'll take a pass at it, and show you the code.

I think pruning won't be a performance concern, since prune-as-you-go ( TM ) and make-a-new-tree ( TM ) should both be fast-ish (minutes, if the data fits into RAM). We can come back to this if someone decides interactive, real-time post-pruning is crucial

astrofrog commented 11 years ago

Ok, that sounds like a good plan. Feel free to attach the code to this issue.

astrofrog commented 11 years ago

@ChrisBeaumont - what do you think we should do about this? (making dendrograms immutable). Shall we try and get that into 0.1, or delay until 0.2?

ChrisBeaumont commented 11 years ago

Yes, this can wait On Jul 12, 2013 6:08 AM, "Thomas Robitaille" notifications@github.com wrote:

@ChrisBeaumont https://github.com/ChrisBeaumont - what do you think we should do about this? (making dendrograms immutable). Shall we try and get that into 0.1, or delay until 0.2?

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/issues/25#issuecomment-20868481 .