Open astrofrog opened 11 years ago
Ooops. wrong button. I think a post-computation prune of the tree will often be desired. It's probably more efficient to decimate a tree than to recompute it using new parameters. While this probably isn't needed at high priority right now, it could be a medium-term goal to offer functionality that moves one immutable tree and associated index map through a set of pruning options and produces a different tree.
@low-sky - that's a good point - if we provide a function to wrap the pruning, then we can control the mutability/immutability.
Ah, the pruning issue is a good point.
For speed reasons, I'm not sure we want to rely on a hash of the index map -- this potentially involves a scan of the entire dataset every time a Structure property is accessed.
Likewise, some profiling will determine whether it's faster to prune by creating a new set of structures, or by modifying the old ones in place. It isn't obvious to me which one will be faster, so we shouldn't rule out either option yet
@ChrisBeaumont - so for now,do you suggest that we should not go ahead and split the structures, or go ahead and then investigate how best to prune the tree?
The pruning question is more important, for sure.
@astrofrog, do you have time to put together some pseudocode of what pruning might look like? I suspect it will be conceptually easier to generate a new tree, rather than traversing the current one and modifying parent/child references throughout. I could be wrong about this.
That code will help me understand if creating a new tree from a prune is dramatically inefficient or complicated. Unless it is, I think splitting the sturctures into mutable/immutable classes is fine.
If we decide to split the structure classes, we can put off actually implementing pruning until that refactoring is done.
@ChrisBeaumont - here's a simple example, but I think it's not yet correct, because it would need to make sure that merge_level
is set to the right value, and I'm not sure how you can figure that out because it's not clear whether if 3 leaves end up being merged into a structure, which two would have merged first, without recomputing the dendrogram. Or am I missing something?
import numpy as np
import matplotlib.pyplot as plt
from astrodendro import Dendrogram
from astropy.io import fits
d = Dendrogram.compute(fits.getdata("MSX_E.fits"), verbose=True, min_data_value=1.e-5, min_delta=1.e-5)
while True:
m = 0
for structure in d.leaves:
if structure.get_npix() < 20:
m += 1
if structure.parent is not None:
structure.parent.children.remove(structure)
structure.parent._merge(structure)
d.index_map[d.index_map == structure.idx] = structure.parent.idx
else:
d.index_map[d.index_map == structure.idx] = 0
d.nodes_dict.pop(structure.idx)
print("Merged {0} leaves".format(m))
if m == 0:
break
I'm actually not sure I quite get the point of merge_level
- to me it should actually be a property of the child structure, not the parent structure - i.e. if 3 leaves merge to form a branch, then different leaves merged at different points - isn't it more interesting to find out which leaf merged when rather than which two structures merged first in the parent? I think I'm generally confused about merge_level
and why that should be used to find the height of a leaf, especially if that leaf did not merge at that level.
Of course, we could just go ahead with the structure split, make the tree immutable and then if we find that pruning is needed and we can figure it out, then we can transform the tree back to a mutable tree, prune it, and make it immutable again?
I guess I also don't get why one would prune rather than just compute the tree with the right settings to start with since we can prune-as-you-go (:tm:)?
Do we even need merge_level
? It's currently only used within Structure.height
.
In fact, I don't understand what height
describes -- it isn't useful for drawing purposes, since it's sensitive to noise spikes. For drawing purposes, I think a better definition for height is min(c.vmin for c in self.children) if self.children else self.vmax
.
Likewise, a useful definition for the bottom of a structure for drawing purposes is self.parent.height if self.parent is not None else self.vmin
I'm fine if we want to eliminate merge_level
Sounds good to me! I was equally confused about the need for merge_level
, so I think we can just remove it.
Do you think we should go ahead and try and split the Structure classes?
I'll take a pass at it, and show you the code.
I think pruning won't be a performance concern, since prune-as-you-go ( TM ) and make-a-new-tree ( TM ) should both be fast-ish (minutes, if the data fits into RAM). We can come back to this if someone decides interactive, real-time post-pruning is crucial
Ok, that sounds like a good plan. Feel free to attach the code to this issue.
@ChrisBeaumont - what do you think we should do about this? (making dendrograms immutable). Shall we try and get that into 0.1, or delay until 0.2?
Yes, this can wait On Jul 12, 2013 6:08 AM, "Thomas Robitaille" notifications@github.com wrote:
@ChrisBeaumont https://github.com/ChrisBeaumont - what do you think we should do about this? (making dendrograms immutable). Shall we try and get that into 0.1, or delay until 0.2?
— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/issues/25#issuecomment-20868481 .
As discussed in #20, part of the infrastructure in the
Structure
class (e.g. the tree index) relies on the dendrogram being immutable after the computation. One of the suggestions we mentioned there was to splitStructure
into two classes - one representing mutable structures that are used for the initial computation, and one that is used for the final structure, which is efficient but immutable.One question that I'm wondering now is that if we do this, the tree can no longer be pruned by the user after the fact, so is that what we want?
If we decide we want an alternative that leaves the possibility of having a flexible tree after the computation, we can use the hash of the index map to check whether the tree has changed - Numpy arrays can be hashed, so this is trivial. If the hash of the index map changes, all cached variables (including the tree index) would get reset.