Closed astrofrog closed 11 years ago
Fine by me.
Some more questions along these lines: should Branch
really be a sublcass of Leaf
? This seems backwards, conceptually.
Branch and Leaf have the same interface, except that Branch has a descendants
method. This might not be the best name either, since descendents
in a family tree point rootward, while this method points leafward.
I vaguely remember discussing some of this with @bradenmacdonald, so he may want to chime in. I didn't see any specific discussion of this on the wiki. But I propose the following additional refactorings:
ancestor
/descendant
terminology with leafward
/rootward
descendents
method to Leaf
class, and have it return an empty list (or [self]
, maybe)Leaf/Branch
interface in a Structure
base class, that both Branch
and Leaf
inherit fromleaf
property to Structure
, that returns whether a structure is a terminal nodeBranch
and Leaf
sibling subclasses of Structure
Leaf
entirely. Instead, have a Leaf
factory function that creates leaf-like structures, that the dendrogram compute
method can use instead of creating Leaf
instances.Branch
logic into Structure
. Then a dendrogram is just a collection of Structures
, which is cleaner.Whether or not the last 2 points make sense depends on how much extra logic is required to handle leaf and non-leaf cases in the body of methods like fill_footprint
.
I was actually going to raise the same point about inheritance - I got confused initially today when isinstance(item, Leaf)
gave me True
for a branch! So I agree that we should use a Structure
base class. I also think that the trunk should be a sub-class of Structure
(that is distinct from Branch
because in the Trunk
, there is no connection between different objects).
descendent
is actually not always rootward
, since there are several ways to plot family trees, but because of this ambiguity, I agree we should be using leafward
/rootward
.
I kind of like having the Leaf
class, as it's nicer (in my opinion) to check that something is a Leaf
than to check it has zero descendents.
Another question (related to something you asked in an email) - when one accesses the values (coords and f currently) for a branch, should it be returning the values for all leafward items? My feeling is that it should, but it doesn't currently.
I can try and open a pull request that implements a lot of this, so we can see how it would look.
Agreed that isinstance(s, Leaf)
is more meaningful than len(s.leafwards) == 0
. That's partially why I suggested adding the leaf
boolean property - then the test would just be if s.leaf
.
My deeper question is whether the conceptual distinction between a Branch
and a Leaf
is meaningful enough that we should have 2 different types. My interface preference is to use a single Structure
class, unless the implementation is too complicated.
I also have mixed feelings about Trunk
, since it seems like it just boils down to a list of base-level structures -- cant this just be an attribute inside Dendrogram
? Maybe if I see the code for Trunk
, it will make more sense.
I think structure.f
(or structure.value
or whatever it gets renamed to) should include the contribution from substructures. coords
is less obvious -- I commonly want indices including and excluding substructures. Many of the methods have a subtree=False
to toggle this option, so maybe we need a new method to return both forms of coordinates as well.
I agree that there isn't really a fundamental difference between a leaf and a branch, and having everything be a structure instead means it's easier to make a leaf into a branch if needed. I would prefer is_leaf
as an attribute to make it clearer it will return a boolean.
Regarding the trunk, my thinking was, what if I want e.g. all the flux values in the dendrogram? Since I can do e.g. leaf.f
and branch.f
, what if I want all the fluxes? Or should that be d.f
where d
is the dendrogram object?
Regarding your last point, I think values
and indices
should be consistent. @braden's fork actually had something along the lines of values_self
and indices_self
to refer to just the ones in the branch itself with no substructures. Maybe that's a good way to go?
I'll start working on some of these points.
Quick question - for a (present) branch, is there any difference between merge_level and fmax? I can't think of one...
I think they are different: fmax is the maximum value of a pixel in a structure. merge_level is the contour at which this structure joins another (i.e. it's usually similar to fmin, probably?)
The height of a leaf is self.fmax - self.parent.merge_level
so for the parent branch, merge_level
is similar to fmax
. In fact, because we are creating the dendrogram from the top down, merge_level
gets set to the first flux used in a branch, which should therefore be fmax
(as in the maximum of the pixels in the actual branch structure, not sub-structures). It makes sense that the brightest pixel in a branch is the first pixel that wasn't part of a sub-structure, and is therefore the merge level. I guess I should try it out in practice to see if that's true, but that's how I interpret the code.
Remind me at what point we prune? After pruning, it is possible to have pixels in a structure brighter than the intensity where it splits into substructures -- they are noise spikes that don't get their own substructure. Likewise, it's possible to have interior noise (anti-)spikes that are fainter than the level at which a structure merges with another. So at various stages of the pruning/computing process, fmax, merge level, and fmin call all be distinct.
chris
On Fri, Jun 14, 2013 at 4:26 PM, Thomas Robitaille <notifications@github.com
wrote:
The height of a leaf is self.fmax - self.parent.merge_level so for the parent branch, merge_level is similar to fmax. In fact, because we are creating the dendrogram from the top down, merge_level gets set to the first flux used in a branch, which should therefore be fmax (as in the maximum of the pixels in the actual branch structure, not sub-structures). It makes sense that the brightest pixel in a branch is the first pixel that wasn't part of a sub-structure, and is therefore the merge level. I guess I should try it out in practice to see if that's true, but that's how I interpret the code.
— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/issues/17#issuecomment-19479733 .
Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont
Ah yes, you are correct - ignore me :)
I'm going to push a (not yet functional) version of components.py
to this issue so we can start looking at code. But first, what do you suggest as replacements for the following four terms:
ancestor
parent
children
descendents
?
ancestor
could be root
but I'm not sure about all the others.
Also, is trunk
actually a good name since the different structures in the trunk list are not actually related? What about just structures
?
Hmm, this is trickier than I thought! Some suggestions, though none are perfect
base
, trunk
, root
rootward
leafward
substructures
,
all_leafward
...?If none of these sound great to you, we could table this change for now
On Fri, Jun 14, 2013 at 4:57 PM, Thomas Robitaille <notifications@github.com
wrote:
I'm going to push a (not yet functional) version of components.py to this issue so we can start looking at code. But first, what do you suggest as replacements for the following four terms:
- ancestor
- parent
- children
- descendents
?
ancestor could be root but I'm not sure about all the others.
Also, is trunk actually a good name since the different structures in the trunk list are not actually related? What about just structures?
— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/issues/17#issuecomment-19481272 .
Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont
I think we should base ourselves on what is already used in graph theory as much as possible, so I've done a little research:
In a rooted tree, the parent of a vertex is the vertex connected to it on the path to the root; every vertex except the root has a unique parent. A child of a vertex v is a vertex of which v is the parent. (http://en.wikipedia.org/wiki/Graph-theoretical_tree)
so it seems those terms are pretty standard and are currently used the right way here. If we therefore go with this, then the current terminology of ancestor
and descendents
also makes sense (though see below). Maybe it's just a case of spelling out the terminology in the documentation?
I'm not too happy with trunk
though - if anything it should be trunks
because it's a list of disjointed trees. Or maybe it should be trees
. Or roots
? Actually the latter seems to be common terminology too:
This data structure defines a directed graph,[b] and for it to be a tree one must add a condition on its global structure (its topology), namely that at most one reference can point to any given node (a node has at most a single parent), and no node in the tree point to the root. In fact, every node (other than the root) must have exactly one parent, and the root must have no parents.
http://en.wikipedia.org/wiki/Tree_(data_structure)
So given this, we could go with:
root
(instead of ancestor
)parent
children
descendants
If we want to limit the child/parent
terminology, we could also go with:
roots
instead of trunk
root
(instead of ancestor
, points to the item in roots
, so consistent)parent
subtree
(instead of children
)get_flat_subtree()
(instead of descendants
)and I think parent
should be unambiguous because it's similar, and the concept of a 'parent' structure should be ok. What do you think?
I've started to add code here with the refactoring we were talking about before. Note that right now I only changed components.py
so of course all the tests will fail. One place I think it was nice to have the different Leaf and Branch class was that when one printed a list of structures, the different __repr__
made it clear what was what. However, I think this can easily be fixed by customizing __repr__
.
Just to simplify things for now, values
and indices
only returns the pixels not in sub-structures for branches (as before). I can change that now, but I first wanted to ensure that the tests passed with the remaining refactoring.
Ok, so far the code implements the change from Leaf/Branch to Structure that we discussed above. We can still re-instate sub-classes called Leaf
and Branch
easily if needed. This also implements the change from coords
to indices
and f
to values
and intensity
to data_value
.
Things that still need to be resolved:
Structure.indices
and Structure.values
should return the values for all sub-structures too, which I think we said it should? An alternative is to have get_indices()
and get_values()
that take a subtree
option.ancestor
, etc. My current preference is for the following terminology:
roots
instead of trunk
root
(instead of ancestor
, points to the item in roots
, so consistent)parent
subtree
(instead of children
)get_flat_subtree()
(instead of descendants
)Leaf
and Branch
(and maybe Root
) sub-classes that enforce certain conditions on Structure
, for convenience.@ChrisBeaumont - what do you think about the above issues?
Those checkboxes are cool.
First checkpoint:
It looks like this issue is currently addressed by values
and values_all
? Are you proposing changing that to something like values_single
and values
? Either convention is fine by me, as long as the computation is efficient (analysis code is going to need to have fast access to substructures)
Second checkpoint: The more I think of it, the more I wonder why I suggested changing parent/child terminology. I get tripped up on it sometimes, but you're right that they are standard graph terms, and better than the alternatives. I'm fine with keeping the current terminology as is for now
Third checkpoint:
Are there specific enforcement issues you had in mind? I'm partial to the single Structure
class, but I haven't tried out this code yet.
Given all the terminology change, I think we should add a unit test that instantiates a simple Structure
, and asserts that the interface is correct. I don't think such a test exists yet? I do see that the interface is exercised indirectly elsewhere.
Regarding your point about testing, I was indeed planning to create test_structure.py
which just tests Structure
and its interface and behavior extensively - I just didn't want to start until we'd converged somewhat on the interface.
First checkpoint: yes, the question is basically whether values
should be the values of all descendants or just of self by default. This is easy to switch so maybe we can indeed leave it as-is for now.
Second checkpoint: ok, let's leave it as-is
Third checkpoint: I guess that for example one cannot edit the children of a Leaf, and things like that. But then maybe that's a feature (being able to add children to a Leaf, turning it into a Branch).
@ChrisBeaumont - I've added preliminary tests for structure.py
, and am almost there, but I've run into a case I had to mark as xfail for now. The problem is that methods like get_peak
and get_npix
cache the result, so don't get updated after an _add_pixel()
call. Of course, _add_pixel
is currently private so that should be ok in theory, but it does feel somewhat unsafe to have methods that one has to trust the user not to use.
I wonder whether there is a better way to do it - for example, it's possible to hash Numpy arrays and do hash-based caching of npix and the peak position. However, we want to keep the lists of values and indices as lists to make it more efficient to append to them, so one could convert the list to a Numpy array in order to take the hash, but then we have to look at the performance penalty of doing this...
I think we can keep this for a separate pull request, including performance tests to see what makes most sense, but just wanted to raise this here and point out why I'd marked the test as xfail.
EDIT: we could also convert lists of indices and values to tuples of indices and values, which will then be hashable. Converting a list of 1000 elements to a tuple and finding the hash takes 30µs so maybe that's good enough to implement cache-based methods.
I think it might be simpler to add a "_clear_cache" method, which deletes the cache values. Then, call this from _add_pixel
.
Good point, that does sound like a cleaner solution. Though there's one remaining issue - even though I've made indices
and values
read-only properties, they are lists, so the lists themselves are mutable, so a user can technically do:
leaf.indices.append(1)
and corrupt the list of indices. There's a lot of issues with doing this beyond just the caching thing, so maybe we can just assume it won't happen (or better, return it as a tuple instead of a list to prevent this). What do you think?
This is beginning to overlap with the issue of what indices should return to most efficiently (time and space) index into numpy arrays. We should answer that question first before worrying about the append issue
Since we don't feel strongly about it for now, I'm going to leave values
and indices
as returning only the values attached to the structure (without sub-structures), and I'm going to not create Leaf
and Branch
for now, but will think about it more as I'm writing the docs.
I've now implemented the resetting of the cache and have unmarked the test as xfail.
I think some work needs to be done in future to try and make the structure properties more consistent internally in terms of how they compute values (recursion vs loops, caching vs no caching) but since this PR is concerned mostly with the public API, I think it's ready for a final review. I'd like to get this merged since it's quite a big PR, and it'll be easier to address some of the remaining points in separate PRs.
@ChrisBeaumont - agree about worrying about the append issue later. Once this is merged I'll open issues to track all the things that were suggested here.
@bradenmacdonald - could you review this too, since there are extensive changes?
Last night I started working on a more efficient way to return the indices for each structure. Once this is merged, I will open a new PR. That will be a good place to resolve the behavior for Structure.indices
@ChrisBeaumont - sounds good! So do the changes here seem ready to merge? If so, then we can leave it open a couple of days in case @bradenmacdonald wants to take a look before we merge.
+1 from me -- I like this new interface much better, and the issues that haven't yet been resolved need some more thought anyways
@bradenmacdonald @low-sky - I'll plan to merge this evening if there are no further comments regarding this pull request.
I'll merge now, but feel free to open issues if you disagree with any of the changes!
While writing the documentation, I found the use of the words
intensity
andcoords
and the use of the variablef
for the intensity/values (a legacy fromflux
) confusing. I would like to suggest the following change in convention:intensity
->value
fmin/fmax
->vmin/vmax
coords
->indices
the reason for the first two is that when doing a PPP dendrogram on a density cube, density is neither a flux or intensity, so I find the use of the word
value
more general. Similarly,coords
is confusing to me because if we ever want to use 'real' coordinates (e.g. position in PPP space or on the sky) that will be the natural word, whereas what we are returning right now really are indices.Leaves would have a
values
attribute that would replacef
, so that one would then accessleaf.indices
andleaf.values
rather thanleaf.coords
andleaf.f
.What do other people think about this? I can implement the changes if we decide to do this.