dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/
Other
37 stars 38 forks source link

Minor change in API #17

Closed astrofrog closed 11 years ago

astrofrog commented 11 years ago

While writing the documentation, I found the use of the words intensity and coords and the use of the variable f for the intensity/values (a legacy from flux) confusing. I would like to suggest the following change in convention:

the reason for the first two is that when doing a PPP dendrogram on a density cube, density is neither a flux or intensity, so I find the use of the word value more general. Similarly, coords is confusing to me because if we ever want to use 'real' coordinates (e.g. position in PPP space or on the sky) that will be the natural word, whereas what we are returning right now really are indices.

Leaves would have a values attribute that would replace f, so that one would then access leaf.indices and leaf.values rather than leaf.coords and leaf.f.

What do other people think about this? I can implement the changes if we decide to do this.

ChrisBeaumont commented 11 years ago

Fine by me.

Some more questions along these lines: should Branch really be a sublcass of Leaf? This seems backwards, conceptually.

Branch and Leaf have the same interface, except that Branch has a descendants method. This might not be the best name either, since descendents in a family tree point rootward, while this method points leafward.

I vaguely remember discussing some of this with @bradenmacdonald, so he may want to chime in. I didn't see any specific discussion of this on the wiki. But I propose the following additional refactorings:

Whether or not the last 2 points make sense depends on how much extra logic is required to handle leaf and non-leaf cases in the body of methods like fill_footprint.

astrofrog commented 11 years ago

I was actually going to raise the same point about inheritance - I got confused initially today when isinstance(item, Leaf) gave me True for a branch! So I agree that we should use a Structure base class. I also think that the trunk should be a sub-class of Structure (that is distinct from Branch because in the Trunk, there is no connection between different objects).

descendent is actually not always rootward, since there are several ways to plot family trees, but because of this ambiguity, I agree we should be using leafward/rootward.

I kind of like having the Leaf class, as it's nicer (in my opinion) to check that something is a Leaf than to check it has zero descendents.

Another question (related to something you asked in an email) - when one accesses the values (coords and f currently) for a branch, should it be returning the values for all leafward items? My feeling is that it should, but it doesn't currently.

I can try and open a pull request that implements a lot of this, so we can see how it would look.

ChrisBeaumont commented 11 years ago

Agreed that isinstance(s, Leaf) is more meaningful than len(s.leafwards) == 0. That's partially why I suggested adding the leaf boolean property - then the test would just be if s.leaf.

My deeper question is whether the conceptual distinction between a Branch and a Leaf is meaningful enough that we should have 2 different types. My interface preference is to use a single Structure class, unless the implementation is too complicated.

I also have mixed feelings about Trunk, since it seems like it just boils down to a list of base-level structures -- cant this just be an attribute inside Dendrogram? Maybe if I see the code for Trunk, it will make more sense.

I think structure.f (or structure.value or whatever it gets renamed to) should include the contribution from substructures. coords is less obvious -- I commonly want indices including and excluding substructures. Many of the methods have a subtree=False to toggle this option, so maybe we need a new method to return both forms of coordinates as well.

astrofrog commented 11 years ago

I agree that there isn't really a fundamental difference between a leaf and a branch, and having everything be a structure instead means it's easier to make a leaf into a branch if needed. I would prefer is_leaf as an attribute to make it clearer it will return a boolean.

Regarding the trunk, my thinking was, what if I want e.g. all the flux values in the dendrogram? Since I can do e.g. leaf.f and branch.f, what if I want all the fluxes? Or should that be d.f where d is the dendrogram object?

Regarding your last point, I think values and indices should be consistent. @braden's fork actually had something along the lines of values_self and indices_self to refer to just the ones in the branch itself with no substructures. Maybe that's a good way to go?

I'll start working on some of these points.

astrofrog commented 11 years ago

Quick question - for a (present) branch, is there any difference between merge_level and fmax? I can't think of one...

ChrisBeaumont commented 11 years ago

I think they are different: fmax is the maximum value of a pixel in a structure. merge_level is the contour at which this structure joins another (i.e. it's usually similar to fmin, probably?)

astrofrog commented 11 years ago

The height of a leaf is self.fmax - self.parent.merge_level so for the parent branch, merge_level is similar to fmax. In fact, because we are creating the dendrogram from the top down, merge_level gets set to the first flux used in a branch, which should therefore be fmax (as in the maximum of the pixels in the actual branch structure, not sub-structures). It makes sense that the brightest pixel in a branch is the first pixel that wasn't part of a sub-structure, and is therefore the merge level. I guess I should try it out in practice to see if that's true, but that's how I interpret the code.

ChrisBeaumont commented 11 years ago

Remind me at what point we prune? After pruning, it is possible to have pixels in a structure brighter than the intensity where it splits into substructures -- they are noise spikes that don't get their own substructure. Likewise, it's possible to have interior noise (anti-)spikes that are fainter than the level at which a structure merges with another. So at various stages of the pruning/computing process, fmax, merge level, and fmin call all be distinct.

chris

On Fri, Jun 14, 2013 at 4:26 PM, Thomas Robitaille <notifications@github.com

wrote:

The height of a leaf is self.fmax - self.parent.merge_level so for the parent branch, merge_level is similar to fmax. In fact, because we are creating the dendrogram from the top down, merge_level gets set to the first flux used in a branch, which should therefore be fmax (as in the maximum of the pixels in the actual branch structure, not sub-structures). It makes sense that the brightest pixel in a branch is the first pixel that wasn't part of a sub-structure, and is therefore the merge level. I guess I should try it out in practice to see if that's true, but that's how I interpret the code.

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/issues/17#issuecomment-19479733 .


Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont


astrofrog commented 11 years ago

Ah yes, you are correct - ignore me :)

astrofrog commented 11 years ago

I'm going to push a (not yet functional) version of components.py to this issue so we can start looking at code. But first, what do you suggest as replacements for the following four terms:

?

ancestor could be root but I'm not sure about all the others.

Also, is trunk actually a good name since the different structures in the trunk list are not actually related? What about just structures?

ChrisBeaumont commented 11 years ago

Hmm, this is trickier than I thought! Some suggestions, though none are perfect

If none of these sound great to you, we could table this change for now

On Fri, Jun 14, 2013 at 4:57 PM, Thomas Robitaille <notifications@github.com

wrote:

I'm going to push a (not yet functional) version of components.py to this issue so we can start looking at code. But first, what do you suggest as replacements for the following four terms:

  • ancestor
  • parent
  • children
  • descendents

?

ancestor could be root but I'm not sure about all the others.

Also, is trunk actually a good name since the different structures in the trunk list are not actually related? What about just structures?

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/issues/17#issuecomment-19481272 .


Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont


astrofrog commented 11 years ago

I think we should base ourselves on what is already used in graph theory as much as possible, so I've done a little research:

In a rooted tree, the parent of a vertex is the vertex connected to it on the path to the root; every vertex except the root has a unique parent. A child of a vertex v is a vertex of which v is the parent. (http://en.wikipedia.org/wiki/Graph-theoretical_tree)

so it seems those terms are pretty standard and are currently used the right way here. If we therefore go with this, then the current terminology of ancestor and descendents also makes sense (though see below). Maybe it's just a case of spelling out the terminology in the documentation?

I'm not too happy with trunk though - if anything it should be trunks because it's a list of disjointed trees. Or maybe it should be trees. Or roots? Actually the latter seems to be common terminology too:

This data structure defines a directed graph,[b] and for it to be a tree one must add a condition on its global structure (its topology), namely that at most one reference can point to any given node (a node has at most a single parent), and no node in the tree point to the root. In fact, every node (other than the root) must have exactly one parent, and the root must have no parents.

http://en.wikipedia.org/wiki/Tree_(data_structure)

So given this, we could go with:

If we want to limit the child/parent terminology, we could also go with:

and I think parent should be unambiguous because it's similar, and the concept of a 'parent' structure should be ok. What do you think?

astrofrog commented 11 years ago

I've started to add code here with the refactoring we were talking about before. Note that right now I only changed components.py so of course all the tests will fail. One place I think it was nice to have the different Leaf and Branch class was that when one printed a list of structures, the different __repr__ made it clear what was what. However, I think this can easily be fixed by customizing __repr__.

astrofrog commented 11 years ago

Just to simplify things for now, values and indices only returns the pixels not in sub-structures for branches (as before). I can change that now, but I first wanted to ensure that the tests passed with the remaining refactoring.

astrofrog commented 11 years ago

Ok, so far the code implements the change from Leaf/Branch to Structure that we discussed above. We can still re-instate sub-classes called Leaf and Branch easily if needed. This also implements the change from coords to indices and f to values and intensity to data_value.

Things that still need to be resolved:

@ChrisBeaumont - what do you think about the above issues?

ChrisBeaumont commented 11 years ago

Those checkboxes are cool.

First checkpoint: It looks like this issue is currently addressed by values and values_all? Are you proposing changing that to something like values_single and values? Either convention is fine by me, as long as the computation is efficient (analysis code is going to need to have fast access to substructures)

Second checkpoint: The more I think of it, the more I wonder why I suggested changing parent/child terminology. I get tripped up on it sometimes, but you're right that they are standard graph terms, and better than the alternatives. I'm fine with keeping the current terminology as is for now

Third checkpoint: Are there specific enforcement issues you had in mind? I'm partial to the single Structure class, but I haven't tried out this code yet.

Given all the terminology change, I think we should add a unit test that instantiates a simple Structure, and asserts that the interface is correct. I don't think such a test exists yet? I do see that the interface is exercised indirectly elsewhere.

astrofrog commented 11 years ago

Regarding your point about testing, I was indeed planning to create test_structure.py which just tests Structure and its interface and behavior extensively - I just didn't want to start until we'd converged somewhat on the interface.

First checkpoint: yes, the question is basically whether values should be the values of all descendants or just of self by default. This is easy to switch so maybe we can indeed leave it as-is for now.

Second checkpoint: ok, let's leave it as-is

Third checkpoint: I guess that for example one cannot edit the children of a Leaf, and things like that. But then maybe that's a feature (being able to add children to a Leaf, turning it into a Branch).

astrofrog commented 11 years ago

@ChrisBeaumont - I've added preliminary tests for structure.py, and am almost there, but I've run into a case I had to mark as xfail for now. The problem is that methods like get_peak and get_npix cache the result, so don't get updated after an _add_pixel() call. Of course, _add_pixel is currently private so that should be ok in theory, but it does feel somewhat unsafe to have methods that one has to trust the user not to use.

I wonder whether there is a better way to do it - for example, it's possible to hash Numpy arrays and do hash-based caching of npix and the peak position. However, we want to keep the lists of values and indices as lists to make it more efficient to append to them, so one could convert the list to a Numpy array in order to take the hash, but then we have to look at the performance penalty of doing this...

I think we can keep this for a separate pull request, including performance tests to see what makes most sense, but just wanted to raise this here and point out why I'd marked the test as xfail.

EDIT: we could also convert lists of indices and values to tuples of indices and values, which will then be hashable. Converting a list of 1000 elements to a tuple and finding the hash takes 30µs so maybe that's good enough to implement cache-based methods.

ChrisBeaumont commented 11 years ago

I think it might be simpler to add a "_clear_cache" method, which deletes the cache values. Then, call this from _add_pixel.

astrofrog commented 11 years ago

Good point, that does sound like a cleaner solution. Though there's one remaining issue - even though I've made indices and values read-only properties, they are lists, so the lists themselves are mutable, so a user can technically do:

leaf.indices.append(1)

and corrupt the list of indices. There's a lot of issues with doing this beyond just the caching thing, so maybe we can just assume it won't happen (or better, return it as a tuple instead of a list to prevent this). What do you think?

ChrisBeaumont commented 11 years ago

This is beginning to overlap with the issue of what indices should return to most efficiently (time and space) index into numpy arrays. We should answer that question first before worrying about the append issue

astrofrog commented 11 years ago

Since we don't feel strongly about it for now, I'm going to leave values and indices as returning only the values attached to the structure (without sub-structures), and I'm going to not create Leaf and Branch for now, but will think about it more as I'm writing the docs.

I've now implemented the resetting of the cache and have unmarked the test as xfail.

I think some work needs to be done in future to try and make the structure properties more consistent internally in terms of how they compute values (recursion vs loops, caching vs no caching) but since this PR is concerned mostly with the public API, I think it's ready for a final review. I'd like to get this merged since it's quite a big PR, and it'll be easier to address some of the remaining points in separate PRs.

@ChrisBeaumont - agree about worrying about the append issue later. Once this is merged I'll open issues to track all the things that were suggested here.

@bradenmacdonald - could you review this too, since there are extensive changes?

ChrisBeaumont commented 11 years ago

Last night I started working on a more efficient way to return the indices for each structure. Once this is merged, I will open a new PR. That will be a good place to resolve the behavior for Structure.indices

astrofrog commented 11 years ago

@ChrisBeaumont - sounds good! So do the changes here seem ready to merge? If so, then we can leave it open a couple of days in case @bradenmacdonald wants to take a look before we merge.

ChrisBeaumont commented 11 years ago

+1 from me -- I like this new interface much better, and the issues that haven't yet been resolved need some more thought anyways

astrofrog commented 11 years ago

@bradenmacdonald @low-sky - I'll plan to merge this evening if there are no further comments regarding this pull request.

astrofrog commented 11 years ago

I'll merge now, but feel free to open issues if you disagree with any of the changes!