dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/
Other
37 stars 38 forks source link

Encourage to split compute and analysis and make clear that structure IDs can change at every compute #84

Closed astrofrog closed 10 years ago

astrofrog commented 10 years ago

I think this can still do with some work, as currently one of the examples uses the idx assuming it is deterministic.

astrofrog commented 10 years ago

Suggestions welcome!

ChrisBeaumont commented 10 years ago

I think we should rework the examples so that they use structure_at to retrieve structures by position, instead of making the more fragile assumption that the idx value is stable.

Likewise, I worry that this PR draws too much attention to the fact that structure IDs are unstable -- this seems like the kind of gotcha that should be in a warning or note tag, instead of emphasized in the main text. Does the average reader need to know about this on his or her first read? @keflavich is in a better position to comment on that.

astrofrog commented 10 years ago

@ChrisBeaumont - I need to check this later, but how exactly does structure_at work conceptually - does it return the smallest structure containing the point? I could imagine having the interactive viewer return the command needed to retrieve the structure instead of the structure ID.

astrofrog commented 10 years ago

So for the record the other option we can consider is to assign a simple hash to each structure that is based on the indices of the pixels in that structure. This would be stable irrespective of the algorithm. It's a question of whether it's easier to tell people to select structure '3a4b26' or to find out the pixel which uniquely identifies the structure (a pixel can be contained in many structures at different levels, but one has to realize that choosing the pixel will select the smallest structure which does). Of course, we could implement both.

ChrisBeaumont commented 10 years ago

I don't want to change index values to hashes, since I think we will eventually want to pack structures into numpy arrays, and use integer indexes.

On Mon, Sep 23, 2013 at 9:38 AM, Thomas Robitaille <notifications@github.com

wrote:

So for the record the other option we can consider is to assign a simple hash to each structure that is based on the indices of the pixels in that structure. This would be stable irrespective of the algorithm. It's a question of whether it's easier to tell people to select structure '3a4b26' or to find out the pixel which uniquely identifies the structure (a pixel can be contained in many structures at different levels, but one has to realize that choosing the pixel will select the smallest structure which does). Of course, we could implement both.

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/pull/84#issuecomment-24920033 .


Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont


keflavich commented 10 years ago

I'm in favor of a structure_at approach - identification by pixel is great. Of course, you can really only identify a tree, not an individual leaf or branch, by pixel, since pixels are not unique to a single node. But presumably you could have, for a given pixel, 1...N where 1 corresponds to the smallest structure and N to the largest, and there would be non-sequential numbers at mergers?

Regarding the in-text warnings about idxs, I like the way it's laid out now. Given that some dendrograms are expensive to compute, I do want to know whether idxs persist and how to find them. I think the documentation addition is a good thing.

ChrisBeaumont commented 10 years ago

structure_at returns the smallest structure overlapping that pixel

astrofrog commented 10 years ago

The question is, when someone is using the interactive visualizer, how do they know what pixels uniquely identify the highlighted structure? Should that be printed to the window instead of the ID?

keflavich commented 10 years ago

I like that idea: print the 2- or 3-tuple with the peak coordinate (since that's useful information anyway), plus the hash?

On Mon, Sep 23, 2013 at 3:48 PM, Thomas Robitaille <notifications@github.com

wrote:

The question is, when someone is using the interactive visualizer, how do they know what pixels uniquely identify the highlighted structure? Should that be printed to the window instead of the ID?

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/pull/84#issuecomment-24958017 .

Adam

astrofrog commented 10 years ago

Ah, but the peak coordinate can't be used - we have to use some random coordinate that just happens to be part of the structure and defines it because it's the smallest structure containing that pixel.

keflavich commented 10 years ago

Right, sorry, I had an incorrect mental picture. That's less useful, then. It would still probably be useful to be able to index by a pixel within the structure, but a hash is just as good at that point.

On Mon, Sep 23, 2013 at 4:01 PM, Thomas Robitaille <notifications@github.com

wrote:

Ah, but the peak coordinate can't be used - we have to use some random coordinate that just happens to be part of the structure and defines it because it's the smallest structure containing that pixel.

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/pull/84#issuecomment-24958971 .

Adam

ChrisBeaumont commented 10 years ago

Personally, anything involving a hash feels very "programm-y" and I worry about it intimidating users.

On Mon, Sep 23, 2013 at 6:04 PM, Adam Ginsburg notifications@github.comwrote:

Right, sorry, I had an incorrect mental picture. That's less useful, then. It would still probably be useful to be able to index by a pixel within the structure, but a hash is just as good at that point.

On Mon, Sep 23, 2013 at 4:01 PM, Thomas Robitaille < notifications@github.com

wrote:

Ah, but the peak coordinate can't be used - we have to use some random coordinate that just happens to be part of the structure and defines it because it's the smallest structure containing that pixel.

— Reply to this email directly or view it on GitHub< https://github.com/dendrograms/dendro-core/pull/84#issuecomment-24958971> .

Adam

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/pull/84#issuecomment-24959247 .


Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont


keflavich commented 10 years ago

I feel like a tuple is going to be just as bad: (13,49,36) vs 'e3vc6a' Perhaps it would be best to come up with a unique mapping to nice names ('src1','src2','colin') along with the hash?

On Mon, Sep 23, 2013 at 5:17 PM, Chris Beaumont notifications@github.comwrote:

Personally, anything involving a hash feels very "programm-y" and I worry about it intimidating users.

On Mon, Sep 23, 2013 at 6:04 PM, Adam Ginsburg notifications@github.comwrote:

Right, sorry, I had an incorrect mental picture. That's less useful, then. It would still probably be useful to be able to index by a pixel within the structure, but a hash is just as good at that point.

On Mon, Sep 23, 2013 at 4:01 PM, Thomas Robitaille < notifications@github.com

wrote:

Ah, but the peak coordinate can't be used - we have to use some random coordinate that just happens to be part of the structure and defines it because it's the smallest structure containing that pixel.

— Reply to this email directly or view it on GitHub< https://github.com/dendrograms/dendro-core/pull/84#issuecomment-24958971>

.

Adam

— Reply to this email directly or view it on GitHub< https://github.com/dendrograms/dendro-core/pull/84#issuecomment-24959247> .


Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont


Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/pull/84#issuecomment-24963380 .

Adam

astrofrog commented 10 years ago

Note that if we use a hash we can make it purely numerical so that it looks just the same as the current structure iDs. I'm going to try different things out to see what would be more natural.

astrofrog commented 10 years ago

Ok, so here are the options I can think of:

One advantage of the first two is that if the dendrogram is recomputed, it's unlikely the same hash will work again, so this will raise flags with the user. However, in the second case, getting a structure based on the pixel position will always return something, but that something may be different after each compute run.

@ChrisBeaumont @keflavich - any preference/vote?

ChrisBeaumont commented 10 years ago

I'd prefer a scheme that densely packs indices from 0-n, so that we can store them in numpy arrays, and easily index them if we want to. One easy way to do this is to transform hashes via np.argsort. I also think we should avoid any scheme that has to touch every pixel when computing a hash, since that's expensive.

On Tue, Sep 24, 2013 at 2:26 PM, Thomas Robitaille <notifications@github.com

wrote:

Ok, so here are the options I can think of:

-

Alphanumeric hash with e.g. 8 characters: 8de3b197 (computed from all pixels in a structure, unique to that exact structure (if the structure changes in a subsequent run, the hash changes)

Numeric hash, e.g. 39563451 but use a long integer instead of hexadecimal as for the hash.

Tuple of pixels, e.g. (43, 455, 123), which is used to define the smallest structure containing this pixel

Prettified set of pixel positions, e.g. 43-455-123 which is easier to show in an interactive viewer than a tuple.

One advantage of the first two is that if the dendrogram is recomputed, it's unlikely the same hash will work again, so this will raise flags with the user. However, in the second case, getting a structure based on the pixel position will always return something, but that something may be different after each compute run.

@ChrisBeaumont https://github.com/ChrisBeaumont @keflavichhttps://github.com/keflavich- any preference/vote?

— Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/pull/84#issuecomment-25030084 .


Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont


keflavich commented 10 years ago

I like Chris's idea: alphanum hash, but made indexable in arrays. The array packing would really make indexing much more intuitive.

On Tue, Sep 24, 2013 at 4:21 PM, Chris Beaumont notifications@github.comwrote:

I'd prefer a scheme that densely packs indices from 0-n, so that we can store them in numpy arrays, and easily index them if we want to. One easy way to do this is to transform hashes via np.argsort. I also think we should avoid any scheme that has to touch every pixel when computing a hash, since that's expensive.

On Tue, Sep 24, 2013 at 2:26 PM, Thomas Robitaille < notifications@github.com

wrote:

Ok, so here are the options I can think of:

Alphanumeric hash with e.g. 8 characters: 8de3b197 (computed from all pixels in a structure, unique to that exact structure (if the structure

changes in a subsequent run, the hash changes)

Numeric hash, e.g. 39563451 but use a long integer instead of

hexadecimal as for the hash.

Tuple of pixels, e.g. (43, 455, 123), which is used to define the

smallest structure containing this pixel

Prettified set of pixel positions, e.g. 43-455-123 which is easier to show in an interactive viewer than a tuple.

One advantage of the first two is that if the dendrogram is recomputed, it's unlikely the same hash will work again, so this will raise flags with the user. However, in the second case, getting a structure based on the pixel position will always return something, but that something may be different after each compute run.

@ChrisBeaumont https://github.com/ChrisBeaumont @keflavich< https://github.com/keflavich>- any preference/vote?

— Reply to this email directly or view it on GitHub< https://github.com/dendrograms/dendro-core/pull/84#issuecomment-25030084> .


Chris Beaumont Graduate Student Institute for Astronomy University of Hawaii at Manoa 2680 Woodlawn Drive Honolulu, HI 96822 www.ifa.hawaii.edu/~beaumont


Reply to this email directly or view it on GitHubhttps://github.com/dendrograms/dendro-core/pull/84#issuecomment-25047527 .

Adam

astrofrog commented 10 years ago

Ok, let me see what I can do!

astrofrog commented 10 years ago

@ChrisBeaumont @keflavich - check out the changes in structure.py and dendrogram.py. This is my attempt at defining a deterministic ID for structures. The idea is that for each structure, we pick the pixel belonging to the structure itself (excluding subtrees) that comes first when sorting the indices (safer than picking e.g. the faintest, if there are two fluxes the same). Then, we again sort the structures according to this unique index for each structure.

The names identifying_pixel and clean_id are of course temporary, but this brings up the question - do we keep the idx as-is, and if not, do we need to update the index map based on the new IDs? (this could be the bottleneck). Alternatively we can keep idx as a private property that matches the index map, while using id for the new public ID. However, people might be confused when looking at the index map.

What do you think?

astrofrog commented 10 years ago

@ChrisBeaumont - ok, I've implemented it a bit differently following your suggestions. Now we compute the smallest_index for each structure on-the-fly, much like vmin and vmax. The performance impact of this is so small I wasn't actually able to quantify it since noise from run to run dominated (even averaging over 10 runs).

At the end of the dendrogram computation, the idx is recomputed for every structure. The performance impact of this is a 10% slowdown (mostly due to re-assigning the indices in the index map). In my opinion, this is an acceptable performance slowdown.

What do you think?

astrofrog commented 10 years ago

Oh, one thing, the idx can only go from 0 to n-1 if we change the null value (i.e. no structure) to e.g. -1. What do you think?

astrofrog commented 10 years ago

If we agree on this approach, I will remove the changes to the docs since the structure IDs then become fully deterministic.

ChrisBeaumont commented 10 years ago

:+1: -- finally starting to like this :)

astrofrog commented 10 years ago

@ChrisBeaumont - this is ready for review! There is one change you might want to check, which is debac95 - what confuses me is that the change doesn't seem to matter, i.e. tests pass either way. Can you confirm whether this change should be made?

astrofrog commented 10 years ago

@ChrisBeaumont - just a reminder about this pull request. I think we're now finally close to a release! (which I'm happy to tidy up once the few open pull requests have been resolved)

ChrisBeaumont commented 10 years ago

I'm also surprised that debac95 didn't break anything, but it may just mean none of the index tests checked the specific value of pixels outside the dendrogram. This looks good to merge

astrofrog commented 10 years ago

Ok, thanks!