dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/
Other
37 stars 38 forks source link

Implement mass vs radius plotting, etc. #51

Open keflavich opened 11 years ago

keflavich commented 11 years ago

This is related to the PR I submitted on the old version of astrodendro, but basically: I typically want to run dendrograms on 2D maps to see, as a function of contour level, how much mass is in that contour. It's a very convenient way to make density-threshold based cutoffs (see, e.g., Jens Kauffman's recent papers).

Anyway, to do this, you need some tree-walking code, which I don't think is presently in astrodendro. Here's a sample that I used:

def get_mostest_children(item, props=('npix','f_sum'), mostest='f_sum'):
    """
    Return certain properties of all children selecting the mostest
    of something (e.g., brightest)
    """
    item.get_dens = lambda: item.values().sum()/item.get_npix()**(1.5)
    item.get_f_sum = lambda: item.values().sum()
    if item.is_leaf:
        d = dict([(p,[get(item,p)]) for p in props])
        d['branch'] = [item]
        return d
    else:
        brightest = item.children[0]
        brightest.get_dens = lambda: brightest.values().sum()/item.get_npix()**(1.5)
        brightest.get_f_sum = lambda: brightest.values().sum()
        for child in item.children[1:]:
            child.get_dens = lambda: child.values().sum()/item.get_npix()**(1.5)
            child.get_f_sum = lambda: child.values().sum()
            if get(child,mostest) > get(brightest,mostest):
                brightest = child
        brightest_props = get_mostest_children(brightest)
        d = dict([(p,[get(item,p)] + brightest_props[p])
                     for p in props])
        d['branch'] = [item] + brightest_props['branch']
        return d

(where I think get is a matplotlib-only convenience function...).

I'm not submitting this as a PR yet, in part because I think there are some "convenience functions" that ought to be implemented, e.g.: Structure.get_sum(self): return self.values().sum() and maybe some mechanism for computing other values, e.g. the density as I've done above, for a given branch/tree.

I've used this approach effectively already:

grsmc_43 30_dendrograms

ChrisBeaumont commented 11 years ago

Hi @keflavich

The ability to sample properties of structures Structures "in-between" the merger contours is a functionality that we definitely should add. I think the analysis module is well-suited to do this, as it already implements extracting masses, fluxes, areas, etc from Structures. The only assumption that the analysis module makes about Structures is that they have 3 attributes: idx (identifier), indices (pixel locations), and values (pixel intensities).

Thus, I think the best way to implement what you're talking about is to write a few utility functions that take a dendrogram and iterate over sub-sampled structures. For example, something like this:


def subsample(dendrogram, subsamples_per_structure = 10):
    for structure in dendrogram:
        for i in linsspace(0, 100, subsamples_per_structure):
            idx = '%i_%i' % (structure.idx, i)
            values = (brightest ith percentile of pixels in structure)
            indices = (locations of values)
            yield Structure(indices, values, idx=idx)

cat = ppv_catalog(subsample(dendrogram), metadata)

This isn't exactly what you need, since you still need a few more lines of code to associate which rows in the catalog are part of the same structure, for plotting lines. That's fairly straightforward, however (maybe you want one more convenience function to turn a subsampled catalog into a list of lists, or something like that).

Do you think this general approach meets your use case?

keflavich commented 11 years ago

I'm not really sure. You seem to be creating new associations between structures... why do you need new idx's? In the example I showed, all I did was take a single branch and iterate to its leaves, taking the brightest child at each level. I don't think there's any need to create new structures, but perhaps you're doing that to enforce immutability?

(also, my example case may not be easy for PPV analysis, since n_pix is not proportional to the projected area)

keflavich commented 11 years ago

I should clarify, the lambda's I threw on to each structure are just convenience functions to allow one to easily switch between selecting on, say, the most pixels vs the greatest sum of pixels vs. some other quantity. Really, the function would be more readable as:

def get_biggest_children(item, props=('npix','f_sum')):
    """
    Return certain properties of all children selecting them based on their size
    (i.e., greatest # of pixels)
    """
    if item.is_leaf:
        d = dict([(p,[get(item,p)]) for p in props])
        d['branch'] = [item]
        return d
    else:
        biggest = item.children[0]
        for child in item.children[1:]:
            if child.npix > biggest.npix:
                biggest = child
        biggest_props = get_biggest_children(biggest)
        d = dict([(p,[get(item,p)] + biggest_props[p])
                     for p in props])
        d['branch'] = [item] + biggest_props['branch']
        return d

where the result would look something like this:

In [534]: get_mostest_children(flatdend.structures_dict[10897],mostest='npix')
Out[534]:
{'branch': [<Structure type=branch idx=10897>,
  <Structure type=branch idx=12332>,
  <Structure type=branch idx=13000>,
  <Structure type=branch idx=11609>,
  <Structure type=branch idx=12996>,
  <Structure type=leaf idx=12571>],
 'f_sum': [1.1652813e+24,
  9.1452313e+23,
  6.7597085e+23,
  5.8008197e+23,
  4.3262043e+23,
  2.3222251e+23],
 'npix': [346, 269, 187, 158, 116, 56]}

(where I've run this on a map in column-density units)

The dictionary entries f_sum and npix are now easily plottable as arrays.

ChrisBeaumont commented 11 years ago

Ah, I think I see what you are getting at here. I thought you were referring to sub-sampling structures at finer contour levels (something levelprops does, and we should too). But I see now that you are traversing a single path from a branch to a leaf, and extracting properties.

The main point I was trying to make is that the pp_catalog and ppv_catalog functions take care of the logic of extracting quantities like flux, area, etc from structures. The input is an iterable of structures (where a Structure is generically a collection of pixel locations and intensities). Thus, I think best way to implement functions like get_biggest_children is to extract the specific structure path (idx 10897 --> 12571), and then pass this list to pp_catalog. In pseudocode:

def biggest_path(structure):
    '''traverse structure to leaf, choosing biggest child at each step'''
    if structure.is_leaf:
        return [structure]
   return [structure] + biggest_path(max(structure.children, key=lambda x: len(x.values)))

cat = pp_catalog(biggest_path(structure), metadata, fields=['flux', 'npix'])

The other (less relevant) point I was making is that you can also build new structures that subsample the canonical structures in the dendrogram, if you wanted more than 6 samples along this path)

keflavich commented 11 years ago

OK, that makes sense now.

The pseudocode is pretty much what I had implemented. I didn't know max had a key kwarg... that's clever!