dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/

Basic analysis interface #22

Closed by ChrisBeaumont 11 years ago

ChrisBeaumont commented 11 years ago

This PR is only for discussion.

I've translated the discussion at https://github.com/dendrograms/dendro-core/wiki/Structure-analysis-API-proposal into a basic analysis module interface. What do we think about this approach? Very quickly:

1) It relies on the fact that Structure objects have indices and values attributes, which list the locations and data values at each point in a structure.
2) It requires implementing a Dendrogram.__iter__ method that iterates over the structures in a dendrogram.

However, most of the code doesn't assume anything about the astrodendro API. We could write other functions to generate collections of structures (for example, sub-sampling dendrogram structures between merger points), and this machinery would still apply.
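For instance, a statistic that needs nothing beyond those two attributes could look like this (a minimal sketch; weighted_position is an illustrative helper, not code from this PR):

import numpy as np

def weighted_position(structure):
    # Intensity-weighted mean position, using only the two
    # attributes assumed above.
    ind = np.asarray(structure.indices)  # (ndim, npix) pixel locations
    val = np.asarray(structure.values)   # data values at each pixel
    return (ind * val).sum(axis=1) / val.sum()

# Because Dendrogram.__iter__ yields structures, any such
# function can be mapped over a whole dendrogram d:
positions = [weighted_position(s) for s in d]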

The main question I have is: can you think of specific quantities that cannot easily be calculated given these current function signatures?

astrofrog commented 11 years ago

Now that we are starting to talk about physical quantities - at what level do we keep track of the WCS of the images, which will be needed for things like physical size, velocity dispersion, etc.? Is it something that should be kept in the dendrogram, or in the analysis tool? I guess it gets passed to the cataloguing function?

In the API document, you mentioned cataloguing classes, but you are using a cataloguing function here - is that deliberate?

ChrisBeaumont commented 11 years ago

Re class vs function: I like to start out with functions, since they have less state. ppv_catalog doesn't yet merit a class, IMO

We have a few options for WCS. Option 1:

d = Dendrogram.compute(...)
md = metadata_from_wcs(wcs, **extra_metadata)
catalog = ppv_catalog(d, md)

or, perhaps (option 2):

d = Dendrogram.compute(...)
catalog = ppv_catalog(d, wcs=wcs, **extra_metadata)

or, as you suggest in #4 (option 3):

d = FITSDendrogram.compute(..., wcs=wcs, **extra_metadata)
catalog = ppv_catalog(d)  # or d.catalog()

What do you think? Option 2 is probably better than option 1. I prefer 2 over 3, in the interest of keeping dendrogram classes single-responsibility.

astrofrog commented 11 years ago

Option 2 looks good to me, and simpler for the user.

astrofrog commented 11 years ago

For the results of the cataloguer, one could also consider using an Astropy table, which would be easier to write out to various formats, and also easier to modify if needed than a Numpy structured array. I don't like lists of dicts, since it would be hard to extract e.g. the mean mass for all structures.
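The practical difference in one line (a sketch, assuming a hypothetical mass column):

import numpy as np

# Astropy table: column-wise operations are direct.
mean_mass = catalog['mass'].mean()

# List of dicts: the column has to be rebuilt first.
mean_mass = np.mean([row['mass'] for row in catalog])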

astrofrog commented 11 years ago

Since we have to use WCS here anyway and therefore will need to rely on Astropy objects, one could also require things like the distance to be passed using Astropy quantities, i.e.

catalog = ppv_catalog(d, wcs=wcs, distance=300 * u.pc)

Then if you return the results as an Astropy table, the columns can also have units, e.g. mass could be given in solar masses, etc. I think it would be very convenient to make use of Astropy here, this is the kind of application the units and tables were designed for.
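As an example of the payoff (a sketch; sky_rad is a hypothetical angular-size column, and Column.quantity is current-astropy behavior):

from astropy import units as u

distance = 300 * u.pc
sky_rad = catalog['sky_rad'].quantity        # unit-bearing column as a Quantity
r_phys = sky_rad.to(u.rad).value * distance  # small-angle physical radius, in pc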

ChrisBeaumont commented 11 years ago

Both astropy quantities and astropy tables would be useful here. The question is whether astropy should be a required dependency.

The units stuff would be easy to take advantage of without requiring astropy (we would assert proper units in the metadata, if units are provided).

The list of dicts is not super useful, but can easily be transformed into lots of things (astropy table, pandas data frame, numpy recarray). Should we perform the transformation inside the function? Which object should we use? Let the user choose the output? Expect the user to transform the output manually?
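For reference, each of those transformations is a line or two away from a list of dicts (a sketch with made-up rows):

import pandas as pd
from astropy.table import Table

rows = [{'flux': 51.3, 'vrms': 1.30},
        {'flux': 102.7, 'vrms': 0.92}]

t = Table(rows=rows)              # astropy table
df = pd.DataFrame(rows)           # pandas data frame
rec = df.to_records(index=False)  # numpy recarray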

astrofrog commented 11 years ago

Astropy could be a dependency inside the cataloguing function. The user will need it anyway if they want to pass a WCS object in.

I'm biased, but I personally feel that there would be a lot of benefits from requiring Astropy as a dependency for the analysis tools. We don't have to make it a required dependency for the whole package, just the analysis tools. It's going to be increasingly hard to find astronomers using Python who don't have Astropy installed since it's going to become a dependency for a number of other packages anyway, so I don't think it's the same as requiring for example a GUI toolkit.

For the format to return, again I would suggest Astropy tables by default. They can be converted easily to Numpy recarray (the opposite is not true since we can't store meta-data in recarrays) and we plan to make Astropy tables easily convertible to Pandas DataFrame. It's also very easy for users to write common Astronomy formats from them, and even LaTeX!

ChrisBeaumont commented 11 years ago

Ok, I'm convinced. Let's use option 2 above for the catalog interface. Let's return astropy tables by default within the catalogers. While it's still easy, we can fall back to lists of dicts if astropy isn't available.
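A minimal sketch of that fallback (the helper name _as_catalog is hypothetical):

try:
    from astropy.table import Table

    def _as_catalog(rows):
        return Table(rows=rows)
except ImportError:
    def _as_catalog(rows):
        # No astropy: return the plain list of dicts.
        return rows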

astrofrog commented 11 years ago

@ChrisBeaumont - sounds good! I'm fine with reverting to simpler structures if Astropy is not available, though is there any reason not to use numpy structured arrays? Just less flexible?

ChrisBeaumont commented 11 years ago

You're right, numpy structured arrays would be fine

low-sky commented 11 years ago

Another outsider comment: I think an astropy dependency inside, e.g., PPVCataloger, is fine. I think there are fewer cases where things can go wrong if you're relying on the units and tables framework. I've been in love with astropy tables lately, so this might be a colored perception.

@ChrisBeaumont : the basic code framework looks fine to me. I know it's incomplete in the cataloging functions, but I did think it would be good to include a pixel_catalog() function that requires no metadata. If I have half a moment and you don't get to it, I'll try.
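A sketch of what that might look like (pixel_catalog here is the suggested interface, not existing code; it relies only on the values attribute assumed earlier):

import numpy as np

def pixel_catalog(structures):
    # Purely pixel-space quantities, so no metadata is needed.
    rows = []
    for s in structures:
        v = np.asarray(s.values)
        rows.append({'npix': v.size, 'peak': v.max()})
    return rows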

ChrisBeaumont commented 11 years ago

Thanks for the input! I should be able to put some code together in the next few days.

ChrisBeaumont commented 11 years ago

Items still to address:

ChrisBeaumont commented 11 years ago

@astrofrog what is the preferred way of bundling astropy Quantity arrays into astropy Tables? My first attempt converted the Quantity array to a unit-less numpy array.
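One approach that preserves the unit is to build the Column from the Quantity's value and unit separately (a sketch; newer astropy also offers QTable, which stores Quantity columns directly):

from astropy import units as u
from astropy.table import Column, QTable, Table

q = [1.0, 2.0] * u.K
t = Table([Column(q.value, name='flux', unit=q.unit)])

# Newer astropy: QTable keeps the column as a Quantity.
t2 = QTable([q], names=['flux'])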

ChrisBeaumont commented 11 years ago

This now has ppv_catalog and pp_catalog functionality. I've made the metadata optional to handle @low-sky's suggestion that it be possible to generate pixel quantities.

@astrofrog, can you take a look at what I've done so far, and let me know if you spot any big issues (more in terms of code organization as opposed to bugs)?

Here's some example code:

from astropy.io import fits
from astrodendro import Dendrogram
from astrodendro.analysis import ppv_catalog
from astropy import units as u

from time import time

data = fits.getdata('L1448.13co.un.fits')

t0 = time()
d = Dendrogram.compute(data, min_value=1.5,
                       min_npix=30, min_delta=0.5, verbose=True)
t1 = time()
print("Creating time = %i sec" % (t1 - t0))
print("Size of dendrogram: %i leaves" % len(d.leaves))

t0 = time()
c = ppv_catalog(d, {'bunit': u.K})
t1 = time()
print("Catalog time = %i sec" % (t1 - t0))
print(c)
print(c['flux'].unit)

And the output I get:

Generating dendrogram using 159,692 of 6,892,298 pixels (2% of data)
[========================================>] 100%
Creating time = 19 sec
Size of dendrogram: 112 leaves
WARNING: Converting Quantity object in units 'K' to a Numpy array [astropy.units.quantity]
WARNING: Converting Quantity object in units 'K' to a Python scalar [astropy.units.quantity]
Catalog time = 2 sec
     flux        luminosity    ...    sky_rad         vrms     
------------- ---------------- ... ------------- --------------
51.3346216679  0.0156374199982 ... 1.26823002342  1.30296145966
  102.6539253  0.0312701738558 ... 1.73544021225 0.923706752361
          ...              ... ...           ...            ...
K
astrofrog commented 11 years ago

This looks very nice! My one main comment is that things would be simplified a little, and possibly made more efficient, if we used Astropy Tables in _make_catalog from the start - i.e. create the empty table with the columns and units, and add the structures row by row; then this doesn't require converting to a table in the hack-ish way you were mentioning. As a second-best option, rather than build up a list of dicts, you could build up a dict of lists, which I think would make the conversion to Astropy Tables easier (though the units will still need to be set manually for now).
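A sketch of that first approach (the column names, units, and per-structure helpers here are illustrative):

from astropy.table import Table

t = Table(names=['flux', 'vrms'], dtype=['f8', 'f8'])
t['flux'].unit = 'K'
t['vrms'].unit = 'km / s'

for structure in d:
    # flux() and vrms() stand in for the real per-structure statistics.
    t.add_row([flux(structure), vrms(structure)])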

astrofrog commented 11 years ago

Just another high-level comment/question - do you plan to add the ability to specify a WCS later? If that is done, it would allow all the output quantities to have the correct output units, and the scale to be set correctly. In fact, if one is looking at PPV or PPP, does it not make sense to require WCS?

astrofrog commented 11 years ago

I think you'll need to rebase due to a conflict in __init__.py.

ChrisBeaumont commented 11 years ago

Ok, I added some support for WCS (which is invoked to compute positions in world coordinates).

More can be done along these lines to auto-build the metadata dictionary from things like a FITS header. However, since this can be hard to write robustly, I would like to add that functionality later -- it isn't too hard to build the metadata dicts manually at the moment.

ChrisBeaumont commented 11 years ago

I don't want to require WCS, since it may not always be available or desired (e.g., for PPV/PP maps of simulations, a WCS object may be overkill)

astrofrog commented 11 years ago

This is looking good! So just to make sure I understand: at the moment the WCS isn't used to set the actual pixel scale, correct?

ChrisBeaumont commented 11 years ago

That's correct, the user still has to provide dx and dv keywords to the metadata dictionary. The change here is that the metadata dict also accepts a wcs and wcs_origin key, to compute xcen, ycen, vcen in world coordinates.

dx and dv (and maybe some other things) could probably be auto-extracted from WCS and/or headers, but I think that should be a separate PR. The details of these things get hairy, and magic parsers can hide subtle issues or bugs.
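Concretely, a metadata dictionary might then look like this (a sketch; the values, and the wcs and d variables from the earlier example, are illustrative):

from astropy import units as u

md = {'dx': 0.01 * u.deg,       # pixel scale
      'dv': 0.1 * u.km / u.s,   # velocity channel width
      'bunit': u.K,             # data unit
      'wcs': wcs,               # used to compute xcen, ycen, vcen
      'wcs_origin': 0}
catalog = ppv_catalog(d, md)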

ChrisBeaumont commented 11 years ago

I am ready for this to be merged once you are satisfied (and the Travis tests finish).

astrofrog commented 11 years ago

Ok, sounds good - I agree, let's merge now and then we can always incrementally improve the WCS aspect.