dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/
Other
37 stars 38 forks source link

Efficient indexing #20

Closed ChrisBeaumont closed 11 years ago

ChrisBeaumont commented 11 years ago

This is not ready to merge, and should also wait for #17 (which this PR is rebased against)

I've added a TreeIndex class for to more efficiently extract the locations of each structure.

At the moment, the usage is as follows:

d = Dendrogram.compute(data_array)
ti = TreeIndex(d)
inds = ti.indices(d.trunk[0].idx, subtree=True)

inds is a tuple of numpy arrays, similar to the output of np.where. They can be used as a fancy index to data_array.

TreeIndex stores a flattened and permuted version of np.indices(data_array.shape), such that every structure (with or without subtrees) corresponds to a single contiguous slice of this array. This means that its space requirement is ~len(data_array.shape) * data_array.size. The indices method is more or less a single array slice, so it should be fast.

@astrofrog, I'd like to change the return value of Structure.indices and Structure.indices_all to return a tuple of numpy arrays, instead of a list of tuples. Then, at the end of Dendrogram.compute, I'd like to compute a TreeIndex, and store this index with each structure. This will let us use the TreeIndex to trivially implement Structure.indices, without recursion and in O(1) time.

Are you ok with this interface change to Structure.indices?

I've also trimmed dendrogram.index_map, as discussed in #18

astrofrog commented 11 years ago

@ChrisBeaumont - just to make sure I understand, would the indices be stored internally as lists of Numpy arrays, and then converted when calling indices to a tuple of Numpy arrays? Or would the storage be as it is right now, with the risk of having an expensive conversion to a tuple of arrays when calling indices?

ChrisBeaumont commented 11 years ago

The TreeIndex stores a tuple (whose length = dimension of input data) of 1D integer arrays (whose length = input_data.size). Each entry in the tuple corresponds to a location along the nth dimension. Each call to TreeIndex.indices involves extracting a contiguous slice form each of these arrays (cheap, since contiguous slices are views), packing those into a tuple (also cheap, since this doesn't copy the array either), and returning

Here's an example, looking at the leafiest branch in a dendrogram:

from astrodendro import Dendrogram
from astrodendro.dendrogram import TreeIndex
from astropy.io import fits

im = fits.getdata('L1448.13co.un.fits')
d = Dendrogram.compute(im, min_data_value=2, min_npix=100, min_delta=1)
ti = TreeIndex(d)

s = d.trunk[1]
%timeit s.indices_all
%timeit ti.indices(s.idx, subtree=True)
1 loops, best of 3: 116 ms per loop
10000 loops, best of 3: 20.1 µs per loop
astrofrog commented 11 years ago

@ChrisBeaumont - ah, I see, that looks really handy & fast! In terms of Structure.indices though, do you envisage Structure itself storing a list of numpy arrays, or a list of tuples?

ChrisBeaumont commented 11 years ago

I think all Structures would store a (shared) reference to a TreeIndex object, and nothing else. Then indices, values, and the like just wrap around TreeIndex. The old index/value lists can even be deleted at that point, which would free up a lot of memory (since a python int is heavier than a numpy int)

The only awkward aspect of this is that the Structure internals look very different when the dendrogram is computed, since they still need to build up the old index/value lists.

Does that make sense?

ChrisBeaumont commented 11 years ago

To be explicit, code would look something like this

@property
def indices(self):
    return self._tree_index.indices(self.idx, subtree=False)
astrofrog commented 11 years ago

@ChrisBeaumont - ah, I think I see what you are saying - so essentially the idea is that once a dendrogram is generated, it becomes static and after that it is very efficient but immutable?

I agree that it's weird that the internal structure of the Dendrogram would change between before and after the end of the computation, but one possibility is to use different objects/classes altogether for the computation of the dendrogram and the object that is returned to the user. That is, compute would use some kind of temporary Structure objects (say TempStructure), and then at the end returns a Dendrogram with actual Structures, which have the property that they are immutable. The two structure classes would not share much in fact, because the final one wouldn't have e.g. _add_pixel, etc. and the initial one wouldn't have descendents, ancestor, and the caching would become a lot simpler for the final Structure object since structures would then be immutable.

The downside of this is that there is a small overhead at the end of the Dendrogram computation to convert all TempStructure objects to Structure, but if that is much less than the compute time, then maybe this isn't a problem? Again, this might simplify things because we could then assume that the final Dendrogram and structures are immutable.

Another quick question for now - what is the advantage of using a tuple of 1d arrays as opposed to using a 2d array?

I think some of these aspects will become clearer to me once #17 is merged so that the diff becomes more sensible :)

ChrisBeaumont commented 11 years ago

I was thinking the same thing about two Structure classes. That would probably be cleaner, and I like the idea of removing methods like _add_pixel once the dendrogram exists.

Building the index typically only takes a second or two, so it won't be a significant burden on compute (it's basically one or two loops over the structure list).

I used the tuple of 1d arrays mainly just because that is the output format of np.where. Also, I might just be confused, but I don't think you can use a 2D integer array as a fancy index to extract elements from the original data array:

import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

diag1 = (np.array([0, 1, 2]), np.array([0, 1, 2]))
diag2 = np.array([ [0, 0], [1, 1], [2, 2]])
diag3 = np.array([ [0, 1, 2], [0, 1, 2]])

print "Expecting [1 5 9]"
print x[diag1]

print "Expecting [1 5 9]"
print x[diag2]

print "Expecting [1 5 9]"
print x[diag3]

Only the tuple works like I want. Am I missing something, or is there another advantage to a 2D numpy array?

astrofrog commented 11 years ago

@ChrisBeaumont - ah yes you are right about the 2d array, I wasn't thinking straight :)

astrofrog commented 11 years ago

@ChrisBeaumont - I've merged #17 so feel free to rebase this to get rid of the commits I already merged in. You don't technically have to, but it will make the diff easier to read....

ChrisBeaumont commented 11 years ago

Ok, done. I've added a step at the end of compute that builds a tree index, adds this to each structure, and modifies Structure to take advantage of the index.

I've also reworked 2 of the slowest tests that were bugging me (somewhat beyond the scope of this PR, and I can revert this if you object)

We still need to decide whether we should use a separate Structure class when using the tree index. This would have the benefit of freeing up RAM after dendrogram computation, since all of the old Structures with their lists of values/indices can be deleted. The downside would be some duplicated code, and an extra class.

A similar index could be added to tree index which efficiently returns descendant nodes. I'm not sure how big of a speed bottleneck that is, however. We can always add it later if needed

astrofrog commented 11 years ago

@ChrisBeaumont - I'm all for trying to have separate classes - I actually think most of the methods won't be duplicated because a lot are used either only before or only after the computation. Maybe WorkStructure could be the initial one and then Structure the final one.

astrofrog commented 11 years ago

@ChrisBeaumont - just reviewed this, and it's very clever! The Python 3.2 tests are failing because in Python 3, zip doesn't return a list, it returns an iterator. Just use `list(zip(...))`` instead.

Do you want to keep the idea of separating the two structure classes for a separate pull request, or do you want to do it here?

ChrisBeaumont commented 11 years ago

Ok, I think the new commit will fix the broken tests.

I took a first pass at splitting Structure up, and it was a non-trivial modification. Since everything works and the change won't affect the interface, I vote for tackling that in a new PR

ChrisBeaumont commented 11 years ago

@astrofrog I am happy with the state of this. If everything looks ok to you, feel free to merge

astrofrog commented 11 years ago

Yes, it looks good to me - merging!