Closed ChrisBeaumont closed 11 years ago
@ChrisBeaumont - just to make sure I understand, would the indices be stored internally as lists of Numpy arrays, and then converted when calling indices
to a tuple of Numpy arrays? Or would the storage be as it is right now, with the risk of having an expensive conversion to a tuple of arrays when calling indices
?
The TreeIndex
stores a tuple (whose length = dimension of input data) of 1D integer arrays (whose length = input_data.size). Each entry in the tuple corresponds to a location along the nth dimension. Each call to TreeIndex.indices involves extracting a contiguous slice form each of these arrays (cheap, since contiguous slices are views), packing those into a tuple (also cheap, since this doesn't copy the array either), and returning
Here's an example, looking at the leafiest
branch in a dendrogram:
from astrodendro import Dendrogram
from astrodendro.dendrogram import TreeIndex
from astropy.io import fits
im = fits.getdata('L1448.13co.un.fits')
d = Dendrogram.compute(im, min_data_value=2, min_npix=100, min_delta=1)
ti = TreeIndex(d)
s = d.trunk[1]
%timeit s.indices_all
%timeit ti.indices(s.idx, subtree=True)
1 loops, best of 3: 116 ms per loop
10000 loops, best of 3: 20.1 µs per loop
@ChrisBeaumont - ah, I see, that looks really handy & fast! In terms of Structure.indices
though, do you envisage Structure itself storing a list of numpy arrays, or a list of tuples?
I think all Structures
would store a (shared) reference to a TreeIndex object, and nothing else. Then indices
, values
, and the like just wrap around TreeIndex. The old index/value lists can even be deleted at that point, which would free up a lot of memory (since a python int is heavier than a numpy int)
The only awkward aspect of this is that the Structure
internals look very different when the dendrogram is computed, since they still need to build up the old index/value lists.
Does that make sense?
To be explicit, code would look something like this
@property
def indices(self):
return self._tree_index.indices(self.idx, subtree=False)
@ChrisBeaumont - ah, I think I see what you are saying - so essentially the idea is that once a dendrogram is generated, it becomes static and after that it is very efficient but immutable?
I agree that it's weird that the internal structure of the Dendrogram would change between before and after the end of the computation, but one possibility is to use different objects/classes altogether for the computation of the dendrogram and the object that is returned to the user. That is, compute
would use some kind of temporary Structure objects (say TempStructure), and then at the end returns a Dendrogram with actual Structures, which have the property that they are immutable. The two structure classes would not share much in fact, because the final one wouldn't have e.g. _add_pixel
, etc. and the initial one wouldn't have descendents
, ancestor
, and the caching would become a lot simpler for the final Structure
object since structures would then be immutable.
The downside of this is that there is a small overhead at the end of the Dendrogram computation to convert all TempStructure
objects to Structure
, but if that is much less than the compute time, then maybe this isn't a problem? Again, this might simplify things because we could then assume that the final Dendrogram and structures are immutable.
Another quick question for now - what is the advantage of using a tuple of 1d arrays as opposed to using a 2d array?
I think some of these aspects will become clearer to me once #17 is merged so that the diff becomes more sensible :)
I was thinking the same thing about two Structure classes. That would probably be cleaner, and I like the idea of removing methods like _add_pixel
once the dendrogram exists.
Building the index typically only takes a second or two, so it won't be a significant burden on compute
(it's basically one or two loops over the structure list).
I used the tuple of 1d arrays mainly just because that is the output format of np.where
. Also, I might just be confused, but I don't think you can use a 2D integer array as a fancy index to extract elements from the original data array:
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
diag1 = (np.array([0, 1, 2]), np.array([0, 1, 2]))
diag2 = np.array([ [0, 0], [1, 1], [2, 2]])
diag3 = np.array([ [0, 1, 2], [0, 1, 2]])
print "Expecting [1 5 9]"
print x[diag1]
print "Expecting [1 5 9]"
print x[diag2]
print "Expecting [1 5 9]"
print x[diag3]
Only the tuple works like I want. Am I missing something, or is there another advantage to a 2D numpy array?
@ChrisBeaumont - ah yes you are right about the 2d array, I wasn't thinking straight :)
@ChrisBeaumont - I've merged #17 so feel free to rebase this to get rid of the commits I already merged in. You don't technically have to, but it will make the diff easier to read....
Ok, done. I've added a step at the end of compute that builds a tree index, adds this to each structure, and modifies Structure to take advantage of the index.
I've also reworked 2 of the slowest tests that were bugging me (somewhat beyond the scope of this PR, and I can revert this if you object)
We still need to decide whether we should use a separate Structure class when using the tree index. This would have the benefit of freeing up RAM after dendrogram computation, since all of the old Structures with their lists of values/indices can be deleted. The downside would be some duplicated code, and an extra class.
A similar index could be added to tree index which efficiently returns descendant nodes. I'm not sure how big of a speed bottleneck that is, however. We can always add it later if needed
@ChrisBeaumont - I'm all for trying to have separate classes - I actually think most of the methods won't be duplicated because a lot are used either only before or only after the computation. Maybe WorkStructure
could be the initial one and then Structure
the final one.
@ChrisBeaumont - just reviewed this, and it's very clever! The Python 3.2 tests are failing because in Python 3, zip doesn't return a list, it returns an iterator. Just use `list(zip(...))`` instead.
Do you want to keep the idea of separating the two structure classes for a separate pull request, or do you want to do it here?
Ok, I think the new commit will fix the broken tests.
I took a first pass at splitting Structure up, and it was a non-trivial modification. Since everything works and the change won't affect the interface, I vote for tackling that in a new PR
@astrofrog I am happy with the state of this. If everything looks ok to you, feel free to merge
Yes, it looks good to me - merging!
This is not ready to merge, and should also wait for #17 (which this PR is rebased against)
I've added a
TreeIndex
class for to more efficiently extract the locations of each structure.At the moment, the usage is as follows:
inds
is a tuple of numpy arrays, similar to the output of np.where. They can be used as a fancy index todata_array.
TreeIndex
stores a flattened and permuted version ofnp.indices(data_array.shape)
, such that every structure (with or without subtrees) corresponds to a single contiguous slice of this array. This means that its space requirement is ~len(data_array.shape) * data_array.size. Theindices
method is more or less a single array slice, so it should be fast.@astrofrog, I'd like to change the return value of
Structure.indices
andStructure.indices_all
to return a tuple of numpy arrays, instead of a list of tuples. Then, at the end ofDendrogram.compute
, I'd like to compute a TreeIndex, and store this index with each structure. This will let us use theTreeIndex
to trivially implementStructure.indices
, without recursion and in O(1) time.Are you ok with this interface change to
Structure.indices
?I've also trimmed dendrogram.index_map, as discussed in #18