dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/
Other
37 stars 38 forks source link

Excessive memory usage from using prune #119

Open e-koch opened 10 years ago

e-koch commented 10 years ago

The call to self._index() in dendrogram.prune significantly increases memory usage each time it's called.

Here's a memory profile of dendrogram.prune:

Line #    Mem usage    Increment   Line Contents
================================================
   485    839.9 MiB      0.0 MiB       @profile
   486                                 def prune(self, min_delta=0, min_npix=0, is_independent=None):
   487                                     '''
   488                                     Prune a dendrogram after it has been computed.
   489                             
   490                                     Parameters
   491                                     ----------
   492                                     min_delta : float, optional
   493                                         The minimum height a leaf has to have in order to be considered an
   494                                         independent entity.
   495                                     min_npix : int, optional
   496                                         The minimum number of pixels/values needed for a leaf to be considered
   497                                         an independent entity.
   498                                     is_independent : function or list of functions, optional
   499                                         A custom function that can be specified that will determine if a
   500                                         leaf can be treated as an independent entity. The signature of the
   501                                         function should be ``func(structure, index=None, value=None)``
   502                                         where ``structure`` is the structure under consideration, and
   503                                         ``index`` and ``value`` are optionally the pixel that is causing
   504                                         the structure to be considered for merging into/attaching to the
   505                                         tree.
   506                             
   507                                         If multiple functions are provided as a list, they
   508                                         are all applied when testing for independence.
   509                                     '''
   510                             
   511                                     # If params set to zero, set equal to value form self.params
   512    839.9 MiB      0.0 MiB           if min_delta == 0:
   513                                         min_delta = self.params["min_delta"]
   514    839.9 MiB      0.0 MiB           if min_npix == 0:
   515    839.9 MiB      0.0 MiB               min_npix = self.params["min_npix"]
   517                                     # Check if params are too restrictive.                                                                                                                                                                                                                                  [174/1848]
   518    839.9 MiB      0.0 MiB           if min_delta < self.params["min_delta"]:
   519                                         warnings.warn("New min_delta (%s) is less than the current min_delta \
   520                                                        (%s). No leaves can be pruned." \
   521                                                        % (min_delta, self.params["min_delta"]))
   522                                     else:  # Update params
   523    839.9 MiB      0.0 MiB               self.params["min_delta"] = min_delta
   524                             
   525    839.9 MiB      0.0 MiB           if min_npix < self.params["min_npix"]:
   526                                         warnings.warn("New min_npix (%s) is less than the current min_npix \
   527                                                        (%s). No leaves can be pruned." \
   528                                                        % (min_npix, self.params["min_npix"]))
   529                                     else:  # Updates params
   530    839.9 MiB      0.0 MiB               self.params["min_npix"] = min_npix
   531                             
   532    839.9 MiB      0.0 MiB           tests = [pruning.min_delta(min_delta),
   533    839.9 MiB      0.0 MiB                    pruning.min_npix(min_npix)]
   534    839.9 MiB      0.0 MiB           if is_independent is not None:
   535                                         if isinstance(is_independent, Iterable):
   536                                             tests.extend(is_independent)
   537                                         else:
   538                                             tests.append(is_independent)
   539    839.9 MiB      0.0 MiB           is_independent = pruning.all_true(tests)
   540                             
   541    839.9 MiB      0.0 MiB           keep_structures = self._structures_dict.copy()
   542                             
   543                                     # Continue until there are no more leaves to prune.
   544    840.0 MiB      0.0 MiB           for struct in _to_prune(self, keep_structures, is_independent):
   545                                             # merge struct
   546    840.0 MiB      0.0 MiB                   parent = struct.parent
   547    840.0 MiB      0.0 MiB                   siblings = parent.children
   548                                             # If leaf has one other sibling, merge both into the parent
   549    840.0 MiB      0.0 MiB                   if len(siblings) == 2:
   550    840.0 MiB      0.0 MiB                       merge = copy.copy(siblings)
   551                             
   552                                             # If leaf has multiple siblings, merge leaf into parent
   553    839.9 MiB     -0.0 MiB                   elif len(siblings) > 2:
   554    839.9 MiB      0.0 MiB                       merge = [struct]
   555                             
   556                                             # Merge structures into the parent
   557    840.0 MiB      0.0 MiB                   for m in merge:
   558    840.0 MiB      0.0 MiB                       _merge_with_parent(m, self.index_map)
   559                             
   560                                                 # Remove this structure
   561    840.0 MiB      0.0 MiB                       del keep_structures[m.idx]
   562                             
   563                                     # Create trunk from objects with no ancestors
   564    840.0 MiB      0.0 MiB           _make_trunk(self, keep_structures, is_independent)
   565                             
   566                                     # Save a list of all structures accessible by ID
   567    840.0 MiB      0.0 MiB           self._structures_dict = keep_structures
   568                             
   569   1237.6 MiB    397.7 MiB           self._index()  # XXX check if this is OK with non-packed idx values
   570                             
   571   1237.6 MiB      0.0 MiB           return self

And for self._index:

Line #    Mem usage    Increment   Line Contents
================================================
   320    840.0 MiB      0.0 MiB       @profile
   321                                 def _index(self):
   322                                     # add dendrogram index
   323   1237.6 MiB    397.7 MiB           ti = TreeIndex(self)
   324                             
   325   1237.6 MiB      0.0 MiB           for s in six.itervalues(self._structures_dict):
   326   1237.6 MiB      0.0 MiB               s._tree_index = ti
keflavich commented 3 months ago

I'm following this up a bit. The problem occurs in compute, not (just) prune.

Data doublings (not duplicates of the original data, but true doublings) occur at these lines in compute:

for i in np.argsort(data_values)[::-1]:

and

self._index()

These lines make 1x extra copy, or a little more, in TreeIndex.__init__:

uniq, bins = np.unique(index_map, return_inverse=True)
self._index = tuple(n.ravel()[index] for n in
                    np.indices(index_map.shape))

TreeIndex makes a couple data copies that can be small if the number of objects is small, but can be huge otherwise. The copies of uniq and bins don't get garbage collected because they are stored in TreeIndex.packed

my mwe:

from memory_profiler import profile

import numpy as np
from astrodendro import Dendrogram

@profile
def main():
    array = np.random.randn(150,251,152)
    d = Dendrogram.compute(array, verbose=True, min_value=1.5, min_delta=0.01, min_npix=3)
    return d, array

d, array = main()

I'm not sure there's a workaround other than lazy computation of some of the indices. Maybe the dtypes can be forced to be lower-memory dtypes? But otherwise this may just be a documentation issue, and we should advise users that a lot of memory will be required if the number of independent structures is large.