Open e-koch opened 10 years ago
I'm following this up a bit. The problem occurs in compute, not (just) prune.
Data doublings (not duplicates of the original data, but true doublings) occur at these lines in compute:
for i in np.argsort(data_values)[::-1]:
and
self._index()
These lines in TreeIndex.__init__ make one extra copy, or a little more:
uniq, bins = np.unique(index_map, return_inverse=True)
self._index = tuple(n.ravel()[index] for n in
                    np.indices(index_map.shape))
TreeIndex makes a couple of data copies that can be small if the number of objects is small, but can be huge otherwise. The copies of uniq and bins don't get garbage collected because they are stored in TreeIndex.packed.
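To get a feel for the sizes involved, here is a rough sketch (not astrodendro code) that measures the temporaries created by the NumPy calls quoted above. The array shape and the fake index_map are assumptions for illustration only, and are much smaller than the cube in the MWE.

```python
import numpy as np

# Small stand-in for the data cube (hypothetical shape).
data = np.random.randn(20, 30, 40)

# np.argsort returns one integer index per element, so the result has
# as many entries as the data itself.
order = np.argsort(data.ravel())[::-1]

# Fake index map (assumption): -1 for "no structure", otherwise a structure id.
index_map = np.random.randint(-1, 50, size=data.shape)

# return_inverse yields another integer array with one entry per pixel.
uniq, bins = np.unique(index_map, return_inverse=True)

# np.indices allocates one full-size integer array per dimension.
grids = np.indices(index_map.shape)

print("data      :", data.nbytes, "bytes")
print("argsort   :", order.nbytes, "bytes")
print("bins      :", bins.nbytes, "bytes")
print("np.indices:", grids.nbytes, "bytes")
```

On a typical 64-bit build, where the default index dtype is 8 bytes wide, the argsort result and bins each match the size of the float64 data, and np.indices adds one such array per dimension.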
My MWE:
from memory_profiler import profile
import numpy as np
from astrodendro import Dendrogram

@profile
def main():
    array = np.random.randn(150, 251, 152)
    d = Dendrogram.compute(array, verbose=True, min_value=1.5, min_delta=0.01, min_npix=3)
    return d, array

d, array = main()
I'm not sure there's a workaround other than lazy computation of some of the indices. Maybe the dtypes can be forced to be lower-memory dtypes? But otherwise this may just be a documentation issue, and we should advise users that a lot of memory will be required if the number of independent structures is large.
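A minimal sketch of the lower-memory dtype idea, using plain NumPy rather than anything from astrodendro. It assumes the number of unique structures fits in a 32-bit integer; the index_map here is made-up test data, and whether TreeIndex could safely use narrower dtypes is untested.

```python
import numpy as np

# Made-up index map (assumption): -1 for "no structure", otherwise a structure id.
index_map = np.random.randint(-1, 50, size=(20, 30, 40))

uniq, bins = np.unique(index_map, return_inverse=True)

# bins only holds values in [0, len(uniq)), so a narrower dtype is enough
# as long as the number of structures stays below the int32 limit.
assert len(uniq) < np.iinfo(np.int32).max
bins_small = bins.astype(np.int32)

# np.indices takes a dtype argument, so the coordinate grids can be
# allocated with the narrower type directly.
grids_small = np.indices(index_map.shape, dtype=np.int32)

print("bins       :", bins.dtype, bins.nbytes, "bytes")
print("bins_small :", bins_small.dtype, bins_small.nbytes, "bytes")
print("grids_small:", grids_small.dtype, grids_small.nbytes, "bytes")
```

Note that astype itself makes a copy, so the saving only materialises once the wider original is released.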
The call to self._index() in dendrogram.prune significantly increases memory usage each time it's called.
Here's a memory profile of dendrogram.prune:
And for self._index: