dendrograms / astrodendro

Generate a dendrogram from a dataset
https://dendrograms.readthedocs.io/

Performance issues in astrodendro.analysis #114

Closed tomr-stargazer closed 10 years ago

tomr-stargazer commented 10 years ago

If anyone has a chance to help me understand the following performance issues I've been running into, I'd be grateful.

On one particular machine, I've been running into issues when calculating ppv_catalog on dendrogrammed data of modest to large size. (In particular, the file I'm running into trouble on is hosted here: http://www.cfa.harvard.edu/rtdc/CO/NumberedRegions/DHT21/DHT21_Taurus_mom.fits and is around 10 MB in size)

In [6]: d, catalog, x, y = perseus_analysis() # perseus_analysis loads the data, computes a dendrogram, and then calls ppv_catalog on the dendrogram using appropriate metadata.
loading data: ...
transposing, downsampling, and unit-converting data: ...
computing dendrogram: ...
Generating dendrogram using 307,278 of 5,296,671 pixels (5% of data)
[========================================>] 100%
# the above dendrogram computation takes 30 - 60 seconds using the following parameters:
#  min_value=0.01
#  min_delta=0.005
#  min_npix=20

# then the code hangs while running ppv_catalog

Based on the tracebacks I'm seeing, the structures being processed are large: the dendrogram has 1029 structures, and for some of those structures struct.indices has over 100,000 items.
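For reference, the relevant part of perseus_analysis boils down to roughly the following (a sketch only; the metadata values here are placeholders, not the actual ones I use):

from astropy.io import fits
import astropy.units as u
from astrodendro import Dendrogram, ppv_catalog

data = fits.getdata('DHT21_Taurus_mom.fits')   # then transposed, downsampled, unit-converted
d = Dendrogram.compute(data, min_value=0.01, min_delta=0.005,
                       min_npix=20, verbose=True)
metadata = {'data_unit': u.K,                  # placeholder values
            'spatial_scale': 0.125 * u.deg,
            'velocity_scale': 0.65 * u.km / u.s}
catalog = ppv_catalog(d, metadata)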

When I hit ctrl-c after about five minutes, the interrupt typically lands on one of the following lines of code (the examples below are illustrative):

# line number one
> /Users/tsrice/Documents/Code/astrodendro/astrodendro/analysis.py(65)mom1()
     64         m0 = self.mom0()
---> 65         return [np.nansum(i * self.values) / m0 for i in self.indices]
     66

# line number two
/Users/tsrice/Documents/Code/astrodendro/astrodendro/analysis.pyc in mom2(self)
     77
     78         for i in range(nd):
---> 79             result[i, i] = np.nansum(v * zyx[i] ** 2)
     80             for j in range(i + 1, nd):
     81                 result[i, j] = result[j, i] = np.nansum(v * zyx[i] * zyx[j])

# line number three
/Users/tsrice/Documents/Code/astrodendro/astrodendro/analysis.pyc in mom2(self)
     79             result[i, i] = np.nansum(v * zyx[i] ** 2)
     80             for j in range(i + 1, nd):
---> 81                 result[i, j] = result[j, i] = np.nansum(v * zyx[i] * zyx[j])
     82
     83         return result

The versions of my python packages:

Python 2.7.6 |Anaconda 2.0.0 (x86_64)| (default, May 27 2014, 14:58:54)
IPython 2.1.0 -- An enhanced Interactive Python.

In [4]: np.__version__
Out[4]: '1.8.1'

In [5]: astropy.__version__
Out[5]: '0.3.2'

Some questions:

  1. Is this datacube too big to reasonably compute a catalog for? (I hope not!)
  2. Are there better ways to profile the performance in situations like this, so I can better understand the code pinch points?
  3. Is it possible or desirable for us to improve the performance of the analysis code?
ChrisBeaumont commented 10 years ago

  1. Is this datacube too big to reasonably compute a catalog for? (I hope not!)

No, this is actually on the small end of dataset sizes

  2. Are there better ways to profile the performance in situations like this, so I can better understand the code pinch points?

One way is to run python -m cProfile -s cumtime script.py. This runs the script normally, and then prints timing information for every function, sorted by the cumulative time spent in each function. It can identify hotspots.
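If you want to poke at the results interactively, you can also dump the stats to a file and load them with pstats (the filename here is arbitrary):

# generate the stats file with:  python -m cProfile -o profile.out script.py
import pstats

p = pstats.Stats('profile.out')
# drop long paths, sort by cumulative time, show the top 20 entries
p.strip_dirs().sort_stats('cumulative').print_stats(20)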

Another useful tool is line_profiler. With it you can add a @profile decorator on top of specific functions and track timing information line by line.
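A minimal sketch of that (assumes the line_profiler package is installed; the function below is just a stand-in):

# run with:  kernprof -l -v script.py
# kernprof injects `profile` into builtins; the fallback below lets the
# script also run as plain `python script.py`
try:
    profile
except NameError:
    def profile(func):
        return func

import numpy as np

@profile
def mom1_like(values, indices):
    # per-line timings will be reported for each statement in here
    return [np.nansum(i * values) for i in indices]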

  3. Is it possible or desirable for us to improve the performance of the analysis code?

Yes, something is fishy here. Using your dendrogram settings, my computer builds the catalog in 90 seconds. (Are you running out of memory? My Python process takes up about 300 MB during the computation.)

Here are the most expensive calls on my machine

         26989682 function calls (26720625 primitive calls) in 125.400 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.007    0.007  125.402  125.402 test.py:1(<module>)
        1    0.000    0.000   92.546   92.546 analysis.py:628(ppv_catalog)
        1    0.082    0.082   92.546   92.546 analysis.py:578(_make_catalog)
574512/461322    6.579    0.000   86.173    0.000 {getattr}
    28812    0.627    0.000   53.111    0.002 analysis.py:84(mom2_along)
    28812   10.195    0.000   51.202    0.002 analysis.py:66(mom2)
    12348    0.013    0.000   42.839    0.003 analysis.py:593(<genexpr>)
   343686    1.438    0.000   36.635    0.000 nanfunctions.py:424(nansum)
        1   17.376   17.376   31.633   31.633 dendrogram.py:88(compute)
    14406    0.130    0.000   28.585    0.002 analysis.py:367(_sky_paxes)
    14406    0.178    0.000   28.419    0.002 analysis.py:128(projected_paxes)
     6174    0.074    0.000   23.780    0.004 analysis.py:283(major_sigma)
     6174    0.074    0.000   23.745    0.004 analysis.py:296(minor_sigma)
    34986    4.457    0.000   19.705    0.001 analysis.py:61(mom1)
   343686    8.021    0.000   19.189    0.000 nanfunctions.py:29(_replace_nan)
     2058    8.431    0.004   17.791    0.009 analysis.py:444(area_exact)
     2058    0.023    0.000   16.218    0.008 analysis.py:318(area_ellipse)
     2058    0.025    0.000   16.193    0.008 analysis.py:309(radius)
  1107952   12.070    0.000   12.070    0.000 {method 'reduce' of 'numpy.ufunc' objects}

So maybe mom2 could be further optimized. There are also libraries like Bottleneck that could speed up the NaN arithmetic by about a factor of 2 -- @keflavich has done this for the radio-cube project. But it seems like something else is going wrong with your setup right now...?
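To give a rough sense of what Bottleneck buys (illustrative only, assuming the bottleneck package is installed):

import numpy as np
import bottleneck as bn

x = np.random.randn(1000000)
x[::7] = np.nan

# same answer, but bn.nansum skips the temporary NaN-replaced copy
# that numpy's nansum builds (the _replace_nan calls in the profile)
print(np.nansum(x), bn.nansum(x))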

keflavich commented 10 years ago

Just eyeballing the input parameters... isn't the min_delta a bit small? Also, that cube loads funny in ds9 - is there something wrong with the cube? It looks like it has rows of blanks everywhere.

ChrisBeaumont commented 10 years ago

It's telling that mom2 gets called 28K times, even though the final catalog only has ~1K structures -- that suggests memoizing would speed things up. I'll open a PR on this.
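Not the actual change, but the idea is roughly this: cache an expensive per-structure statistic so repeated accesses reuse it (toy sketch, hypothetical names):

def memoize(func):
    cache = {}
    def wrapper(self):
        # a real implementation would store the cache on the instance
        # to avoid keeping objects alive; a dict keyed by self is enough here
        if self not in cache:
            cache[self] = func(self)
        return cache[self]
    return wrapper

class Stats(object):
    def __init__(self, values):
        self.values = values

    @memoize
    def mom2(self):
        print('computing mom2')        # runs once per instance
        return sum(v * v for v in self.values)

s = Stats([1.0, 2.0, 3.0])
s.mom2()   # computes
s.mom2()   # served from the cache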

tomr-stargazer commented 10 years ago

@keflavich the dimensions of the cube seem to be in an unusual order, and the datacube isn't sampled uniformly, which might be what you're seeing. From the FITS header:

CTYPE1  = 'VELO-LSR'            / axis 1 coord type
CTYPE2  = 'GLON-CAR'            / axis 2 coord type
CTYPE3  = 'GLAT-CAR'            / axis 3 coord type
COMMENT Axis units: v(km/s) x LII(deg) x BII(deg)

In all my analysis code I transpose the cube and header before doing much else.
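Roughly like this (a simplified sketch; the real code also updates the remaining WCS keywords):

from astropy.io import fits

data, header = fits.getdata('DHT21_Taurus_mom.fits', header=True)
data = data.T   # reverse the numpy axis order: (b, l, v) -> (v, l, b)

# reversing the array axes means FITS axes 1 and 3 swap roles, so the
# matching header keywords swap too (CTYPE shown; CRVAL, CRPIX, CDELT,
# etc. need the same treatment)
header['CTYPE1'], header['CTYPE3'] = header['CTYPE3'], header['CTYPE1']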

keflavich commented 10 years ago

@tomr-stargazer - actually, everything is fine, just the first frame (fits.getdata('DHT21_Taurus_mom.fits')[0,:,:]) is missing every other line. It's probably necessary to mask out the regions that are chunked up like that, though, no?

ChrisBeaumont commented 10 years ago

#115 incorporates an easy optimization. At least on my machine, the speed of make_catalog now seems sensible for numpy operations (i.e. no obviously expensive Python loops; it finishes in ~30 sec). It'd be nice to get to the bottom of why @tomr-stargazer's computer needs >5 minutes to compute things, to see whether this is a problem with the code or the computer (e.g. not enough memory). For reference, my timing script is at https://gist.github.com/ChrisBeaumont/81d6c2b64f403a207e1b

and python -m cProfile -s cumtime test.py takes 68 seconds and prints:

         18296810 function calls (18032900 primitive calls) in 68.825 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.003    0.003   68.826   68.826 test.py:1(<module>)
        1    0.000    0.000   36.020   36.020 analysis.py:667(ppv_catalog)
        1    0.078    0.078   36.020   36.020 analysis.py:617(_make_catalog)
        1   17.172   17.172   31.268   31.268 dendrogram.py:89(compute)
561170/461357    6.429    0.000   29.670    0.000 {getattr}
     2058    8.370    0.004   17.697    0.009 analysis.py:482(area_exact)
    12348    0.010    0.000   13.380    0.001 analysis.py:632(<genexpr>)
    39764    9.223    0.000    9.223    0.000 {zip}
   307278    5.397    0.000    5.397    0.000 dendrogram.py:325(neighbours)
     1073    0.111    0.000    5.017    0.005 table.py:1253(_rebuild_table_column_views)
     1029    0.027    0.000    4.961    0.005 table.py:2297(add_row)
    26328    0.039    0.000    3.988    0.000 table.py:64(wrapper)
    26328    0.170    0.000    3.946    0.000 table.py:598(__new__)
209136/104816    0.632    0.000    3.687    0.000 copy.py:145(deepcopy)
     6174    0.061    0.000    2.720    0.000 analysis.py:318(major_sigma)
    13146    0.087    0.000    2.555    0.000 table.py:654(copy)
45276/37044    0.118    0.000    2.513    0.000 analysis.py:29(wrapper)
    14406    0.107    0.000    2.384    0.000 analysis.py:404(_sky_paxes)
82993/42011    0.150    0.000    2.293    0.000 structure.py:161(values)
     4116    0.079    0.000    2.242    0.001 analysis.py:112(mom2_along)

tomr-stargazer commented 10 years ago

Thanks for the performance upgrade @ChrisBeaumont!

As I mentioned, this is a different machine than my usual one, and I think I've traced most of the problem to a pathological interaction between RAM and hard-drive space. Even though the machine has 8 GB of RAM, the hard disk has less than 10 GB of free space, and something was eating that up as a paging area, which slowed everything down. (There's a longer story behind why I'm using this machine; my usual one had a hardware failure early last week, so this is a backup.)

I ran the profiled test code python -m cProfile -s cumtime dendro_test.py > test_output.txt (after pulling the performance upgrade in #115) and got this output:

(link to full output, abridged version below:)

         18296411 function calls (18032534 primitive calls) in 213.280 seconds

   Ordered by: cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.032    0.032  213.283  213.283 dendro_test.py:1(<module>)
        1    0.000    0.000  114.646  114.646 analysis.py:667(ppv_catalog)
        1    0.204    0.204  114.646  114.646 analysis.py:617(_make_catalog)
        1   50.426   50.426   94.392   94.392 dendrogram.py:89(compute)
561173/461360   18.197    0.000   93.848    0.000 {getattr}
     2058   26.986    0.013   56.756    0.028 analysis.py:482(area_exact)
    12348    0.040    0.000   40.533    0.003 analysis.py:632(<genexpr>)
    39759   29.727    0.001   29.727    0.001 {zip}
     1029    0.323    0.000   15.498    0.015 table.py:2297(add_row)
     1073    0.397    0.000   15.267    0.014 table.py:1253(_rebuild_table_column_views)
   307278   14.880    0.000   14.880    0.000 dendrogram.py:325(neighbours)
    26328    0.114    0.000   11.867    0.000 table.py:64(wrapper)
    26328    0.389    0.000   11.746    0.000 table.py:598(__new__)
209136/104816    2.057    0.000   11.519    0.000 copy.py:145(deepcopy)
     6174    0.206    0.000   11.341    0.002 analysis.py:318(major_sigma)
45276/37044    0.399    0.000   10.788    0.000 analysis.py:29(wrapper)
    14406    0.319    0.000   10.413    0.001 analysis.py:404(_sky_paxes)
     1029    0.080    0.000    9.945    0.010 analysis.py:158(projected_paxes)
     4116    0.173    0.000    9.784    0.002 analysis.py:112(mom2_along)
     1029    2.331    0.002    8.816    0.009 analysis.py:93(mom2)
82993/42011    0.628    0.000    7.922    0.000 structure.py:161(values)
    13146    0.198    0.000    7.162    0.001 table.py:654(copy)
    52744    0.264    0.000    6.233    0.000 metadata.py:126(__set__)
    43390    0.507    0.000    5.632    0.000 pruning.py:69(result)
    35999    1.216    0.000    5.488    0.000 structure.py:280(_fill_footprint)
    52160    0.691    0.000    5.390    0.000 copy.py:306(_reconstruct)
    44569    0.094    0.000    5.105    0.000 {all}
   127762    0.308    0.000    5.034    0.000 pruning.py:70(<genexpr>)
     1029    4.896    0.005    5.004    0.005 dendrogram.py:674(values)
...
# many more lines which took less than 5 seconds

also:

➜  81d6c2b64f403a207e1b git:(master) ✗ wc -l test_output.txt
    5772 test_output.txt

but the first 3075 lines of that should be ignored.

ChrisBeaumont commented 10 years ago

OK, this looks sensible for a slower computer; I'm going to close this.