Profile the imaging module

acviana commented 11 years ago

I ran a profile like this:

python -m cProfile -o 'mtpipeline_profile_1.pstats' ../Code/imaging/mtpipeline.py -filelist '/Users/viana/mtpipeline/archive/05215_mars/u2mi0a01t_c0m.fits' -no_cr_reject -no_astrodrizzle -reproc

This calls mtpipeline.py but turns off everything but the imaging module and only uses the linear mode. However it will run 4 times; one for each drizzling mode and once for each cosmic ray rejection mode.

acviana commented 11 years ago

Then we dive in with some IPython:

import pstats
p = pstats.Stats('mtpipeline_profile_1.pstats')
p.sort_stats('cumulative').print_stats(10)

Which returns:

Thu May 30 16:38:36 2013    mtpipeline_profile_1.pstats

         13439832 function calls (13183766 primitive calls) in 76.516 seconds

   Ordered by: cumulative time
   List reduced from 3822 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.022    0.022   76.519   76.519 ../Code/imaging/mtpipeline.py:8(<module>)
        1    0.012    0.012   72.075   72.075 ../Code/imaging/mtpipeline.py:87(run_mtpipeline)
        4    0.004    0.001   72.058   18.014 ../Code/imaging/run_trim.py:351(run_trim)
        4    0.001    0.000   67.516   16.879 ../Code/imaging/run_trim.py:307(saturated_clip)
        4    5.539    1.385   67.515   16.879 ../Code/imaging/run_trim.py:181(replace_by_weight)
        4    0.105    0.026   31.340    7.835 ../Code/imaging/display_tools.py:22(before_after)
        8    0.000    0.000   17.789    2.224 /usr/stsci/pyssgx/Python-2.7.3/lib/python2.7/site-packages/matplotlib/backends/backend_agg.py:394(draw)
   9448/8    0.083    0.000   17.775    2.222 /usr/stsci/pyssgx/Python-2.7.3/lib/python2.7/site-packages/matplotlib/artist.py:53(draw_wrapper)
        8    0.002    0.000   17.775    2.222 /usr/stsci/pyssgx/Python-2.7.3/lib/python2.7/site-packages/matplotlib/figure.py:815(draw)
       48    0.014    0.000   17.757    0.370 /usr/stsci/pyssgx/Python-2.7.3/lib/python2.7/site-packages/matplotlib/axes.py:1866(draw)

And just for reference here are the column definitions from the Python Docs:

ncalls: for the number of calls,
tottime: for the total time spent in the given function (and excluding time made in calls to sub-functions)
percall: is the quotient of tottime divided by ncalls
cumtime: is the cumulative time spent in this and all subfunctions (from invocation till exit). This figure is accurate even for recursive functions.
percall: is the quotient of cumtime divided by primitive calls
filename:lineno(function) provides the respective data of each function

acviana commented 11 years ago

This picture makes sense, but sorting by cumtime includes the subfunctions, it would be better to sort by tottime to see which individual functions are taking the most time:

In [6]: p.sort_stats('time').print_stats(10)

Which gives:

Thu May 30 16:38:36 2013    mtpipeline_profile_1.pstats

         13439832 function calls (13183766 primitive calls) in 76.516 seconds

   Ordered by: internal time
   List reduced from 3822 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   275484    7.824    0.000    7.824    0.000 {method 'mean' of 'numpy.ndarray' objects}
   275364    6.709    0.000   13.660    0.000 ../Code/imaging/run_trim.py:211(subarray)
        4    5.539    1.385   67.515   16.879 ../Code/imaging/run_trim.py:181(replace_by_weight)
   275577    4.258    0.000    4.258    0.000 {method 'sort' of 'numpy.ndarray' objects}
   275340    3.564    0.000   16.451    0.000 /usr/stsci/pyssgx/Python-2.7.3/lib/python2.7/site-packages/numpy/lib/function_base.py:2861(median)
   413103    3.368    0.000    3.368    0.000 {numpy.core.multiarray.array}
  1109645    3.112    0.000    3.112    0.000 {max}
  1111851    3.057    0.000    3.057    0.000 {min}
       28    2.009    0.072    2.009    0.072 {built-in method putdata}
   280434    1.499    0.000    1.499    0.000 {method 'flatten' of 'numpy.ndarray' objects}

acviana commented 11 years ago

Based on this it seems like the replace_by_weight function is the bottleneck.

acviana commented 11 years ago

Here is the current code:

def replace_by_weight(input_array, weight_array, output=False):
    '''
    Replace all the pixels that have a weight value of 0 with the local
    3x3 median. A copy of the image is created so that the values of
    the replaced pixels don't affect the medians of other pixels as 
    they are replaced. The subarray function is used to generate the 
    subarrays to prevent error at the array edge.

    The indices are set keeping in mind that a[0:2,0:2] returns a 2x2 
    array while a[0:3,0:3] returns a 3x3 array. Both arrays are 
    centered on (1,1).
    '''
    assert isinstance(input_array, N.ndarray), 'array must be numpy array'
    assert isinstance(weight_array, N.ndarray), 'array must be numpy array'
    output_array = copy.copy(input_array)
    saturated_indices = N.where(weight_array == 0)
    for index in zip(saturated_indices[0], saturated_indices[1]):
         output_array[index] = N.median(subarray(input_array, index[0] - 1, \
            index[0] + 2, index[1] - 1, index[1] + 2))
    if output != False:
        before_after(before_array = input_array, 
            after_array = output_array, 
            before_array_name = 'Input Data',
            after_array_name = '0-Weight Corrected',
            output = output,
            pause = False)  
    return output_array

embray commented 11 years ago

What does the subarray function look like?

embray commented 11 years ago

Would you be able to do this with a scipy.ndimage.filters.generic_filter?

acviana commented 11 years ago

@iguananaut, I'll look at scipy.ndimage.filters.generic_filter, I haven't used that before. Here is the subarray function:

def subarray(array, xmin, xmax, ymin, ymax):
    '''
    Returns a subarray.
    '''
    assert isinstance(array, N.ndarray), 'array must be numpy array'
    assert isinstance(xmin, int), 'xmin in subarray must be an int.'
    assert isinstance(xmax, int), 'xmax in subarray must be an int.'
    assert isinstance(ymin, int), 'ymin in subarray must be an int.'
    assert isinstance(ymax, int), 'ymax in subarray must be an int.'    
    assert xmin < xmax, 'xmin must be stictly less than xmax.'
    assert ymin < ymax, 'ymin must be stictly less than ymax.'
    output_array = array[max(xmin,0):min(xmax,array.shape[0]), \
        max(ymin,0):min(ymax,array.shape[1])]
    assert output_array.shape[0] == min(xmax,array.shape[0]) - max(xmin,0), \
        'Output shape is unexpected: ' + str(min(xmax,array.shape[0]) - max(xmin,0))
    assert output_array.shape[1] == min(ymax,array.shape[1]) - max(ymin,0), \
        'Output shape is unexpected: ' + str(min(ymax,array.shape[1]) - max(ymin,0))
    return output_array

embray commented 11 years ago

One thing about all those asserts, they should really probably be raising TypeError instead. And actually most of the type checks are a bit excessive...I would drop those and just leave the sanity checks (like xmin < ymin)

embray commented 11 years ago

Also I would use np.clip() to implement this.

STScI-Citizen-Science / MTPipeline

Profile the imaging module #48