SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Potential speed-up for creating spot column in atl03sp #388

Closed alma-pi closed 2 months ago

alma-pi commented 2 months ago

In sliderule/clients/python/sliderule/icesat2.py, the 'spot' column is calculated row by row using GeoDataFrame.apply (with axis=1) and the __calcspot function. Building one key per row and using pandas.Series.map instead should be much faster.

atl03['spot'] = atl03.apply(lambda row: sliderule.icesat2.__calcspot(row["sc_orient"], row["track"], row["pair"]), axis=1)

For a granule of about 2 million photons, the apply call above takes about 19 s. Using a dictionary and pandas.Series.map takes less than 2 s:

# Create dictionary mapping (sc_orient, track, pair) to spot
map_spot = {(0,1,0): 1,
            (0,1,1): 2,
            (0,2,0): 3,
            (0,2,1): 4,
            (0,3,0): 5,
            (0,3,1): 6,
            (1,1,0): 6,
            (1,1,1): 5,
            (1,2,0): 4,
            (1,2,1): 3,
            (1,3,0): 2,
            (1,3,1): 1,}

tmp = pd.Series(zip(atl03['sc_orient'], atl03['track'], atl03['pair']))
atl03['spot'] = tmp.map(map_spot).values
del tmp
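
To make the comparison easy to reproduce without a real granule, here is a minimal, self-contained sketch. The DataFrame is synthetic (random values, not ATL03 data), and calcspot_rowwise is a hypothetical stand-in for the client's per-row function, not the actual __calcspot. It only checks that both approaches give the same column and prints rough timings, which will vary by machine.

```python
import time

import numpy as np
import pandas as pd

# (sc_orient, track, pair) -> spot, copied from the table above
MAP_SPOT = {
    (0, 1, 0): 1, (0, 1, 1): 2, (0, 2, 0): 3,
    (0, 2, 1): 4, (0, 3, 0): 5, (0, 3, 1): 6,
    (1, 1, 0): 6, (1, 1, 1): 5, (1, 2, 0): 4,
    (1, 2, 1): 3, (1, 3, 0): 2, (1, 3, 1): 1,
}


def calcspot_rowwise(sc_orient, track, pair):
    """Hypothetical stand-in for the client's per-row spot function."""
    return MAP_SPOT[(sc_orient, track, pair)]


# Synthetic stand-in for a photon table (made-up values, assumption)
rng = np.random.default_rng(0)
n = 20_000
atl03 = pd.DataFrame({
    "sc_orient": rng.integers(0, 2, n),  # 0 or 1
    "track": rng.integers(1, 4, n),      # 1, 2 or 3
    "pair": rng.integers(0, 2, n),       # 0 or 1
})

# Row-wise apply: one Python-level function call per row
t0 = time.perf_counter()
spot_apply = atl03.apply(
    lambda row: calcspot_rowwise(row["sc_orient"], row["track"], row["pair"]),
    axis=1,
)
t_apply = time.perf_counter() - t0

# Dictionary lookup through Series.map on a Series of tuples
t0 = time.perf_counter()
keys = pd.Series(list(zip(atl03["sc_orient"], atl03["track"], atl03["pair"])))
spot_map = keys.map(MAP_SPOT).values
t_map = time.perf_counter() - t0

# Both approaches should produce the same spot numbers
assert (spot_apply.to_numpy() == spot_map).all()
print(f"apply: {t_apply:.3f} s, map: {t_map:.3f} s")
```

Taking .values (as in the snippet above) sidesteps index alignment, since the temporary Series has a default integer index that need not match the index of atl03.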

pandas.Series.map returns NaN for any key that is missing from the dictionary.
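
A small sketch of what that looks like in practice, using a shortened table and a deliberately invalid key (both made up for illustration). Because NaN is a float, a single missing key turns the whole column into float64; the two options shown for getting integers back are suggestions, not what the client does.

```python
import pandas as pd

# Shortened table, for illustration only
map_spot = {(0, 1, 0): 1, (0, 1, 1): 2}

# The last key is intentionally not in the table
keys = pd.Series([(0, 1, 0), (0, 1, 1), (9, 9, 9)])
spot = keys.map(map_spot)

# The missing key maps to NaN, which forces a float dtype
n_missing = int(spot.isna().sum())
print(spot.dtype, n_missing)

# Option 1: keep the gap explicit with pandas' nullable integer dtype
spot_nullable = spot.astype("Int64")

# Option 2: replace gaps with a sentinel (0 here is an assumption)
spot_filled = spot.fillna(0).astype("int64")
```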
It's also possible to change the current function to accept a tuple as input, along the lines of:

def __calcspot(input_tuple):
    sc_orient, track, pair = input_tuple
    [...]

tmp = pd.Series(zip(atl03['sc_orient'], atl03['track'], atl03['pair']))
atl03['spot'] = tmp.map(__calcspot).values
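
For completeness, a runnable sketch of the tuple-based variant. The name calcspot_from_tuple and its body are hypothetical: the arithmetic is derived from the mapping table above rather than taken from the client's __calcspot, and returning 0 for an unrecognized combination is an assumption of this sketch.

```python
import pandas as pd


def calcspot_from_tuple(key):
    """Hypothetical tuple-based spot lookup (not the client's __calcspot)."""
    sc_orient, track, pair = key
    if sc_orient not in (0, 1) or track not in (1, 2, 3) or pair not in (0, 1):
        return 0  # assumption: 0 marks an unrecognized combination
    # Spot number when sc_orient == 0, per the table: 1..6
    forward = 2 * (track - 1) + pair + 1
    # When sc_orient == 1 the table is mirrored: 6..1
    return forward if sc_orient == 0 else 7 - forward


# Tiny made-up table covering both orientations
atl03 = pd.DataFrame({
    "sc_orient": [0, 0, 1, 1],
    "track": [1, 3, 1, 3],
    "pair": [0, 1, 0, 1],
})

keys = pd.Series(list(zip(atl03["sc_orient"], atl03["track"], atl03["pair"])))
atl03["spot"] = keys.map(calcspot_from_tuple).values
print(atl03["spot"].tolist())
```

Unlike the dictionary version, a function can decide for itself what to return for unexpected input, at the cost of one Python call per row.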

jpswinski commented 2 months ago

Thanks @alma-pi! This is a great optimization. I'll update this issue when we get it into the code, and it should be out with the next release.

jpswinski commented 2 months ago

@alma-pi the change you outlined has been made and pushed. It didn't make it into the most recent release, but it will come out with the next one.

After implementing your change, I saw a significant speed-up in atl03 processing calls: requests that took about 60 seconds now take about 35 seconds.