MathEXLab / PySPOD

A Python package for spectral proper orthogonal decomposition (SPOD).
https://mathexlab.github.io/PySPOD/
MIT License

Help verifying performance claims #11

Closed · jdmoorman closed this issue 3 years ago

jdmoorman commented 3 years ago

I am having trouble verifying that the low RAM and streaming variants of SPOD use less RAM than the low storage variant. I am generating synthetic data as in methods_comparison_file.py:

import numpy as np
import xarray as xr

# Let's create some 2D synthetic data
# and store them into a variable called p
variables = ['p']
x1 = np.linspace(0,10,100)
x2 = np.linspace(0, 5, 50)
xx1, xx2 = np.meshgrid(x1, x2)
t = np.linspace(0, 200, 1000)
s_component = np.sin(xx1 * xx2) + np.cos(xx1)**2 + np.sin(0.1*xx2)
# s_component = s_component.T
t_component = np.sin(0.1 * t)**2 + np.cos(t) * np.sin(0.5*t)
p = np.empty((t_component.shape[0],)+s_component.shape)
for i, t_c in enumerate(t_component):
    p[i] = s_component * t_c

# We now save the data into netCDF format.
# Note: p has shape (1000, 50, 100), so the coordinate array of length 50 (x2)
# is attached to the dimension named "x1", and the one of length 100 (x1) to "x2".
ds = xr.Dataset(
        {"p": (("time", "x1", "x2"), p)},
        coords={
            "x1": x2,
            "x2": x1,
            "time": t,
        },
    )
ds.to_netcdf("data.nc")
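
As a quick sanity check of the file layout (illustrative snippet, not part of the original scripts; note that the coordinate of length 50 ends up on the dimension named x1):

ds_check = xr.open_dataset("data.nc")
print(ds_check["p"].dims)   # expected: ('time', 'x1', 'x2')
print(ds_check["p"].shape)  # expected: (1000, 50, 100)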

In a separate script, I make a data handler

import os
import numpy as np
import xarray as xr

from pyspod.spod_streaming import SPOD_streaming  # module path of the PySPOD version used here

CWD = os.getcwd()

variables = ['p']
ds = xr.open_dataset("data.nc")

# Reader for netCDF: returns a block of snapshots with shape
# (n_snapshots, 50, 100, n_variables)
def read_data_netCDF(data, t_0, t_end, variables):
    if t_0 == t_end: ti = [t_0]
    else           : ti = np.arange(t_0, t_end)
    X = np.empty([len(ti), 50, 100, len(variables)])
    for i, var in enumerate(variables):
        X[..., i] = np.array(ds[var].isel(time=ti))
    return X
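
For reference, this is how I would exercise the handler on its own (illustrative call only, not part of the original script); each returned block carries a trailing variable axis:

# Illustrative only: fetch the first 10 snapshots and check the block shape
X_block = read_data_netCDF("data.nc", t_0=0, t_end=10, variables=variables)
print(X_block.shape)  # expected: (10, 50, 100, 1)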

define the parameters

# -- required parameters
params = dict()
params['dt'          ] = 1                  # data time-sampling
params['nt'          ] = 1000               # number of time snapshots (we consider all data)
params['xdim'        ] = 2                  # number of spatial dimensions (x1 and x2)
params['nv'          ] = len(variables)     # number of variables
params['n_FFT'       ] = 100                # length of FFT blocks (100 time-snapshots)
params['n_freq'      ] = params['n_FFT'] / 2 + 1            # number of frequencies
params['n_overlap'   ] = np.ceil(params['n_FFT'] * 0 / 100) # dimension of the block overlap region
params['mean'        ] = 'blockwise'                        # type of mean to subtract from the data
params['normalize'   ] = False                              # normalization of weights by data variance
params['savedir'     ] = os.path.join(CWD, 'results', 'simple_test') # folder where to save results

# -- optional parameters
params['weights'     ] = None   # if None, no weighting is applied (default is None)
params['savefreqs'   ] = np.arange(0, params['n_freq'])  # frequencies to be saved
params['n_modes_save'] = 3      # number of modes to be saved
params['normvar'     ] = False  # normalize data by data variance
params['conf_level'  ] = 0.95   # calculate confidence level
params['savefft'     ] = True   # save FFT blocks to reuse them in the future (saves time)
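
As a side note, continuing from the parameters above, a rough Welch-style block count (my own back-of-the-envelope estimate, not taken from the PySPOD source) suggests the data is split into 10 non-overlapping FFT blocks:

# Approximate number of FFT blocks: floor((nt - n_overlap) / (n_FFT - n_overlap))
n_blocks = int(np.floor((params['nt'] - params['n_overlap'])
                        / (params['n_FFT'] - params['n_overlap'])))
print(n_blocks)  # 10 blocks of 100 snapshots each (nt=1000, n_FFT=100, no overlap)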

and run the algorithm e.g.

# Finally, we can try the streaming algorithm
spod_st = SPOD_streaming(
    X=os.path.join(CWD,'data.nc'),
    params=params,
    data_handler=read_data_netCDF,
    variables=variables)
spod_st.fit()
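
For completeness, the low storage and low RAM runs use the same call pattern, only with the corresponding classes (sketch assuming the SPOD_low_storage and SPOD_low_ram classes and module paths of the same PySPOD version):

from pyspod.spod_low_storage import SPOD_low_storage
from pyspod.spod_low_ram import SPOD_low_ram

# Sketch only: identical parameters and data handler, different SPOD variants
spod_ls = SPOD_low_storage(
    X=os.path.join(CWD, 'data.nc'),
    params=params,
    data_handler=read_data_netCDF,
    variables=variables)
spod_ls.fit()

spod_lr = SPOD_low_ram(
    X=os.path.join(CWD, 'data.nc'),
    params=params,
    data_handler=read_data_netCDF,
    variables=variables)
spod_lr.fit()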

Using the Python package memory-profiler, I get the plots of memory usage over time for each algorithm shown below. From the plots, the difference between the low RAM and low storage variants does not appear significant, and the streaming variant actually requires more RAM than the low storage variant.

[Memory-usage plots: Low RAM (file_low_ram), Low Storage (file_low_storage), Streaming (file_streaming)]
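
For reference, the memory traces were obtained by sampling resident memory over time with memory-profiler; a minimal sketch of one way to do that from Python (the exact invocation may differ from what I actually used, and the mprof run / mprof plot command-line tools from the same package work as well):

from memory_profiler import memory_usage

# Illustrative sketch: sample resident memory (in MiB) every 0.5 s while fit() runs
usage = memory_usage((spod_st.fit, (), {}), interval=0.5)
print('peak memory [MiB]:', max(usage))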

jdmoorman commented 3 years ago

openjournals/joss-reviews#2862

mengaldo commented 3 years ago

@jdmoorman thank you for noticing this behavior! I will get back to you on this in the next few days.

mengaldo commented 3 years ago

@jdmoorman Thank you for this one. I took a closer look, and I believe the negligible difference in memory between low_storage and low_ram is due to the fact that, when you run the two one after the other, low_storage reuses the precomputed FFT blocks. Running the same test as yours, but with a spatial dimension one order of magnitude larger, gives me:

Regarding the streaming version, running the larger problem above does indeed give a larger memory use. I will reflect that in the manuscript and documentation and remove the claim.

jdmoorman commented 3 years ago

If the streaming version of the algorithm has a larger memory footprint than the other versions, is there any situation where the streaming version should be used?

mengaldo commented 3 years ago

@jdmoorman Yes, since its memory footprint is larger, it should probably only live in a development branch; I can remove it from the paper. It is still instructive to keep, though, as it is a different implementation. Let me know.