3dgeo-heidelberg / py4dgeo

py4dgeo - A Python library for change analysis in 4D point clouds
https://py4dgeo.readthedocs.io
MIT License

add_epochs for large number of initial epochs #159

Closed: kathapand closed this issue 2 years ago

kathapand commented 2 years ago

I need to add a large number of epochs (> 2,500) to a SpatiotemporalAnalysis object for the extraction of 4D objects-by-change. Adding them one by one takes too much time, due to the rearranging of the space-time array after each addition (known behavior): at the 500th epoch, the rearranging already takes 180 s.
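For reference, this is the one-by-one pattern that triggers the rearranging after every call (a minimal sketch; analysis is the SpatiotemporalAnalysis object and epochs is assumed to be a list of already built py4dgeo.Epoch objects):

# adding epochs one at a time: the space-time array is rearranged
# after every call, which gets slower with each stored epoch
for epoch in epochs:
    analysis.add_epochs(epoch)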

As all epochs are already available (no online acquisition and analysis), I can add them all at once by compiling a list of the generated epochs and passing it to add_epochs:

# loop over epochs and add (except the reference epoch)
# read_corresponding_trafofile and read_las are project-specific helpers
epoch_batch = []
for idx, row in df_timeseries.iterrows():
    fname_cloud = row['fname_timestring']
    if fname_cloud == fname_reference_cloud:
        continue
    # build the path from the current epoch's file name
    subdir_daystring = fname_cloud.split('_')[0]
    f_cloud = f'{data_dir}/{subdir_daystring}/{fname_cloud}.laz'

    trafomat = read_corresponding_trafofile(f_cloud)
    cloud = read_las(f_cloud, trafomat=trafomat)

    epoch = py4dgeo.epoch.as_epoch(cloud)
    epoch.timestamp = row['timestamp']

    epoch_batch.append(epoch)

    # limit for a quick test run:
    #if idx > 5:
    #    break

analysis.add_epochs(*epoch_batch)
kathapand commented 2 years ago

The question/issue has become obsolete. I am posting the result here for documentation and potential future consideration:

APPROACH:

Instead of adding all epochs in a single call, compile them in batches (here batchsize = 500) and call add_epochs once per batch, so the space-time array is rearranged only once per batch.

THE CODE:

# loop over epochs and add in batches (except the reference epoch) [num_epochs = 2967]
batchsize = 500
b = 0
epoch_batch = []
num_epochs = len(df_timeseries)
for idx, row in df_timeseries.iterrows():
    print(f'Compiling epoch [{idx}/{num_epochs}]')
    fname_cloud = row['fname_timestring']
    if fname_cloud == fname_reference_cloud:
        continue
    # build the path from the current epoch's file name
    subdir_daystring = fname_cloud.split('_')[0]
    f_cloud = f'{data_dir}/{subdir_daystring}/{fname_cloud}.laz'

    trafomat = read_corresponding_trafofile(f_cloud)
    cloud = read_las(f_cloud, trafomat=trafomat)

    epoch = py4dgeo.epoch.as_epoch(cloud)
    epoch.timestamp = row['timestamp']

    epoch_batch.append(epoch)

    b += 1

    if b % batchsize == 0:
        print('Adding batch of epochs...')
        analysis.add_epochs(*epoch_batch)
        epoch_batch = []

# add the remaining epochs of the last (partial) batch
if epoch_batch:
    analysis.add_epochs(*epoch_batch)

THE OUTPUT:

[2022-06-18 10:39:57][INFO] Finished in 0.4481s: Adding epoch 500/500 to analysis object
[2022-06-18 10:39:57][INFO] Starting: Rearranging space-time array in memory
[2022-06-18 10:42:23][INFO] Finished in 146.0912s: Rearranging space-time array in memory
[2022-06-18 10:42:23][INFO] Starting: Updating disk-based analysis archive with new epochs
kathapand commented 2 years ago

The fifth batch of epochs (taking the total beyond 2,000 epochs) cannot be added:

File H:\conda_envs\py4dgeo_dev\lib\site-packages\py4dgeo\segmentation.py:384, in SpatiotemporalAnalysis.add_epochs(self, *epochs)
    380     # Load the uncertainty array and append new data
    381     uncertainty_file = os.path.join(tmp_dir, uncertainty_filename)
    382     write_func(
    383         uncertainty_file,
--> 384         np.concatenate(
    385             (self.uncertainties, np.column_stack(tuple(new_uncertainties))),
    386             axis=1,
    387         ),
    388     )
    390 # Dump the updated files into the archive
    391 with logger_context("Updating disk-based analysis archive with new epochs"):

File <__array_function__ internals>:180, in concatenate(*args, **kwargs)

MemoryError: Unable to allocate 66.4 GiB for an array with shape (709791, 2512) and data type [('lodetection', '<f8'), ('spread1', '<f8'), ('num_samples1', '<i8'), ('spread2', '<f8'), ('num_samples2', '<i8')]
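For context, the requested 66.4 GiB follows directly from the shape and the structured dtype in the traceback: each uncertainty record has five 8-byte fields, i.e. 40 bytes per core point and epoch. A quick check (a sketch using only numpy):

import numpy as np

# structured dtype of the uncertainty array, as printed in the MemoryError
uncertainty_dtype = np.dtype([
    ('lodetection', '<f8'), ('spread1', '<f8'), ('num_samples1', '<i8'),
    ('spread2', '<f8'), ('num_samples2', '<i8'),
])

n_corepoints, n_epochs = 709791, 2512
n_bytes = n_corepoints * n_epochs * uncertainty_dtype.itemsize  # itemsize == 40
print(n_bytes / 2**30)  # ~66.4 GiB, matching the error message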
dokempf commented 2 years ago

I have done similar studies where I added constant-size batches of epochs and observed dramatically increasing processing times due to the rearranging in memory. I will check whether np.concatenate/np.column_stack does something inefficient here by trying a C++ implementation of the rearranging step.
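For illustration only (this is a generic numpy pattern, not py4dgeo's actual implementation): growing an array by repeated np.concatenate copies the full existing array on every call, so the cost of the k-th batch grows with k; preallocating the final shape and filling column blocks in place avoids the repeated copies. A rough sketch with small, made-up sizes:

import numpy as np

n_corepoints, batch, n_batches = 1000, 50, 6  # small illustrative sizes

# pattern A: grow by concatenation -- every call copies the whole array
dists = np.empty((n_corepoints, 0))
for _ in range(n_batches):
    new_cols = np.random.rand(n_corepoints, batch)
    dists = np.concatenate((dists, new_cols), axis=1)

# pattern B: preallocate once, then fill column blocks in place
dists = np.empty((n_corepoints, n_batches * batch))
for i in range(n_batches):
    dists[:, i * batch:(i + 1) * batch] = np.random.rand(n_corepoints, batch)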

dokempf commented 2 years ago

As far as the out-of-memory issue is concerned: uncertainties take five times as much memory as distances. As you did not store them in your previous code base, you might not be able to process the same datasets with py4dgeo (on the same machine). I am not sure what can be done about that, unless storing the uncertainties is not a requirement.
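The factor of five follows from the dtypes: a distance is presumably a single float64 (8 bytes) per core point and epoch, while an uncertainty record (see the MemoryError above) has five 8-byte fields (40 bytes). A quick check:

import numpy as np

distance_dtype = np.dtype('<f8')  # 8 bytes per entry
uncertainty_dtype = np.dtype([
    ('lodetection', '<f8'), ('spread1', '<f8'), ('num_samples1', '<i8'),
    ('spread2', '<f8'), ('num_samples2', '<i8'),
])  # 40 bytes per entry

print(uncertainty_dtype.itemsize // distance_dtype.itemsize)  # -> 5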

kathapand commented 2 years ago

That makes sense... but it is right to store the uncertainties, even if they are not always required for every analysis step.

(A) Maybe when the memory policy is set to "MINIMAL", the arrays could be loaded and extended separately? This would increase the capacity, because for 4D-OBC extraction we currently only use the distances (in the future there will be an option to include uncertainties).

(B) As a general issue of handling huge 3D time series: I guess we need an option to perform the analysis on part of the data, e.g. by defining a timespan. Thinking of long-term monitoring, we will add ever more epochs but analyze only a certain timespan looking back (not always the full period since the start); a sketch of one possible workaround is shown below.
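A minimal sketch of idea (B) at the script level (a workaround outside py4dgeo, not an existing library option), assuming df_timeseries has a 'timestamp' column as in the code above: restrict the epoch table to the desired timespan before compiling and adding the epochs, so the analysis object only ever holds that window.

import pandas as pd

# hypothetical analysis window: only the last 90 days before t_end
t_end = pd.Timestamp('2022-06-01')
t_start = t_end - pd.Timedelta(days=90)

in_window = (df_timeseries['timestamp'] >= t_start) & (df_timeseries['timestamp'] <= t_end)
df_window = df_timeseries[in_window]

# the batch-wise loop above then iterates over df_window
# instead of the full df_timeseries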

dokempf commented 2 years ago

I tried a C++ implementation of the rearranging step and it turned out not to be a significant improvement over the status quo. Closing this now.