DiamondLightSource / httomo

High-throughput tomography pipeline
https://diamondlightsource.github.io/httomo/

CPU memory leak from reader and writer objects not getting garbage collected across iterations over sections #264

Closed · yousefmoazzam closed 6 months ago

yousefmoazzam commented 6 months ago

Original encounter with the issue: running on real data

This issue was detected via an OOM error when running this pipeline on 20GB data. It was a single process run, on a SLURM node with a V100 GPU, and 100GB of RAM was requested for the interactive session.

Note that the issue only became apparent when using RAM as the backing for the store, because it was detected via an OOM error caused by RAM overflow.

Furthermore, even when RAM is the backing for the store, detection of this issue can fail in some configurations, due to not hitting an OOM error as quickly as in other cases.

High-level explanation of why the OOM error looked like it was happening

An OOM error occurred at the point where the pipeline execution output reported that the rescale_to_int method had started executing. 100GB of RAM was thought to be more than enough to hold all the necessary input/output numpy arrays at that point in the pipeline, so the OOM error suggested that there were large unneeded objects still stored in RAM that had not been freed earlier in the pipeline.

Below is a screenshot of the memory usage of httomo during a run replicating the same circumstances as the original encounter (same SLURM interactive session job params, such as a 100GB RAM limit):

[Image: original-run-killed]

On its own this isn't enough information to conclude what is happening, but note that, at the end, the pipeline has moved on to the rescale_to_int method (a new section) and is in the middle of attempting to write the first block of that new section.

In particular, for a RAM-based run this involves writing the entire chunk into an in-memory numpy array, so it's attempting to allocate a ~40GB numpy array in RAM (~40GB because the 20GB of uint16 input data is converted to float32, doubling its size).
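As a quick back-of-the-envelope check of that figure (illustrative arithmetic only, assuming the raw data is uint16 as shown in the log output further below):

raw_bytes = 20 * 1024**3        # ~20 GB of uint16 input data (2 bytes per element)
elements = raw_bytes // 2       # number of elements in the dataset
float32_bytes = elements * 4    # same elements stored as float32 (4 bytes per element)
print(float32_bytes / 1024**3)  # ~40.0, i.e. roughly 40 GB once converted to float32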

It's expected that memory from the previous section would be freed before the current section's first block is written to the data store, but this doesn't appear to be the case: there's no significant drop in memory usage before rescale_to_int attempts to write its first block.

Detailed explanation of how the OOM error could be happening

The expected behaviour is that, after section n completes, RAM usage resets to where it was before section n started. This is because the numpy array used as input for section n is no longer in use once section n+1 starts, and should therefore be cleaned up by the garbage collector, freeing the associated RAM. Put another way, the output numpy array for section n makes the input array for section n redundant/obsolete, hence the input for section n is no longer needed once section n+1 starts.
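A minimal sketch of that expectation (toy code, not httomo's implementation): once the only strong reference to a section's input is dropped, CPython's reference counting should free it straight away, which can be observed with a weak reference.

import weakref

import numpy as np


class SectionInput:
    """Toy stand-in for a section's input numpy array (illustrative only)."""

    def __init__(self, shape):
        self.data = np.zeros(shape, dtype=np.float32)


section_n_input = SectionInput((128, 128, 128))
observer = weakref.ref(section_n_input)

# Section n+1 starts: the runner's reference is reassigned to new data,
# which should leave the old input with no strong references...
section_n_input = SectionInput((128, 128, 128))

# ...so it should already have been collected at this point.
print(observer() is None)  # expected: True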

The very first section contains a loader, so it's a bit different from other sections in terms of what form its input comes in. For every section after the very first one, the input comes from a reader object (a DataSetStoreReader) that reads the numpy array written by the previous section's writer (a DataSetStoreWriter), and the output is written by the current section's own writer.

As described earlier, once section n has completed and the writer has written all output to the output numpy array, the input numpy array for section n is no longer needed. More broadly, the reader object for section n (which has a reference to the input numpy array for section n) is no longer needed.

If the input numpy arrays associated with reader objects weren't being garbage collected (which is implied by the earlier observations of seemingly leaking memory), then one way this could happen is if the reader objects themselves were not being garbage collected when we expected them to be.

Only the TaskRunner object has references to the reader and writer objects. However, given the reassignment of self.source and self.sink in the task runner, there didn't seem to be any obvious reason for the reader and/or writer objects to have a higher-than-expected reference count that would prevent them from being garbage collected.
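One generic way to test that kind of hypothesis (a debugging sketch, not code from httomo) is to keep a weak reference to the reader and, if it survives the reassignment, ask the garbage collector what still refers to it:

import gc
import weakref


class FakeReader:
    """Stand-in for a DataSetStoreReader holding the input array (illustrative only)."""


reader = FakeReader()
observer = weakref.ref(reader)

# Simulate the task runner reassigning self.source for the next section,
# dropping what should be the last strong reference to the old reader.
reader = None

leaked = observer()
if leaked is not None:
    # Something else still holds a strong reference; list the culprits.
    for referrer in gc.get_referrers(leaked):
        print(type(referrer), referrer)
else:
    print("reader was garbage collected as expected")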

Reproducing the issue with a pipeline intended to emphasise the memory leak

A simpler pipeline was created to test the hypothesis that reader and/or writer objects were not being garbage collected. The pipeline involved artificially creating many sections, to see whether memory usage accumulated across sections without being freed.

The pipeline used was the following:

- method: standard_tomo
  module_path: httomo.data.hdf.loaders
  parameters:
    name: tomo
    data_path: entry1/tomo_entry/data/data
    image_key_path: entry1/tomo_entry/instrument/detector/image_key
    rotation_angles:
      data_path: /entry1/tomo_entry/data/rotation_angle
- method: paganin_filter_tomopy
  module_path: httomolibgpu.prep.phase
  parameters:
    pixel_size: 0.5
    dist: 50.0
    energy: 53.0
    alpha: 0.0001
- method: remove_stripe_based_sorting
  module_path: httomolibgpu.prep.stripe
  parameters:
    size: 11
    dim: 1
- method: paganin_filter_tomopy
  module_path: httomolibgpu.prep.phase
  parameters:
    pixel_size: 0.5
    dist: 50.0
    energy: 53.0
    alpha: 0.0001
- method: remove_stripe_based_sorting
  module_path: httomolibgpu.prep.stripe
  parameters:
    size: 11
    dim: 1
- method: paganin_filter_tomopy
  module_path: httomolibgpu.prep.phase
  parameters:
    pixel_size: 0.5
    dist: 50.0
    energy: 53.0
    alpha: 0.0001
- method: remove_stripe_based_sorting
  module_path: httomolibgpu.prep.stripe
  parameters:
    size: 11
    dim: 1
- method: paganin_filter_tomopy
  module_path: httomolibgpu.prep.phase
  parameters:
    pixel_size: 0.5
    dist: 50.0
    energy: 53.0
    alpha: 0.0001
- method: remove_stripe_based_sorting
  module_path: httomolibgpu.prep.stripe
  parameters:
    size: 11
    dim: 1

Notice how the alternating paganin filter (projection pattern) and stripe removal (sinogram pattern) methods cause multiple sections to be created, since each change of pattern forces a new section; see the sketch below.
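As a rough illustration of why the pattern changes matter (a simplified sketch of the grouping logic, not httomo's actual section-splitting code):

def split_into_sections(methods):
    """Group consecutive methods sharing the same pattern (illustrative only).

    `methods` is a list of (name, pattern) tuples; each change of pattern
    closes the current section and starts a new one.
    """
    sections = []
    current = []
    for name, pattern in methods:
        if current and pattern != current[-1][1]:
            sections.append(current)
            current = []
        current.append((name, pattern))
    if current:
        sections.append(current)
    return sections


pipeline = [
    ("paganin_filter_tomopy", "projection"),
    ("remove_stripe_based_sorting", "sinogram"),
    ("paganin_filter_tomopy", "projection"),
    ("remove_stripe_based_sorting", "sinogram"),
]
print(len(split_into_sections(pipeline)))  # 4: every pattern switch starts a new section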

Memory usage during run on full data with artificial multiple sections pipeline

The following image shows the memory usage of a single-process run on 20GB data with the above pipeline. The SLURM interactive session was requested with 360GB of RAM, to allow the httomo run to keep going and thus highlight the memory leak. Note how the memory usage increases across sections without significantly decreasing:

[Image: memory-leak-highlighted]
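For anyone wanting to reproduce such a trace, here is a minimal sketch (assuming psutil is installed; this is not necessarily how the plot above was generated) that periodically samples the process's resident memory in a background thread:

import threading
import time

import psutil


def monitor_rss(stop_event, interval_s=1.0):
    """Print the process's resident set size periodically (illustrative only)."""
    proc = psutil.Process()
    while not stop_event.is_set():
        rss_gb = proc.memory_info().rss / 1024**3
        print(f"RSS: {rss_gb:.2f} GB")
        time.sleep(interval_s)


stop = threading.Event()
threading.Thread(target=monitor_rss, args=(stop,), daemon=True).start()
# ... run the pipeline here ...
stop.set()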

Misc info: print statements during a run on small data to try to debug the issue

Below is output containing various print statements. It's messy, but the main point is shown by the finalize() messages that appear after "Pipeline finished. Took 2.690s": the reader and writer objects from earlier sections are only cleaned up once the program terminates.

This confirmed the suspicion that the reader and writer objects (and thus the numpy arrays they reference) were not being garbage collected in the appropriate/expected places during pipeline execution, and that it was only upon program termination that they were garbage collected.

Running loader (pattern=projection): standard_tomo...
The full dataset shape is (220, 128, 160)
Loading data: /dls/science/users/twi18192/httomo/tests/test_data/tomo_standard.nxs
Path to data: entry1/tomo_entry/data/data
Preview: (0:180, 0:128, 0:160)
Data shape is (180, 128, 160) of type uint16
    Finished loader (pattern=projection): standard_tomo (httomo) Took 23.13ms
id of reassigned source is 140229659079728, has type <class 'httomo.loaders.standard_tomo_loader.StandardTomoLoader'>
id of reassigned sink is 140229659078048, had type <class 'httomo.data.dataset_store.DataSetStoreWriter'>
Running centering (pattern=projection): find_center_vo...
    Finished centering (pattern=projection): find_center_vo (httomolibgpu) Took 763.97ms
Running task_2 (pattern=projection): remove_outlier...
    Finished task_2 (pattern=projection): remove_outlier (httomolibgpu) Took 288.40ms
Running task_2 (pattern=projection): data_reducer...
    Finished task_2 (pattern=projection): data_reducer (httomolib) Took 0.54ms
Running task_3 (pattern=projection): normalize...
    Finished task_3 (pattern=projection): normalize (httomolibgpu) Took 42.18ms
Running task_4 (pattern=projection): paganin_filter_tomopy...
    Finished task_4 (pattern=projection): paganin_filter_tomopy (httomolibgpu) Took 155.64ms
Before reassinging self.source, it has type <class 'httomo.loaders.standard_tomo_loader.StandardTomoLoader'>
Before calling source.finalize() in DataSetSToreReader constructor
Writer finalize() method is called on object ID 140229659078048, has type <class 'httomo.data.dataset_store.DataSetStoreWriter'>
After calling source.finalize() in DataSetSToreReader constructor
After reassinging self.source, it has type <class 'httomo.data.dataset_store.DataSetStoreReader'>
id of reassigned source is 140229659048736, has type <class 'httomo.data.dataset_store.DataSetStoreReader'>
id of reassigned sink is 140229659034768, had type <class 'httomo.data.dataset_store.DataSetStoreWriter'>
Running task_5 (pattern=sinogram): remove_stripe_based_sorting...
    Finished task_5 (pattern=sinogram): remove_stripe_based_sorting (httomolibgpu) Took 307.37ms
Running task_6 (pattern=sinogram): FBP...
    Finished task_6 (pattern=sinogram): FBP (httomolibgpu) Took 140.78ms
Running save_task_6 (pattern=sinogram): save_intermediate_data...
    Finished save_task_6 (pattern=sinogram): save_intermediate_data (httomo) Took 9.14ms
Running task_7 (pattern=sinogram): median_filter...
    Finished task_7 (pattern=sinogram): median_filter (httomolibgpu) Took 132.68ms
Running statistics (pattern=sinogram): calculate_stats...
    Finished statistics (pattern=sinogram): calculate_stats (httomo) Took 22.23ms
Before reassinging self.source, it has type <class 'httomo.data.dataset_store.DataSetStoreReader'>
Before calling source.finalize() in DataSetSToreReader constructor
Writer finalize() method is called on object ID 140229659034768, has type <class 'httomo.data.dataset_store.DataSetStoreWriter'>
After calling source.finalize() in DataSetSToreReader constructor
After reassinging self.source, it has type <class 'httomo.data.dataset_store.DataSetStoreReader'>
id of reassigned source is 140229786897664, has type <class 'httomo.data.dataset_store.DataSetStoreReader'>
id of reassigned sink is 140229659047056, had type <class 'httomo.runner.dataset_store_interfaces.DummySink'>
Running task_9 (pattern=sinogram): rescale_to_int...
    Finished task_9 (pattern=sinogram): rescale_to_int (httomolibgpu) Took 44.61ms
Running task_10 (pattern=sinogram): save_to_images...
    Finished task_10 (pattern=sinogram): save_to_images (httomolib) Took 116.37ms
Pipeline finished. Took 2.690s
Reader finalize() method is called for object ID 140229786897664, has type <class 'httomo.data.dataset_store.DataSetStoreReader'>
self._data.shape is (160, 128, 160)
Writer finalize() method is called on object ID 140229659034768, has type <class 'httomo.data.dataset_store.DataSetStoreWriter'>
Reader finalize() method is called for object ID 140229659048736, has type <class 'httomo.data.dataset_store.DataSetStoreReader'>
self._data.shape is (180, 128, 160)
Writer finalize() method is called on object ID 140229659078048, has type <class 'httomo.data.dataset_store.DataSetStoreWriter'>
Loading's finalize() method is called on object ID 140229659079728
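The prints above were added by hand, but the same "when is this object actually collected?" question can also be answered less intrusively with weakref.finalize (a generic sketch, not the instrumentation used in the output above); a message that only appears at interpreter shutdown points to the same leak.

import weakref


class FakeWriter:
    """Stand-in for a DataSetStoreWriter (illustrative only)."""


writer = FakeWriter()
# The callback fires when the object is garbage collected; if it only fires
# at program termination, the object lived longer than expected.
weakref.finalize(writer, print, f"writer {id(writer)} was collected")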
yousefmoazzam commented 6 months ago

Fixed by #265