askap-vast / vast-pipeline

This repository holds the code of the Radio Transient detection pipeline for the VAST project.
https://vast-survey.org/vast-pipeline/

Potential high memory usage at new sources rms measurements step #649

Open ajstewart opened 2 years ago

ajstewart commented 2 years ago

When there is a very high number of new sources (most likely single-epoch detections) combined with a large number of images, the new source analysis has the potential to become unwieldy.

In the example error below, the run consisted of short-timescale images that are very susceptible to single-epoch artefacts. There were roughly 3000 images in the run, each with 1 - 10 measurements. Assuming half of the total measurements were single-epoch new sources, say an average of 5 per image, that's 3000 * 5 sources that each need to be measured in the 2999 other images - just short of 45 million rms measurements required...
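For a rough sense of scale, a back-of-envelope estimate (using the approximate figures from this run, not exact counts):

```python
# Back-of-envelope estimate of the rms-measurement workload for this run.
n_images = 3000                     # approximate number of images in the run
new_per_image = 5                   # assumed average single-epoch new sources per image

n_new_sources = n_images * new_per_image                 # ~15,000 new sources
n_rms_measurements = n_new_sources * (n_images - 1)      # each measured in every other image

print(f"{n_rms_measurements:,} rms measurements")         # 44,985,000 - just short of 45 million

# A single int64 column of that length is already ~0.34 GiB.
print(f"{n_rms_measurements * 8 / 1024**3:.2f} GiB per int64 column")
```

Note the failed allocation in the traceback below is for 17,295,864,709 int64 values (~129 GiB), so the merge is producing an intermediate array hundreds of times larger than the raw measurement count.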

In particular, this became a problem at the stage of the new source analysis where the dataframes are merged after fetching the rms pixel measurements.

This could be reduced by addressing #327 and making sure the dataframes are as lightweight as possible. There may also be scope to improve this dataframe stage of the new source analysis to avoid such a huge merge.
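As a sketch of the "lightweight dataframes" idea (the column names in the commented usage line are illustrative, not necessarily the actual columns in new_sources.py): downcasting numeric dtypes and converting repeated strings to categoricals before the merge shrinks both inputs and the merged result.

```python
import pandas as pd


def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and categorise repeated strings to cut memory before a merge."""
    out = df.copy()
    for col in out.select_dtypes(include="float").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")    # float64 -> float32 where safe
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")  # int64 -> int32/int16 where safe
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype("category")                  # e.g. image names repeated many times
    return out


# Hypothetical usage, mirroring the merge in parallel_get_rms_measurements:
# df = shrink(df).merge(shrink(rms_df), on=["source", "img_diff_rms_path"], how="left")
```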

This problem can also be mitigated by tweaking the pipeline settings, namely raising the new source minimum rms image threshold in the config to a very high value, which effectively 'turns off' the new source stage. Source monitoring should probably be turned off as well. Basic association could also be employed to eliminate many-to-one and many-to-many associations.
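Roughly, the workaround in the run configuration would look like the sketch below. The key names here are placeholders and may not match the current config schema exactly, so check the run configuration docs before using them.

```yaml
# Illustrative sketch only - key names are placeholders, see the run config documentation.
new_sources:
  min_new_source_sigma: 1000.0   # set very high to effectively 'turn off' the new source stage
source_monitoring:
  monitor: false                 # turn off forced-fit monitoring as well
source_association:
  method: basic                  # basic association avoids many-to-one / many-to-many relations
```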

Eventually some stages of the pipeline will have to be revisited in general to see how the pandas memory footprint can be reduced, either by refactoring or by bringing in other tools. The Dask Cluster transition (#335) could also open up other avenues for how the data is processed.
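As one possible avenue once the Dask transition lands, an out-of-core merge with dask.dataframe keeps the peak footprint bounded by partition size rather than the size of the full join result. A minimal sketch, not the pipeline's actual code (`df`, `rms_df` and the column names stand in for the frames merged in parallel_get_rms_measurements):

```python
import dask.dataframe as dd
import pandas as pd

# Placeholder frames standing in for the real inputs to the merge.
df = pd.DataFrame({"source": [1, 2], "image": ["A", "B"], "flux_peak": [1.2, 3.4]})
rms_df = pd.DataFrame({"source": [1, 2], "image": ["A", "B"], "true_rms": [0.3, 0.5]})

left = dd.from_pandas(df, npartitions=64)
right = dd.from_pandas(rms_df, npartitions=64)

# The join is evaluated partition by partition, so peak memory is bounded by the
# partition size rather than by one giant indexer array for the whole join.
new_sources_df = left.merge(right, on=["source", "image"], how="left").compute()
```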

2022-03-30 20:49:24,538 new_sources INFO Starting new source analysis.
2022-03-30 21:40:15,414 runpipeline ERROR Processing error:
Unable to allocate 129. GiB for an array with shape (17295864709,) and data type int64
Traceback (most recent call last):
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/management/commands/runpipeline.py", line 340, in run_pipe
    pipeline.process_pipeline(p_run)
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/pipeline/main.py", line 256, in process_pipeline
    new_sources_df = new_sources(
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/pipeline/new_sources.py", line 413, in new_sources
    new_sources_df = parallel_get_rms_measurements(
  File "/usr/src/vast-pipeline/vast-pipeline-dev/vast_pipeline/pipeline/new_sources.py", line 233, in parallel_get_rms_measurements
    df = df.merge(
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 9339, in merge
    return merge(
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 122, in merge
    return op.get_result()
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 716, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 967, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 941, in _get_join_indexers
    return get_join_indexers(
  File "/usr/src/vast-pipeline/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1509, in get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)  # type: ignore[operator]
  File "pandas/_libs/join.pyx", line 101, in pandas._libs.join.left_outer_join
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 129. GiB for an array with shape (17295864709,) and data type int64