astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Memory issues while generating hips #295

Closed Schwarzam closed 1 month ago

Schwarzam commented 2 months ago

I'm generating HiPS over all of the S-PLUS DR4 dual photometry. The dataset is 160 GB, composed of 1,412 files.

We use Ubuntu 22.04 with 40 GB of RAM and 24 CPU cores.

If I run on a small fraction of the dataset, everything goes fine. But with the whole dataset I'm experiencing memory issues leading to errors.

I set Client(memory_limit="20GB") just to be safe.
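For context, in dask.distributed the memory_limit is applied per worker, not for the whole cluster, so the worker count matters. A minimal sizing sketch (the worker count and OS headroom here are assumptions, not values from this report):

```python
# Hypothetical sizing: divide the machine's RAM across dask workers,
# leaving some headroom for the OS and the scheduler.
total_ram_gb = 40      # machine RAM, from the report above
os_headroom_gb = 4     # assumption: reserve for the OS / scheduler
n_workers = 4          # assumption: fewer, larger workers for the reduce step

per_worker_gb = (total_ram_gb - os_headroom_gb) // n_workers
print(per_worker_gb)   # -> 9

# The limit is then per worker, e.g.:
# from dask.distributed import Client
# client = Client(n_workers=n_workers, memory_limit=f"{per_worker_gb}GB")
```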

In the reducing step, the warning below is raised multiple times after ~15% progress.

2024-05-08 13:51:47,158 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the
 memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the
-os for more information. -- Unmanaged memory: 13.09 GiB -- Worker memory limit: 18.63 GiB

Watching htop while it runs, memory increases until it hits the machine's maximum and then also fills the swap. After this, it starts to give errors like:

2024-05-08 13:51:58,116 - distributed.worker - WARNING - Compute Failed
Key:       reduce_pixel_shards-7ba00d127f1b0b6d31357c24fc765d79
Function:  reduce_pixel_shards
args:      ()
kwargs:    {'cache_shard_path': '/storage2/splus/HIPS/catalogs/dr4/dual/intermediate', 'resume_path': '/storage2/splus/HIPS/catalogs
/dr4/dual/intermediate', 'reducing_key': '2_72', 'destination_pixel_order': 2, 'destination_pixel_number': 72, 'destination_pixel_si
ze': 160841, 'output_path': '/storage2/splus/HIPS/catalogs/dr4/dual', 'ra_column': 'RA', 'dec_column': 'DEC', 'sort_columns': 'ID',
'add_hipscat_index': True, 'use_schema_file': None, 'use_hipscat_index': False, 'storage_options': None}
Exception: "FileNotFoundError('/storage2/splus/HIPS/catalogs/dr4/dual/intermediate/order_2/dir_0/pixel_72')"

Investigating the dask docs at https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os, it seems that a possible solution on Linux is to manually free memory with:

import ctypes

def trim_memory() -> int:
    # Ask glibc to return freed heap pages to the OS.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

client.run(trim_memory)

The problem is that this seems to free memory only within the client instance in the main thread.

Any idea on how to move on with this?

Schwarzam commented 2 months ago

This issue is directly related to #267

Schwarzam commented 2 months ago

A workaround was to lower the pixel_threshold. That worked for me.
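As a rough sketch of why lowering pixel_threshold helps: it caps the rows per output pixel, and so the size of each partition held in memory during reduce. A back-of-the-envelope estimate (all numbers here are hypothetical, not taken from this dataset) picks the threshold so a full pixel's file stays near a target size:

```python
# Hypothetical estimate of pixel_threshold from a sample file:
# measure average bytes per row, then size partitions to a target.
sample_file_bytes = 50_000_000        # assumption: one sample parquet file
sample_file_rows = 200_000            # assumption: rows in that file

bytes_per_row = sample_file_bytes / sample_file_rows
target_partition_bytes = 300_000_000  # assumption: ~300 MB per partition

pixel_threshold = int(target_partition_bytes / bytes_per_row)
print(pixel_threshold)                # -> 1200000
```

Smaller targets mean more, smaller partitions and a lower peak memory per reduce task.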

delucchi-cmu commented 2 months ago

That's interesting! We have a notebook to estimate what your pixel_threshold should be, based on your data. You could check whether the notebook's result matches your new value: https://hipscat-import.readthedocs.io/en/stable/notebooks/estimate_pixel_threshold.html

nevencaplar commented 1 month ago

I am closing this as it seems to be solved for now. Unmanaged memory issues continue to plague us...