Thomas-Moore-Creative / Climatology-generator-demo

A demonstration / MVP to show how one could build an "interactive" climatology & compositing tool on Gadi HPC.
MIT License

solve HPC stderr & stdout issues #5

Closed by Thomas-Moore-Creative 6 months ago

Thomas-Moore-Creative commented 7 months ago

I have very large rechunking operations (using rechunker) that I'm attempting to run on HPC. I'm having an issue where these long jobs (~24 hours) are being killed due to their excessively large standard output and error streams.
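For context, a rechunker workload of this shape looks roughly like the sketch below; the store paths, chunk sizes, and memory budget are all hypothetical placeholders, not the actual job.

import zarr
from rechunker import rechunk

source = zarr.open("input.zarr")           # hypothetical source store
plan = rechunk(
    source,
    target_chunks=(100, 100, 1000),        # hypothetical target chunking
    max_mem="4GB",                         # per-worker memory budget
    target_store="rechunked.zarr",
    temp_store="rechunker-temp.zarr",      # intermediate store for the shuffle
)
plan.execute()                             # runs as a Dask computation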

I can redirect my STDERR & STDOUT to files on my storage, but it made me think about turning off Dask logging, which makes up 99% of the output content. SEE THIS NCI DOC

However, this is not working. For example, a LocalCluster with these settings:

import dask
import distributed

with dask.config.set({"distributed.scheduler.worker-saturation": 1.0,
                      "distributed.nanny.pre-spawn-environ.MALLOC_TRIM_THRESHOLD_": 0,
                      "logging.distributed": "error"}):
    client = distributed.Client()

will still report warnings and info messages like:

2024-04-07 04:40:27,842 - distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)

and

UserWarning: Sending large graph of size 2.02 GiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
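Worth noting (an aside, not something confirmed in this thread): the two messages travel through different channels. The GC message is a log record from the distributed.utils_perf logger, while the large-graph message is a Python UserWarning raised via warnings.warn, so a logging level alone will not suppress it. A minimal sketch that silences both, assuming the standard logging and warnings machinery:

import logging
import warnings

import distributed

# Raise the level on distributed's logger hierarchy so INFO/WARNING
# records (e.g. the utils_perf GC message) are dropped.
logging.getLogger("distributed").setLevel(logging.ERROR)

# The "Sending large graph" message is a UserWarning, not a log
# record, so it needs a warnings filter instead.
warnings.filterwarnings("ignore", message="Sending large graph")

# LocalCluster / Client also accept silence_logs to set the log level
# used for workers started by the local cluster.
client = distributed.Client(silence_logs=logging.ERROR)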
Thomas-Moore-Creative commented 6 months ago

I'm still not confident that I understand how to fully control warnings and errors, but this approach in the PBS script gives me a live log exported to a custom file named with the PBS job ID:

python -u ../composited_stats_BRAN2020.py > ./logs/$PBS_JOBID.log 2>&1
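For completeness, a hypothetical sketch of how that line could sit inside a full PBS script; the resource requests are placeholders, not the actual job:

#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l ncpus=48
#PBS -l mem=190GB

# Run from the directory the job was submitted from and make sure
# the log directory exists before redirecting into it.
cd $PBS_O_WORKDIR
mkdir -p ./logs

# -u disables Python's output buffering so the log updates live;
# 2>&1 merges STDERR into the same file, named by the PBS job ID.
python -u ../composited_stats_BRAN2020.py > ./logs/$PBS_JOBID.log 2>&1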