caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0

flag_autopowerspec fails with 'IOError: [Errno 28] No space left on device ' #1561

Open spectram opened 5 months ago

spectram commented 5 months ago

While running the flag worker on a 32K-channel, 100 MHz dataset on ilifu, the job fails at the autopowerspec flagging step. The full log is attached. The error that stands out is "IOError: [Errno 28] No space left on device", and an "All-NaN slice encountered" warning appears just before it.

The yaml inputs are as follows.

flag__calib:
  enable: true
  field: calibrators
  label_in: cal
  flag_autocorr:   
    enable: true
  flag_spw:
    enable: true 
    chans:  '*:1419.8~1421.3MHz'
    ensure_valid: true
  flag_mask:
    enable: true
    mask: labelled_rfimask.pickle.npy
    uvrange: '0~1000'
  flag_shadow:
    enable: true
    full_mk64: true
  flag_autopowerspec:
    enable: true
  flag_rfi:
    enable: true
    flagger: aoflagger
    aoflagger:
      strategy: firstpass_Q.rfis

The scratch3 mount has sufficient storage and the job has more than enough RAM (160 GB across 10 cores; seff reports about 25% memory efficiency). Jeremy also confirmed that there were no alerts about the local disk on the compute node reaching capacity. Two attempts on this dataset with slightly different memory allocations have yielded the same result. I haven't encountered this error on other datasets. Please advise.

log-caracal-autopowerspec.txt

paoloserra commented 5 months ago

@bennahugo could you have a look?

bennahugo commented 5 months ago

I'm not very familiar with the Ilifu cluster setup, so I can't really comment there. To me it looks like a space problem in your run directory on the cluster when making plots at high resolution. Depending on your quota allocations, you may need to run from somewhere with more space rather than from home.

Default for the plotting is 300 dpi. Maybe setting this much lower in your recipe may help?

https://github.com/ratt-ru/Stimela-classic/blob/b51b98f530faa016b93351508e14fbfb6a45554e/stimela/cargo/cab/politsiyakat_autocorr_amp/parameters.json#L64C1-L69C12

I don't use this software any more though. It is much better and more reliable to flag GNSS saturation, LNA cycling errors and dropouts by hand, and I would recommend this approach. Plotting the autocorrelations (*&&& notation in CASA) and flagging the relevant time periods is quick to do.

paoloserra commented 5 months ago

@spectram unfortunately the dpi is not a user setting in caracal. If you want to test @bennahugo 's idea you could try to set the dpi to a much lower value in https://github.com/caracal-pipeline/caracal/blob/25161c2b6ab02c1a76becdc340e7e6611f905607/caracal/workers/flag_worker.py#L130 .
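
For a rough sense of how much the dpi setting matters, here is a back-of-the-envelope sketch. It assumes that plot_size (6 in the failing command) is the side length of a square figure in inches, which is a guess and not something confirmed from the politsiyakat source. It also counts uncompressed RGBA pixels, so real PNG files will be smaller. The function name raster_bytes is made up for this example.

```python
def raster_bytes(side_inches, dpi, bytes_per_pixel=4):
    """Uncompressed size of a square raster: (inches * dpi)^2 pixels, RGBA by default."""
    pixels_per_side = int(round(side_inches * dpi))
    return pixels_per_side * pixels_per_side * bytes_per_pixel


if __name__ == "__main__":
    # Assumption: plot_size=6 means a 6 x 6 inch figure.
    for dpi in (300, 100):
        size_mb = raster_bytes(6, dpi) / 1e6
        print(f"dpi={dpi}: about {size_mb:.2f} MB uncompressed per plot")
```

Going from 300 to 100 dpi cuts the pixel count by a factor of nine. That could help if the target filesystem is only slightly too small, but it would not rescue a write to a location with almost no free space.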

spectram commented 5 months ago

Thanks @paoloserra and @bennahugo. The mount should have sufficient space (83 TB remaining, I believe). I will check with the ilifu helpdesk once again. I can also try modifying the dpi parameter in the code to see if that resolves the issue. Further, if space were the issue, other datasets would have produced the same error, right?

bennahugo commented 5 months ago

Not sure -- it is clearly an IO error. It is likely not the total disk capacity that is an issue, but your quotas (quota -s). Most likely it is to do with where you are running. I can imagine making waterfall plots of 32k channels at high resolution may cause space issues if you are running from an entrypoint where there isn't a lot of quota (home instead of scratch space?)
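
One way to narrow this down is to check the filesystem the process actually writes to. The failing command uses "output_dir": "./", so the relevant location is the working directory as seen by the job, not necessarily the scratch3 share as a whole. Errno 28 can also be raised when a filesystem runs out of inodes while bytes are still free. The sketch below uses only the Python standard library on a Unix-like system; report_space is a hypothetical helper name. Ideally it would be run from the same directory and environment as the failing step.

```python
import os
import shutil
import tempfile


def report_space(path="."):
    """Summarise free bytes and free inodes on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)   # total, used, free in bytes
    vfs = os.statvfs(path)            # includes inode counts (Unix only)
    return {
        "path": os.path.abspath(path),
        "free_gib": usage.free / 2**30,
        "free_inodes": vfs.f_favail,  # inodes available to unprivileged users
        "total_inodes": vfs.f_files,
    }


if __name__ == "__main__":
    # Check the current directory and the temporary directory in use.
    for location in (".", tempfile.gettempdir()):
        print(report_space(location))
```

If free_gib or free_inodes is close to zero for the directory the plots are written to, that would be consistent with the IOError even when the share itself has plenty of capacity.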

spectram commented 5 months ago

I changed dpi to 100 and the error persists. Perhaps it has something to do with the "All-NaN slice encountered" runtime warning. I am attaching the latest log below (it is the same as the previous log).

From Jeremy (ilifu):

We don't have quotas implemented on the filesystem, only the total share size would impact the capacity, and we would have been alerted if any of the shares had been maxed out. Maybe it's worth looking into the All-NaN slice aspect of the error?

I am leaning towards not using flag_autopowerspec to circumvent the issue.

log-caracal_dpi100.txt

bennahugo commented 5 months ago

Not sure. Perhaps something goes wrong at the chunk boundaries when there is no unflagged data in a chunk. You could try adjusting the chunk length: https://github.com/ratt-ru/Stimela-classic/blob/b51b98f530faa016b93351508e14fbfb6a45554e/stimela/cargo/cab/politsiyakat_autocorr_amp/parameters.json#L47-L50

I would just disable this step and flag saturation events by hand though :)
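
On the All-NaN warning itself: np.nanmin and np.nanmax return NaN (with a RuntimeWarning) when every value in the slice is NaN, which can happen when a field or correlation has no usable data. A minimal sketch of a guard follows. The names db_scale and fallback are hypothetical and are not part of politsiyakat, and nothing in this thread establishes that the warning is related to the Errno 28 failure.

```python
import numpy as np


def db_scale(heatmap, fallback=(-1.0, 1.0)):
    """Return (min, max) in dB over finite entries, or `fallback` if none are finite."""
    # Suppress the divide-by-zero / invalid warnings that log10 emits for 0 and NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        db = 10.0 * np.log10(heatmap)
    finite = db[np.isfinite(db)]      # drops NaN and +/-inf
    if finite.size == 0:
        return fallback               # nothing plottable in this slice
    return float(finite.min()), float(finite.max())


if __name__ == "__main__":
    print(db_scale(np.full((2, 3), np.nan)))     # all-NaN input uses the fallback
    print(db_scale(np.array([0.0, 1.0, 100.0]))) # zero maps to -inf and is ignored
```

Restricting the colour scale to finite values keeps the plot limits well defined even for slices that are fully flagged.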

Athanaseus commented 2 weeks ago
# politsiyakat - 2024-01-18 08:33:54,900 INFO - Updating flags for chunk 68 of 68...
# politsiyakat - 2024-01-18 08:33:54,924 INFO -     Reading MS
# politsiyakat - 2024-01-18 08:33:55,229 INFO -     Selecting field J1331+3030...
# politsiyakat - 2024-01-18 08:33:55,230 INFO -         Nothing to be done for this field
# politsiyakat - 2024-01-18 08:33:55,230 INFO -     Selecting field J1726-5529...
# politsiyakat - 2024-01-18 08:33:55,230 INFO -         Nothing to be done for this field
# politsiyakat - 2024-01-18 08:33:55,230 INFO -     Selecting field J1939-6342...
# politsiyakat - 2024-01-18 08:33:55,230 INFO -         Updating flags for scan 15
# politsiyakat - 2024-01-18 08:33:55,410 INFO -         Writing flag buffer back to disk...
# politsiyakat - 2024-01-18 08:33:55,512 INFO - Creating waterfall plots:
# politsiyakat - 2024-01-18 08:33:55,513 INFO -     Interpolating onto a common axis...
# politsiyakat - 2024-01-18 08:34:25,812 INFO -          J1331+3030 Done...
# /usr/local/lib/python2.7/dist-packages/politsiyakat/modules/flag_tasks.py:941: RuntimeWarning: divide by zero encountered in log10
#   dbheatmaps = 10 * np.log10(heatmaps[:, :, :, :, :])
# /usr/local/lib/python2.7/dist-packages/politsiyakat/modules/flag_tasks.py:945: RuntimeWarning: All-NaN axis encountered
#   scale_min = np.nanmin(dbheatmaps[field_i, :, corr, :, :])
# /usr/local/lib/python2.7/dist-packages/politsiyakat/modules/flag_tasks.py:946: RuntimeWarning: All-NaN slice encountered
#   scale_max = np.nanmax(dbheatmaps[field_i, :, corr, :, :])
# politsiyakat - 2024-01-18 08:38:48,844 INFO -          J1726-5529 Done...
# politsiyakat - 2024-01-18 08:43:04,123 INFO -          J1939-6342 Done...
# politsiyakat - 2024-01-18 08:46:05,277 INFO - Waiting for remaining jobs to finish...
# Traceback (most recent call last):
#   File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
#     "__main__", fname, loader, pkg_name)
#   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
#     exec code in run_globals
#   File "/usr/local/lib/python2.7/dist-packages/politsiyakat/__main__.py", line 25, in <module>
#     politsiyakat.main(sys.argv[1:])
#   File "/usr/local/lib/python2.7/dist-packages/politsiyakat/__init__.py", line 137, in main
#     run_func(**args.kwargs)
#   File "/usr/local/lib/python2.7/dist-packages/politsiyakat/modules/flag_tasks.py", line 968, in flag_autocorr_drifts
#     corr))
#   File "/usr/local/lib/python2.7/dist-packages/matplotlib/figure.py", line 2062, in savefig
#     self.canvas.print_figure(fname, **kwargs)
#   File "/usr/local/lib/python2.7/dist-packages/matplotlib/backend_bases.py", line 2263, in print_figure
#     **kwargs)
#   File "/usr/local/lib/python2.7/dist-packages/matplotlib/backends/backend_agg.py", line 532, in print_png
#     self.figure.dpi, metadata=metadata)
#   File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
#     self.gen.throw(type, value, traceback)
#   File "/usr/local/lib/python2.7/dist-packages/matplotlib/cbook/__init__.py", line 629, in open_file_cm
#     yield fh
# IOError: [Errno 28] No space left on device
# Traceback (most recent call last):
#   File "/stimela_mount/code/run.py", line 39, in <module>
#     subprocess.check_call(shlex.split(_runc))
#   File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
#     raise CalledProcessError(retcode, cmd)
# subprocess.CalledProcessError: Command '['python', '-m', 'politsiyakat', 'flag_autocorr_drifts', '-s', 'antenna_mod', '{"plot_size": 6, "nrows_chunk": 5000, "data_column": "DATA", "scan_to_scan_threshold": 3, "cal_field": "0,1,2", "nio_threads": 1, "field": "0,1,2", "output_dir": "./", "simulate": false, "msname": "/stimela_mount/msdir/1702784778_sdp_l0-cal.ms", "nproc_threads": 8, "dpi": 300, "antenna_to_group_threshold": 5}']' returned non-zero exit status 1
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR: cd /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 && singularity run --workdir /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 --containall returns error code 1
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR: job failed at 2024-01-18 08:46:07.154709 after 0:32:01.159360
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR: Traceback (most recent call last):
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal_env/lib/python3.9/site-packages/stimela/recipe.py", line 713, in run
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:     job.run_job()
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal_env/lib/python3.9/site-packages/stimela/recipe.py", line 425, in run_job
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:     self.job.run(output_wrangler=self.apply_output_wranglers)
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal_env/lib/python3.9/site-packages/stimela/singularity.py", line 123, in run
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:     utils.xrun(f"cd {self.execdir} && singularity run --workdir {self.execdir} --containall",
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal_env/lib/python3.9/site-packages/stimela/utils/xrun_poll.py", line 227, in xrun
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR:     raise StimelaCabRuntimeError("{} returns error code {}".format(command_name, status))
2024-01-18 08:46:07 CARACal.Stimela.flag__calib-autopowerspec-ms0 ERROR: stimela.utils.StimelaCabRuntimeError: cd /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 && singularity run --workdir /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 --containall returns error code 1
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Completed jobs : ['save-P2_flag__calib_before-ms0']
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Remaining jobs : ['flag__calib-autocorr-ms0', 'flag__calib-shadow-ms0', 'flag__calib-spw-ms0', 'flag__calib-mask-ms0', 'flag__calib-rfi-ms0', 'flag__calib-summary-ms0']
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Logging remaining task: flag__calib-autocorr-ms0:: Flag auto-correlations ms=1702784778_sdp_l0-cal.ms
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Logging remaining task: flag__calib-shadow-ms0:: Flag shadowed antennas ms=1702784778_sdp_l0-cal.ms
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Logging remaining task: flag__calib-spw-ms0::Flag out channels ms=1702784778_sdp_l0-cal.ms
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Logging remaining task: flag__calib-mask-ms0:: Apply flag mask ms=1702784778_sdp_l0-cal.ms
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Logging remaining task: flag__calib-rfi-ms0:: AOFlagger auto-flagging flagging pass ms=1702784778_sdp_l0-cal.ms fields=J1331+3030,J1726-5529,J1939-6342
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Logging remaining task: flag__calib-summary-ms0:: Flagging summary  ms=1702784778_sdp_l0-cal.ms
2024-01-18 08:46:07 CARACal.Stimela.flag__calib INFO: Saving pipeline information in .last_flag__calib.json
2024-01-18 08:46:07 CARACal ERROR: Job 'flag__calib-autopowerspec-ms0:: Flag out antennas with drifts in autocorrelation powerspectra ms=1702784778_sdp_l0-cal.ms' failed: cd /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 && singularity run --workdir /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 --containall returns error code 1 [PipelineException]
2024-01-18 08:46:07 CARACal INFO:   More information can be found in the logfile at output/logs-20240118-075741/log-caracal.txt
2024-01-18 08:46:07 CARACal INFO:   You are running version 1.0.6-284-g4f716f7c
2024-01-18 08:46:07 CARACal ERROR: Traceback (most recent call last):
2024-01-18 08:46:07 CARACal ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal/caracal/main.py", line 189, in __run
2024-01-18 08:46:07 CARACal ERROR:     pipeline.run_workers()
2024-01-18 08:46:07 CARACal ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal/caracal/workers/worker_administrator.py", line 441, in run_workers
2024-01-18 08:46:07 CARACal ERROR:     worker.worker(self, recipe, config)
2024-01-18 08:46:07 CARACal ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal/caracal/workers/flag_worker.py", line 526, in worker
2024-01-18 08:46:07 CARACal ERROR:     recipe.run()
2024-01-18 08:46:07 CARACal ERROR:   File "/scratch3/users/spectram/caracal_mod/caracal_env/lib/python3.9/site-packages/stimela/recipe.py", line 764, in run
2024-01-18 08:46:07 CARACal ERROR:     raise PipelineException(exc, self.completed, job, self.remaining) from None
2024-01-18 08:46:07 CARACal ERROR: stimela.exceptions.PipelineException: Job 'flag__calib-autopowerspec-ms0:: Flag out antennas with drifts in autocorrelation powerspectra ms=1702784778_sdp_l0-cal.ms' failed: cd /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 && singularity run --workdir /scratch3/projects/meerrings/AM1724-622/mkt-HI/.stimela_workdir-1705558428562042 --containall returns error code 1
2024-01-18 08:46:07 CARACal INFO: exiting with error code 1