caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0

Unable to set aoflagger to `readmode: indirect` on ILIFU: error >> `Could not allocate temporary file 'aoflagger-data.tmp': posix_fallocate returned 28.` #1506

Closed pjmac1105 closed 1 month ago

pjmac1105 commented 1 year ago

hi team,

I am running CARACal on ILIFU. When I try to run the flag worker with flag_rfi: aoflagger: readmode: 'indirect', it fails with file allocation errors.

The ILIFU cluster has fallocate installed. I checked with ILIFU support, and they advised that the disk has space and that no quota is in place on the scratch3/ directory that might affect this.

Flag worker scripted as follows:

flag__2:
  enable: true
  field: target
  label_in: corr
  flag_spw:
    enable: true
  flag_rfi:
    enable: true
    col: DATA
    flagger: aoflagger
    aoflagger:
      strategy: firstpass_QUV.rfis
      # readmode: indirect

Full log attached, here is the error:

# Starting strategy on 2023-Apr-20 13:10:29.156712
# 0% : strategy...
# 0% : +-For each measurement set...
# 0% : +-+-Processing measurement set /stimela_mount/msdir/1619838073_sdp_l0-2207_5806-corr.ms...
# 0% : +-+-+-For each baseline...
# 0% : +-+-+-+-Initializing...
# Could not allocate temporary file 'aoflagger-data.tmp': posix_fallocate returned 28.
# Tried to allocate 651685 MB.
# Disk could be full or filesystem could not support fallocate.
# Could not allocate temporary file 'aoflagger-flags.tmp': posix_fallocate returned 28.
# Tried to allocate 81460 MB.
# Disk could be full or filesystem could not support fallocate.
# terminate called after throwing an instance of 'std::runtime_error'
#   what():  Error: failed to write to reordered file! Check access rights and free disk space.
# Traceback (most recent call last):
#   File "/stimela_mount/code/run.py", line 45, in <module>
#     subprocess.check_call(shlex.split(_runc))
#   File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
#     raise CalledProcessError(retcode, cmd)
# subprocess.CalledProcessError: Command '['aoflagger', '-strategy', '/stimela_mount/input/firstpass_QUV.rfis', '-indirect-read', '-column', 'DATA', '-fields', '0', '/stimela_mount/msdir/1619838073_sdp_l0-2207_5806-corr.ms']' returned non-zero exit status -6
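
For reference, the return value 28 in the log above can be decoded with Python's standard `errno` module. This is a minimal sketch, not part of CARACal or aoflagger. It assumes a Linux host, where 28 maps to ENOSPC, and it assumes that "MB" in the log means MiB when converting to GiB. The two sizes are copied from the log.

```python
import errno
import os

# Values copied from the aoflagger log above.
FALLOCATE_RETURN = 28
REQUESTED_MB = {
    "aoflagger-data.tmp": 651685,
    "aoflagger-flags.tmp": 81460,
}


def describe_errno(code: int) -> str:
    """Return the symbolic name and the system message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Total scratch space aoflagger tried to preallocate for both temporary files.
total_mb = sum(REQUESTED_MB.values())

print(describe_errno(FALLOCATE_RETURN))
# Assumption: the log's "MB" are MiB, so dividing by 1024 gives GiB.
print(f"total requested: {total_mb} MB (about {total_mb / 1024:.0f} GiB)")
```

If the code decodes to ENOSPC, that would point at free space on the filesystem holding the temporary files rather than at permissions, although nothing in this log confirms which directory that is.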

Lo7flg2-6044233_readmode_error.txt

thanks! Pete

pjmac1105 commented 1 year ago

hi team - just wondering if anyone has had a chance to look at this? I need some guidance on whether this is something I am doing wrong, whether it is a CARACal issue, or whether there is something I need to chase with ILIFU.

Thanks! Pete

paoloserra commented 1 year ago

Sorry @pjmac1105, I personally have little experience with the indirect mode, and typically use auto on both 32k and 4k data on ILIFU and in other environments.

Have you tried that mode yourself to see whether it works well in a reasonable amount of time?

Concerning the indirect mode, I'm pinging @edeblok here, who based on #1088 I suppose uses that mode regularly.

pjmac1105 commented 1 year ago

Thanks @paoloserra! I've been using it in auto which defaults to baseline by baseline, and depending on the dataset it can take upwards of 2 days.

edeblok commented 1 year ago

"indirect" tries to write a temporary file on disk that it then uses for reading and writing during the process. My guess would be that aoflagger tries to write it in a location where it does not have sufficient permission. We've been using it without issue on our local machines, but there we have full control over permissions etc.

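One way to test the permission guess is to try a small preallocation in the candidate directory from the same environment that runs aoflagger. The sketch below is an illustration only: `probe_fallocate` is a hypothetical helper, and it relies on `os.posix_fallocate`, which Python 3 provides on Unix. It checks write permission and fallocate support, not whether the full volume aoflagger needs is available.

```python
import os
import tempfile


def probe_fallocate(directory: str, size_bytes: int = 1 << 20) -> str:
    """Try to preallocate a small scratch file in `directory`.

    Returns "ok" on success, otherwise a short description of the OSError
    (for example a permission error or an unsupported filesystem).
    """
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="fallocate-probe-") as tmp:
            os.posix_fallocate(tmp.fileno(), 0, size_bytes)
        return "ok"
    except OSError as exc:
        return f"failed: errno {exc.errno} ({exc.strerror})"


# Probe the current working directory of this process.
print(probe_fallocate(os.getcwd()))
```

Running this inside the container, in the directory aoflagger starts from, would show whether a small allocation succeeds there at all.
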
pjmac1105 commented 1 year ago

> My guess would be that aoflagger tries to write it in a location where it does not have sufficient permission.

Hi @edeblok, thanks :) That was my guess as well. It is good to know that (1) indirect does work and (2) this narrows down where to look. ILIFU have said there are no quotas or permission issues, but I shall double check.

Could I ask, do you see any issues with this worker script that might cause the problem?

flag__2:
  enable: true
  field: target
  label_in: corr
  flag_spw:
    enable: true
  flag_rfi:
    enable: true
    col: DATA
    flagger: aoflagger
    aoflagger:
      strategy: firstpass_QUV.rfis
      readmode: indirect

edeblok commented 1 year ago

Script looks fine to me. Note that https://aoflagger.readthedocs.io/en/latest/using_aoflagger.html mentions that "the current working directory will be used as a temporary storage location" and "these will take up a volume equal to the size of the measurement set". Perhaps that can help tracking this down.
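
Following the two quoted statements, a quick comparison of the measurement set size against the free space in the working directory could look like the sketch below. The helper names and the 10% safety margin are assumptions made for illustration, and the sketch does not determine which directory aoflagger actually uses as its working directory inside the container.

```python
import os
import shutil


def tree_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path` (an MS is a directory)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def has_room_for_copy(ms_path: str, work_dir: str, margin: float = 1.1) -> bool:
    """Return True if `work_dir` has free space for one MS-sized temporary copy.

    `margin` is an assumed safety factor, not a value from the aoflagger docs.
    """
    needed = tree_size_bytes(ms_path) * margin
    free = shutil.disk_usage(work_dir).free
    return free >= needed
```

For example, `has_room_for_copy("/path/to/data.ms", os.getcwd())` (with a placeholder path) returns False when the filesystem holding the working directory cannot hold a copy of the measurement set.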

Athanaseus commented 1 month ago

Please re-open if you are still experiencing the issue.