caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0
28 stars 6 forks source link

Run crashes after set_model #1557

Closed ktrehaeven closed 2 weeks ago

ktrehaeven commented 6 months ago

Hi

I'm trying to reduce meerkat uhf data but get the error attached when it completes the crosscal.set_model step of the pipeline. Not sure whats causing this behaviour. I'm running on the janis machine of the rhodes server. Any ideas?

Not sure but might be related to https://github.com/caracal-pipeline/caracal/issues/1533

log-caracal.txt

Athanaseus commented 2 weeks ago
**2023-12-13 10:22:05 CARACal.Stimela.set_model_cal-0 INFO: Starting container [set_model_cal_0-14047552536056017024556570835667]. Timeout set to -1. The container ID is printed below.
# running /usr/local/bin/singularity run --workdir /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989 --containall --userns  --bind /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989/stimela_parameter_files/set_model_cal_0-14047552536056017024556570835667.json:/stimela_mount/configfile:ro --bind /home/trehaeven/caracal_venv/lib/python3.8/site-packages/stimela/cargo/cab/crystalball/src:/stimela_mount/code:ro --bind /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989/passwd:/etc/passwd:rw --bind /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989/group:/etc/group:rw --bind /home/trehaeven/caracal_venv/bin/stimela_runscript:/singularity:ro --bind /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/msdir:/stimela_mount/msdir:rw --bind /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/input:/stimela_mount/input:ro --bind /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/output:/stimela_mount/output:rw --bind /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/output/tmp:/stimela_mount/output/tmp:rw /home/trehaeven/singularity/stimela_codex-africanus_1.6.7.sif /singularity
# INFO:    Converting SIF file to temporary sandbox...
# WARNING: underlay of /etc/localtime required more than 50 (106) bind mounts
# 2023-12-13 10:22:09 | INFO     | crystalball:_predict | Crystalball version 0.3.0
# 2023-12-13 10:22:14 | INFO     | wsclean:import_from_wsclean | /stimela_mount/input/1934-collapsed-uhf-cat.txt contains 7830 components
# 2023-12-13 10:22:14 | INFO     | wsclean:import_from_wsclean | Total flux of 7830 selected components is 18.317970 Jy
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | --------------------------------------------------
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | Budgeting
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | --------------------------------------------------
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | system RAM = 503.55 GB
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | nr of logical CPUs = 48
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | nr sources = 7830
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | nr rows    = 1634880
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | nr chans   = 4096
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | nr corrs   = 4
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | sources per chunk = 343 (auto settings)
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | rows per chunk    = 34375 (auto settings)
# 2023-12-13 10:22:14 | INFO     | budget:get_budget | expected memory usage = 251.77 GB
# 2023-12-13 10:22:14 | INFO     | crystalball:_predict | Field J1939-6342 DDID 0 rows 944320 chans 4096 corrs 4
# Successful read/write open of default-locked table /stimela_mount/msdir/MACSJ2140.2-2339_UHF-cal1.ms: 26 columns, 1634880 rows
# 
# [######################################### ] | 99% Complete (Estimate) |  2m 0s / ~ 2m 0s
# [######################################### ] | 99% Complete (Estimate) |  2m 5s / ~ 2m 5s
# [######################################### ] | 99% Complete (Estimate) |  2m10s / ~ 2m10s
# [######################################### ] | 99% Complete (Estimate) |  2m15s / ~ 2m15s
# [######################################### ] | 99% Complete (Estimate) |  2m20s / ~ 2m20s**
...
# [######################################### ] | 99% Complete (Estimate) |  6h20m / ~ 6h21m
# [######################################### ] | 99% Complete (Estimate) |  6h20m / ~ 6h21m
# [######################################### ] | 99% Complete (Estimate) |  6h20m / ~ 6h21m
# [######################################### ] | 99% Complete (Estimate) |  6h20m / ~ 6h21m
# [######################################### ] | 99% Complete (Estimate) |  6h21m / ~ 6h22m
# [######################################### ] | 99% Complete (Estimate) |  6h21m / ~ 6h22m
# [######################################### ] | 99% Complete (Estimate) |  6h22m / ~ 6h22m
# [######################################### ] | 99% Complete (Estimate) |  6h22m / ~ 6h23m
# INFO:    Cleaning up image...
# [######################################### ] | 99% Complete (Estimate) |  6h23m / ~ 6h23m
2023-12-13 16:45:41 CARACal.Stimela.set_model_cal-0 ERROR: /usr/local/bin/singularity run --workdir /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989 --containall --userns returns error code 1
2023-12-13 16:45:41 CARACal.Stimela.set_model_cal-0 ERROR: job failed at 2023-12-13 16:45:41.498229 after 6:23:35.548845
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR: Traceback (most recent call last):
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/stimela/recipe.py", line 713, in run
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:     job.run_job()
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/stimela/recipe.py", line 425, in run_job
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:     self.job.run(output_wrangler=self.apply_output_wranglers)
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/stimela/singularity.py", line 128, in run
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:     utils.xrun(f"{BINARY} run --workdir {self.execdir} --containall {extras}",
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/stimela/utils/xrun_poll.py", line 227, in xrun
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR:     raise StimelaCabRuntimeError("{} returns error code {}".format(command_name, status))
2023-12-13 16:45:42 CARACal.Stimela.set_model_cal-0 ERROR: stimela.utils.StimelaCabRuntimeError: /usr/local/bin/singularity run --workdir /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989 --containall --userns returns error code 1
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Completed jobs : ['version-mypipelinerun_crosscal_before-ms0']
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Remaining jobs : ['delay_cal-1gc1-0-0-primary', 'plotgains-K-0-0', 'gain_cal-1gc1-0-0-primary', 'plotgains-G-0-0', 'bp_cal-1gc1-0-0-primary', 'plotgains-B-0-0', 'apply_gains-fcal--0', 'auto_flagging-1gc1-0-0-primary', 'delay_cal-1gc1-1-0-primary', 'plotgains-K-0-0', 'gain_cal-1gc1-1-0-primary', 'plotgains-G-0-0', 'bp_cal-1gc1-1-0-primary', 'plotgains-B-0-0', 'delay_cal-1gc1-0-0-secondary', 'plotgains-K-0-1', 'gain_cal-1gc1-0-0-secondary', 'plotgains-G-0-1', 'apply_gains-gcal--0', 'auto_flagging-1gc1-0-0-secondary', 'delay_cal-1gc1-1-0-secondary', 'plotgains-K-0-1', 'copy_primary_gains_gaincal_for_Ftable-1gc1-0-0-secondary', 'gaincal_for_Ftable-1gc1-0-0-secondary', 'transfer_fluxscale-1gc1-0', 'plotgains-F-0-', 'apply_gains-gcal--0', 'image-1gc1-0-0-secondary-field1', 'apply_gains-bpcal-1gc1-0', 'apply_gains-gcal-1gc1-0', 'save-mypipelinerun_crosscal_after-ms0', 'summary-1gc1-0']
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: delay_cal-1gc1-0-0-primary:: K calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-K-0-0:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: gain_cal-1gc1-0-0-primary:: G calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-G-0-0:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: bp_cal-1gc1-0-0-primary:: B calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-B-0-0:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: apply_gains-fcal--0::Apply gain tables
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: auto_flagging-1gc1-0-0-primary::
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: delay_cal-1gc1-1-0-primary:: K calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-K-0-0:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: gain_cal-1gc1-1-0-primary:: G calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-G-0-0:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: bp_cal-1gc1-1-0-primary:: B calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-B-0-0:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: delay_cal-1gc1-0-0-secondary:: K calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-K-0-1:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: gain_cal-1gc1-0-0-secondary:: G calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-G-0-1:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: apply_gains-gcal--0::Apply gain tables
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: auto_flagging-1gc1-0-0-secondary::
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: delay_cal-1gc1-1-0-secondary:: K calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-K-0-1:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: gaincal_for_Ftable-1gc1-0-0-secondary:: Copy parimary gains
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: gaincal_for_Ftable-1gc1-0-0-secondary:: F calibration
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: Transfer fluxscale
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: plotgains-F-0-:: Plot gaincal phase
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: apply_gains-gcal--0::Apply gain tables
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: image-1gc1-0-0-secondary-field1:: Image secondary field
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: apply_gains-bpcal-1gc1-0::Apply gain tables
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: apply_gains-gcal-1gc1-0::Apply gain tables
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: save-mypipelinerun_crosscal_after-ms0:: Save flag version
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Logging remaining task: summary-1gc1-0:: Flagging summary  ms=MACSJ2140.2-2339_UHF-cal1.ms
2023-12-13 16:45:42 CARACal.Stimela.crosscal INFO: Saving pipeline information in .last_crosscal.json
2023-12-13 16:45:42 CARACal ERROR: Job 'set_model_cal-0:: Set jansky ms=MACSJ2140.2-2339_UHF-cal1.ms' failed: /usr/local/bin/singularity run --workdir /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989 --containall --userns returns error code 1 [PipelineException]
2023-12-13 16:45:42 CARACal INFO:   More information can be found in the logfile at output/logs-20231213-102054/log-caracal.txt
2023-12-13 16:45:42 CARACal INFO:   You are running version 1.1.3
2023-12-13 16:45:42 CARACal ERROR: Traceback (most recent call last):
2023-12-13 16:45:42 CARACal ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/caracal/main.py", line 189, in __run
2023-12-13 16:45:42 CARACal ERROR:     pipeline.run_workers()
2023-12-13 16:45:42 CARACal ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/caracal/workers/worker_administrator.py", line 436, in run_workers
2023-12-13 16:45:42 CARACal ERROR:     worker.worker(self, recipe, config)
2023-12-13 16:45:42 CARACal ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/caracal/workers/crosscal_worker.py", line 729, in worker
2023-12-13 16:45:42 CARACal ERROR:     recipe.run()
2023-12-13 16:45:42 CARACal ERROR:   File "/home/trehaeven/caracal_venv/lib/python3.8/site-packages/stimela/recipe.py", line 764, in run
2023-12-13 16:45:42 CARACal ERROR:     raise PipelineException(exc, self.completed, job, self.remaining) from None
2023-12-13 16:45:42 CARACal ERROR: stimela.exceptions.PipelineException: Job 'set_model_cal-0:: Set jansky ms=MACSJ2140.2-2339_UHF-cal1.ms' failed: /usr/local/bin/singularity run --workdir /home/trehaeven/MACSJ2140.2-2339_UHF.same_caracal_as_L/.stimela_workdir-1702455655271989 --containall --userns returns error code 1
2023-12-13 16:45:42 CARACal INFO: exiting with error code 1
Athanaseus commented 2 weeks ago

Please also share the set_model_cal-0 or crystalball

ktrehaeven commented 2 weeks ago

log-crosscal-set_model_cal-0-20231213-204705.txt

I suspect its a problem with janis. I havent been able to do much of anything on janis over the past few months. or at least my environment on janis.

Athanaseus commented 2 weeks ago

Thanks for the feedback @ktrehaeven. I'm closing then, and if a related one comes up, feel free to report it.