caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0
28 stars 6 forks source link

Pipeline should crash when meqserver dies #1316

Closed KshitijT closed 2 years ago

KshitijT commented 3 years ago

While running CARACal on data from two different observations (for the same source) one of them had a silent crash of meqserver while predicting the BPCAL (0408) custom model. As you can see below, the pipeline continues pretty nonchalantly to delaycal :

# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c9c490>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c2a7d0>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c2a5d0>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c2a7d0>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c9c790>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c2a7d0>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c2a4d0>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7f6510c2a7d0>
# ###  TDL script successfully compiled. 7197 node definitions
#   (of which 2 are root nodes) sent to meqserver.
# ### Running TDL job "_simulate_MS"
# Traceback (most recent call last):
#   File "/usr/bin/meqtree-pipeliner.py", line 176, in <module>
#     res = func(mqs,None,wait=True);
#   File "/usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py", line 239, in _simulate_MS
#     mqs.execute('VisDataMux',mssel.create_io_request(),wait=wait);
#   File "/usr/lib/python2.7/dist-packages/Timba/Apps/meqserver.py", line 173, in execute
#     return self.meq('Node.Execute',rec,wait=wait);
#   File "/usr/lib/python2.7/dist-packages/Timba/Apps/meqserver.py", line 126, in meq
#     msg = self.await(replyname,resume=True,timeout=wait);
#   File "/usr/lib/python2.7/dist-packages/Timba/Apps/multiapp_proxy.py", line 524, in await
#     raise RuntimeError,"lost all connections while waiting for event "+str(what);
# RuntimeError: lost all connections while waiting for event Result.Node.execute.1
# ### Job terminated with exception:
# ### No more commands
# ### The meqserver appears to have died on us :( Please check for core files and such.
# ### All your batch are not belong to us, returning with error code
# 2021-02-19 12:49:22: meqtree-pipeliner.py exited with code 1
2021-02-19 14:49:22 CARACal.Stimela.set_model_cal-0 INFO: job complete at 2021-02-19 14:49:22.816941 after 0:13:29.051764
2021-02-19 14:49:22 CARACal.Stimela.delay_cal-1gc1-0-0-primary INFO: job started at 2021-02-19 14:49:22.834746
2021-02-19 14:49:22 CARACal.Stimela.crosscal INFO: STEP 3 :: delay_cal-1gc1-0-0-primary:: K calibration
2021-02-19 14:49:22 CARACal.Stimela INFO: Validating parameters for cab casa_gaincal (stimela/casa:1.6.3)
2021-02-19 14:49:22 CARACal.Stimela INFO: Location of 'caltable' was specified as 'output'. Will overide default.
2021-02-19 14:49:22 CARACal.Stimela INFO: Parameters validated and saved to /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/.stimela_workdir-16137381260674815/stimela_parameter_files/delay_cal_1gc1_0_0_primary-13977077200232816137381262443593.json
2021-02-19 14:49:22 CARACal.Stimela.delay_cal-1gc1-0-0-primary INFO: Starting container [delay_cal_1gc1_0_0_primary-13977077200232816137381262443593]. Timeout set to -1. The container ID is printed below.
# running cd /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/.stimela_workdir-16137381260674815 && singularity run --workdir /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/.stimela_workdir-16137381260674815 --containall  --bind /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/.stimela_workdir-16137381260674815/stimela_parameter_files/delay_cal_1gc1_0_0_primary-13977077200232816137381262443593.json:/stimela_mount/configfile:ro --bind /scratch/users/kshitij/caracal_release_venv3/lib/python3.6/site-packages/stimela/cargo/cab/casa_gaincal/src:/stimela_mount/code:ro --bind /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/.stimela_workdir-16137381260674815/passwd:/etc/passwd:rw --bind /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/.stimela_workdir-16137381260674815/group:/etc/group:rw --bind /scratch/users/kshitij/caracal_release_venv3/bin/stimela_runscript:/singularity:ro --bind /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/msdir:/stimela_mount/msdir:rw --bind /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/input:/stimela_mount/input:ro --bind /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/output_february_final_final_trimmed/caltables:/stimela_mount/output:rw --bind /scratch/users/kshitij/X-shaped-galaxy/Sarah_Gleam_source/output_february_final_final_trimmed/caltables/tmp:/stimela_mount/output/tmp:rw /software/astro/caracal/STIMELA_IMAGES_1.6.8/stimela_casa_1.6.3.sif /singularity
# 2021-02-19 12:49:35: Initial memory state:
# 2021-02-19 12:49:35:               total        used        free      shared  buff/cache   available
# 2021-02-19 12:49:35: Mem:           236G        1.7G        231G        1.1M        3.2G        232G
# 2021-02-19 12:49:35: Swap:            0B          0B          0B
#
# =========================================
# The start-up time of CASA may vary
# depending on whether the shared libraries
# are cached or not.
# =========================================

Cue imaging of the target - and several days of debugging (to my shame) - when I find out that the two images have a completely different flux scale for the same sources (@o-smirnov , it reminds me of the issue you had with the Open time observations).

Can we have a hard crash of the pipeline when this happens? Any suggestions on how to implement it?

o-smirnov commented 3 years ago

This probably needs a simple fix in the Stimela cab. Which cab actually does this? It's quite easy to propagate errors out of the cab now.

KshitijT commented 3 years ago

This probably needs a simple fix in the Stimela cab. Which cab actually does this? It's quite easy to propagate errors out of the cab now.

Must be It is the simulator cab.

KshitijT commented 3 years ago

Meanwhile, reducing the tile-size helps. :)

KshitijT commented 3 years ago

Thanks for the suggestion @o-smirnov !

paoloserra commented 2 years ago

Just reporting that more users are encountering this problem. It would be nice to fix this soon.

bennahugo commented 2 years ago

To be honest here - and this has been my suggestion for a long time:

Can we please drop the lsms here (delete them from the repository) and replace with crystalball and the models I sent earlier this year for pks 1934 and pks 0407 in L and U band? I'm going to work with sphe next week to fix some of these niggles. So I suggest we drop that step and replace.

Cheers,

On Thu, Oct 28, 2021 at 2:02 PM paoloserra @.***> wrote:

Just reporting that more users are encountering this problem. It would be nice to fix this soon.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/caracal-pipeline/caracal/issues/1316#issuecomment-953778338, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB4RE6UVWPWVPKFP6SPERY3UJFCWRANCNFSM4ZGHYBUA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

--

Benjamin Hugo

PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University

Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town

paoloserra commented 2 years ago

Thanks for the comments @bennahugo . I think that we've got an issue on replacing the lsms, see https://github.com/caracal-pipeline/caracal/issues/1357 .

While that improvement is being implemented, if there is a way to fix the present issue without spending too much time on it that would be good for current users.

KshitijT commented 2 years ago

While that improvement is being implemented, if there is a way to fix the present issue without spending too much time on it that would be good for current users.

Keep a really tiny default tile-size?

paoloserra commented 2 years ago

Fine with me, since it's just a temporary patch. Maybe we can discuss at the telecon next week.

KshitijT commented 2 years ago

Reminder to discuss this at today's telecon.

KshitijT commented 2 years ago

@SpheMakh does the new stimela release fix this? Otherwise I suggest we take it up for the next milestone (rather than 1.0.6).

KshitijT commented 2 years ago

@SpheMakh , I get a weird error with Stimela 1.7.3 now (when I use the simulator cab to predict):

2022-02-08 14:17:33 CARACal.Stimela.set_model_cal-0 INFO: job started at 2022-02-08 14:17:33.560000
# WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
# WARNING: passwd file doesn't exist in container, not updating
# WARNING: group file doesn't exist in container, not updating
# open /.singularity.d/env: permission denied
2022-02-08 14:17:33 CARACal.Stimela.set_model_cal-0 ERROR: cd /idia/users/kshitij/3C403/.stimela_workdir-16443226070845108 && singularity returns error code 1
2022-02-08 14:17:33 CARACal.Stimela.set_model_cal-0 ERROR: job failed at 2022-02-08 14:17:33.803227 after 0:00:00.243227
2022-02-08 14:17:33 CARACal ERROR: Job 'set_model_cal-0:: Set jansky ms=1565287561_sdp_l0-cal.ms' failed: cd /idia/users/kshitij/3C403/.stimela_workdir-16443226070845108 && singularity returns error code 1 [PipelineException]
paoloserra commented 2 years ago

That's weird, in my latest CARACal runs on ilifu I don't see this.

Do you get this error always, or does it only happen with the settings that were giving you the original error (i.e., with large tile size)?

Here's my log.


2022-02-03 10:24:11 CARACal INFO: Invoked as /idia/projects/fornax/caracal-venv/bin/caracal -c xcal.yml -sid /software/astro/caracal/STIMELA_IMAGES_1.7.3/
[...]
2022-02-03 10:24:12 CARACal INFO:   crosscal:
[...]
2022-02-03 10:24:12 CARACal INFO:     set_model:
2022-02-03 10:24:12 CARACal INFO:       enable:                       True
2022-02-03 10:24:12 CARACal INFO:       meerkat_band:                 L
2022-02-03 10:24:12 CARACal INFO:       meerkat_skymodel:             True
2022-02-03 10:24:12 CARACal INFO:       meerkat_crystalball_skymodel: False
2022-02-03 10:24:12 CARACal INFO:       meerkat_crystalball_memory_fraction:0.5
2022-02-03 10:24:12 CARACal INFO:       meerkat_crystalball_num_sources:0
2022-02-03 10:24:12 CARACal INFO:       meerkat_crystalball_row_chunks:0
2022-02-03 10:24:12 CARACal INFO:       no_verify:                    False
2022-02-03 10:24:12 CARACal INFO:       field:                        fcal
2022-02-03 10:24:12 CARACal INFO:       tile_size:                    16
2022-02-03 10:24:12 CARACal INFO:       threads:                      8
[...]
2022-02-03 11:12:35 CARACal.Stimela.set_model_cal-0 INFO: job started at 2022-02-03 11:12:35.964135
2022-02-03 11:12:35 CARACal.Stimela.crosscal INFO: STEP 2 :: set_model_cal-0:: Set jansky ms=1643392282_sdp_l0-cal.ms
2022-02-03 11:12:35 CARACal.Stimela INFO: Validating parameters for cab simulator (stimela/meqtrees:1.7.2)
2022-02-03 11:12:35 CARACal.Stimela INFO: Parameters validated and saved to /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/stimela_parameter_files/set_model_cal_0-14053199808856016438795406910446.json
2022-02-03 11:12:35 CARACal.Stimela.set_model_cal-0 INFO: Starting container [set_model_cal_0-14053199808856016438795406910446]. Timeout set to -1. The container ID is printed below.
# running cd /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865 && singularity run --workdir /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865 --containall  --bind /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/stimela_parameter_files/set_model_cal_0-14053199808856016438795406910446.json:/stimela_mount/configfile:ro --bind /idia/projects/fornax/caracal-venv/lib/python3.8/site-packages/stimela/cargo/cab/simulator/src:/stimela_mount/code:ro --bind /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/passwd:/etc/passwd:rw --bind /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/group:/etc/group:rw --bind /idia/projects/fornax/caracal-venv/bin/stimela_runscript:/singularity:ro --bind /scratch3/projects/fornax/data_reduction/mfs46/msdir:/stimela_mount/msdir:rw --bind /scratch3/projects/fornax/data_reduction/mfs46/input:/stimela_mount/input:ro --bind /scratch3/projects/fornax/data_reduction/mfs46/output:/stimela_mount/output:rw --bind /scratch3/projects/fornax/data_reduction/mfs46/output/tmp:/stimela_mount/output/tmp:rw /software/astro/caracal/STIMELA_IMAGES_1.7.3/stimela_meqtrees_1.7.2.sif /singularity
# WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
# 2022-02-03 11:12:37: Initial memory state:
# 2022-02-03 11:12:37:               total        used        free      shared  buff/cache   available
# 2022-02-03 11:12:37: Mem:           236G        1.7G         91G        1.3M        142G        232G
# 2022-02-03 11:12:37: Swap:            0B          0B          0B
# 2022-02-03 11:12:37: Running meqtree-pipeliner.py  --mt 8 -c /stimela_mount/code/tdlconf.profiles [sim] ms_sel.ddid_index=0 ms_sel.field_index=0 me.sky.tiggerskymodel=1 tiggerlsm.filename=/stimela_mount/input/pks0407-65.lsm ms_sel.msname=/stimela_mount/msdir/1643392282_sdp_l0-cal.ms sim_mode="sim only" ms_sel.input_column=DATA ms_sel.output_column=MODEL_DATA me.use_smearing=0 ms_sel.tile_size=16 /usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py =_simulate_MS
# ### Starting meqserver
# ('### Attaching to configuration file', '/stimela_mount/code/tdlconf.profiles')
# ('### Loading config section', 'sim')
# ### Setting option ms_sel.ddid_index=0
# ### Setting option ms_sel.field_index=0
# ### Setting option me.sky.tiggerskymodel=1
# ### Setting option tiggerlsm.filename=/stimela_mount/input/pks0407-65.lsm
# ### Setting option ms_sel.msname=/stimela_mount/msdir/1643392282_sdp_l0-cal.ms
# ### Setting option sim_mode=sim only
# ### Setting option ms_sel.input_column=DATA
# ### Setting option ms_sel.output_column=MODEL_DATA
# ### Setting option me.use_smearing=0
# ### Setting option ms_sel.tile_size=16
# ('### Compiling', '/usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py')
# ### (using previously set options)
#   (Meow.MSUtils: found /usr/bin/lwimager, can use it for imaging.)
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/ANTENNA: 8 columns, 64 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/OBSERVATION: 9 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/POLARIZATION: 4 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/DATA_DESCRIPTION: 3 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/SPECTRAL_WINDOW: 14 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/FIELD: 9 columns, 2 rows
#   (Meow.MSUtils: found /usr/bin/lwimager, can use it for imaging.)
# Using LSM module from Tigger (1.4.2) 1.4.2 at /usr/local/lib/python3.6/dist-packages/Tigger (in path)
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Loading /stimela_mount/input/pks0407-65.lsm: ASCII table
# ('### ', 'TDL script successfully compiled. 8649 node definitions\n  (of which 2 are root nodes) sent to meqserver.')
# ### Running TDL job "_simulate_MS"
# /usr/local/lib/python3.6/dist-packages/matplotlib/backends/qt_editor/figureoptions.py:11: MatplotlibDeprecationWarning:
# The support for Qt4  was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
#   from matplotlib.backends.qt_compat import QtGui
# ('### Job result:', None)
# ### Stopping the meqserver
# ### All your batch are belong to us. Bye!
2022-02-03 11:22:37 CARACal.Stimela.set_model_cal-0 INFO: job complete at 2022-02-03 11:22:37.397696 after 0:10:01.433561
KshitijT commented 2 years ago

Hmmm, no, the error I got was tile-size agnostic.

paoloserra commented 2 years ago

ok, so why am I not getting it?!?

SpheMakh commented 2 years ago

I can't reproduce the error either, even on ilifu.

SpheMakh commented 2 years ago

Stimela also crashes when meqtrees does

# ### Running TDL job "_simulate_MS"
# /usr/local/lib/python3.6/dist-packages/matplotlib/backends/qt_editor/figureoptions.py:11: MatplotlibDeprecationWarning:
# The support for Qt4  was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
#   from matplotlib.backends.qt_compat import QtGui
# ('### Job result:', None)
# ### meqserver reported 1 error(s) during the run:
# ###   000: node 'VisDataMux': execute() failed: input dataset missing a footer (return code 0x810021)
# ### Stopping the meqserver
# ### All your batch are not belong to us, returning with error code
# 2022-02-08 17:01:43: meqtree-pipeliner.py exited with code 1
# Traceback (most recent call last):
#   File "/stimela_mount/code/run.py", line 168, in <module>
#     raise RuntimeError(f"Meqtrees failed with exit code {exitcode}. See logs for more details")
# RuntimeError: Meqtrees failed with exit code Command '['meqtree-pipeliner.py', '--mt', '4', '-c', '/stimela_mount/code/tdlconf.profiles', '[sim]', 'ms_sel.ddid_index=0', 'ms_sel.field_index=0', 'me.sky.tiggerskymodel=1', 'tiggerlsm.filename=/stimela_mount/input/nvss1deg.lsm.html', 'ms_sel.msname=/stimela_mount/msdir/meerkat_simulation_example.ms', 'sim_mode=sim only', 'ms_sel.input_column=DATA', 'ms_sel.output_column=DATA', 'me.use_smearing=0', 'ms_sel.tile_size=64', 'me.g_enable=1', 'oms_gain_models.err-gain.error_model=SineError', 'oms_gain_models.err-gain.max_period=30', 'oms_gain_models.err-gain.maxval=1.2', 'oms_gain_models.err-gain.min_period=5', 'oms_gain_models.err-gain.minval=0.8', 'oms_gain_models.err-phase.error_model=SineError', 'oms_gain_models.err-phase.max_period=30', 'oms_gain_models.err-phase.maxerr=1.2', 'oms_gain_models.err-phase.min_period=5', 'oms_gain_models.err-phase.minval=0.8', '/usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py', '=_simulate_MS']' returned non-zero exit status 1.. See logs for more details
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: cd /home/sphemakh/work/softdev/Stimela/examples/.stimela_workdir-16443324842710488 && singularity returns error code 1
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: job failed at 2022-02-08 17:01:43.654532 after 0:00:19.019623
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: Traceback (most recent call last):
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:   File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/recipe.py", line 713, in run
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:     job.run_job()
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:   File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/recipe.py", line 425, in run_job
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:     self.job.run(output_wrangler=self.apply_output_wranglers)
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:   File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/singularity.py", line 124, in run
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:     utils.xrun(f"cd {self.execdir} && singularity", ["run", "--workdir", self.execdir, "--containall"] \
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:   File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/utils/xrun_poll.py", line 227, in xrun
2022-02-08 17:01:43 STIMELA.simulator_example ERROR:     raise StimelaCabRuntimeError("{} returns error code {}".format(command_name, status))
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: stimela.utils.StimelaCabRuntimeError: cd /home/sphemakh/work/softdev/Stimela/examples/.stimela_workdir-16443324842710488 && singularity returns error code 1
2022-02-08 17:01:43 STIMELA.Simulation Example INFO: Completed jobs : []
2022-02-08 17:01:43 STIMELA.Simulation Example INFO: Remaining jobs : ['cal_example', 'imager_example_robust_0', 'imager_example_robust_1', 'imager_example_robust_2', 'delete_ms']
KshitijT commented 2 years ago

Hmmm, then by consensus, let's call this a "feature" of my setup on Ilifu and close the issue. :)