Closed KshitijT closed 2 years ago
This probably needs a simple fix in the Stimela cab. Which cab actually does this? It's quite easy to propagate errors out of the cab now.
This probably needs a simple fix in the Stimela cab. Which cab actually does this? It's quite easy to propagate errors out of the cab now.
Must be It is the simulator cab.
Meanwhile, reducing the tile-size helps. :)
Thanks for the suggestion @o-smirnov !
Just reporting that more users are encountering this problem. It would be nice to fix this soon.
To be honest here - and this has been my suggestion for a long time:
Can we please drop the lsms here (delete them from the repository) and replace with crystalball and the models I sent earlier this year for pks 1934 and pks 0407 in L and U band? I'm going to work with sphe next week to fix some of these niggles. So I suggest we drop that step and replace.
Cheers,
On Thu, Oct 28, 2021 at 2:02 PM paoloserra @.***> wrote:
Just reporting that more users are encountering this problem. It would be nice to fix this soon.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/caracal-pipeline/caracal/issues/1316#issuecomment-953778338, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB4RE6UVWPWVPKFP6SPERY3UJFCWRANCNFSM4ZGHYBUA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Benjamin Hugo
PhD. student, Centre for Radio Astronomy Techniques and Technologies Department of Physics and Electronics Rhodes University
Junior software developer Radio Astronomy Research Group South African Radio Astronomy Observatory Black River Business Park Observatory Cape Town
Thanks for the comments @bennahugo . I think that we've got an issue on replacing the lsms, see https://github.com/caracal-pipeline/caracal/issues/1357 .
While that improvement is being implemented, if there is a way to fix the present issue without spending too much time on it that would be good for current users.
While that improvement is being implemented, if there is a way to fix the present issue without spending too much time on it that would be good for current users.
Keep a really tiny default tile-size?
Fine with me, since it's just a temporary patch. Maybe we can discuss at the telecon next week.
Reminder to discuss this at today's telecon.
@SpheMakh does the new stimela release fix this? Otherwise I suggest we take it up for the next milestone (rather than 1.0.6).
@SpheMakh , I get a weird error with Stimela 1.7.3 now (when I use the simulator
cab to predict):
2022-02-08 14:17:33 CARACal.Stimela.set_model_cal-0 INFO: job started at 2022-02-08 14:17:33.560000
# WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
# WARNING: passwd file doesn't exist in container, not updating
# WARNING: group file doesn't exist in container, not updating
# open /.singularity.d/env: permission denied
2022-02-08 14:17:33 CARACal.Stimela.set_model_cal-0 ERROR: cd /idia/users/kshitij/3C403/.stimela_workdir-16443226070845108 && singularity returns error code 1
2022-02-08 14:17:33 CARACal.Stimela.set_model_cal-0 ERROR: job failed at 2022-02-08 14:17:33.803227 after 0:00:00.243227
2022-02-08 14:17:33 CARACal ERROR: Job 'set_model_cal-0:: Set jansky ms=1565287561_sdp_l0-cal.ms' failed: cd /idia/users/kshitij/3C403/.stimela_workdir-16443226070845108 && singularity returns error code 1 [PipelineException]
That's weird, in my latest CARACal runs on ilifu I don't see this.
Do you get this error always, or does it only happen with the settings that were giving you the original error (i.e., with large tile size)?
Here's my log.
2022-02-03 10:24:11 CARACal INFO: Invoked as /idia/projects/fornax/caracal-venv/bin/caracal -c xcal.yml -sid /software/astro/caracal/STIMELA_IMAGES_1.7.3/
[...]
2022-02-03 10:24:12 CARACal INFO: crosscal:
[...]
2022-02-03 10:24:12 CARACal INFO: set_model:
2022-02-03 10:24:12 CARACal INFO: enable: True
2022-02-03 10:24:12 CARACal INFO: meerkat_band: L
2022-02-03 10:24:12 CARACal INFO: meerkat_skymodel: True
2022-02-03 10:24:12 CARACal INFO: meerkat_crystalball_skymodel: False
2022-02-03 10:24:12 CARACal INFO: meerkat_crystalball_memory_fraction:0.5
2022-02-03 10:24:12 CARACal INFO: meerkat_crystalball_num_sources:0
2022-02-03 10:24:12 CARACal INFO: meerkat_crystalball_row_chunks:0
2022-02-03 10:24:12 CARACal INFO: no_verify: False
2022-02-03 10:24:12 CARACal INFO: field: fcal
2022-02-03 10:24:12 CARACal INFO: tile_size: 16
2022-02-03 10:24:12 CARACal INFO: threads: 8
[...]
2022-02-03 11:12:35 CARACal.Stimela.set_model_cal-0 INFO: job started at 2022-02-03 11:12:35.964135
2022-02-03 11:12:35 CARACal.Stimela.crosscal INFO: STEP 2 :: set_model_cal-0:: Set jansky ms=1643392282_sdp_l0-cal.ms
2022-02-03 11:12:35 CARACal.Stimela INFO: Validating parameters for cab simulator (stimela/meqtrees:1.7.2)
2022-02-03 11:12:35 CARACal.Stimela INFO: Parameters validated and saved to /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/stimela_parameter_files/set_model_cal_0-14053199808856016438795406910446.json
2022-02-03 11:12:35 CARACal.Stimela.set_model_cal-0 INFO: Starting container [set_model_cal_0-14053199808856016438795406910446]. Timeout set to -1. The container ID is printed below.
# running cd /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865 && singularity run --workdir /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865 --containall --bind /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/stimela_parameter_files/set_model_cal_0-14053199808856016438795406910446.json:/stimela_mount/configfile:ro --bind /idia/projects/fornax/caracal-venv/lib/python3.8/site-packages/stimela/cargo/cab/simulator/src:/stimela_mount/code:ro --bind /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/passwd:/etc/passwd:rw --bind /scratch3/projects/fornax/data_reduction/mfs46/.stimela_workdir-16438795405542865/group:/etc/group:rw --bind /idia/projects/fornax/caracal-venv/bin/stimela_runscript:/singularity:ro --bind /scratch3/projects/fornax/data_reduction/mfs46/msdir:/stimela_mount/msdir:rw --bind /scratch3/projects/fornax/data_reduction/mfs46/input:/stimela_mount/input:ro --bind /scratch3/projects/fornax/data_reduction/mfs46/output:/stimela_mount/output:rw --bind /scratch3/projects/fornax/data_reduction/mfs46/output/tmp:/stimela_mount/output/tmp:rw /software/astro/caracal/STIMELA_IMAGES_1.7.3/stimela_meqtrees_1.7.2.sif /singularity
# WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
# 2022-02-03 11:12:37: Initial memory state:
# 2022-02-03 11:12:37: total used free shared buff/cache available
# 2022-02-03 11:12:37: Mem: 236G 1.7G 91G 1.3M 142G 232G
# 2022-02-03 11:12:37: Swap: 0B 0B 0B
# 2022-02-03 11:12:37: Running meqtree-pipeliner.py --mt 8 -c /stimela_mount/code/tdlconf.profiles [sim] ms_sel.ddid_index=0 ms_sel.field_index=0 me.sky.tiggerskymodel=1 tiggerlsm.filename=/stimela_mount/input/pks0407-65.lsm ms_sel.msname=/stimela_mount/msdir/1643392282_sdp_l0-cal.ms sim_mode="sim only" ms_sel.input_column=DATA ms_sel.output_column=MODEL_DATA me.use_smearing=0 ms_sel.tile_size=16 /usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py =_simulate_MS
# ### Starting meqserver
# ('### Attaching to configuration file', '/stimela_mount/code/tdlconf.profiles')
# ('### Loading config section', 'sim')
# ### Setting option ms_sel.ddid_index=0
# ### Setting option ms_sel.field_index=0
# ### Setting option me.sky.tiggerskymodel=1
# ### Setting option tiggerlsm.filename=/stimela_mount/input/pks0407-65.lsm
# ### Setting option ms_sel.msname=/stimela_mount/msdir/1643392282_sdp_l0-cal.ms
# ### Setting option sim_mode=sim only
# ### Setting option ms_sel.input_column=DATA
# ### Setting option ms_sel.output_column=MODEL_DATA
# ### Setting option me.use_smearing=0
# ### Setting option ms_sel.tile_size=16
# ('### Compiling', '/usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py')
# ### (using previously set options)
# (Meow.MSUtils: found /usr/bin/lwimager, can use it for imaging.)
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/ANTENNA: 8 columns, 64 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/OBSERVATION: 9 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/POLARIZATION: 4 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/DATA_DESCRIPTION: 3 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/SPECTRAL_WINDOW: 14 columns, 1 rows
# Successful readonly open of autonoread-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms/FIELD: 9 columns, 2 rows
# (Meow.MSUtils: found /usr/bin/lwimager, can use it for imaging.)
# Using LSM module from Tigger (1.4.2) 1.4.2 at /usr/local/lib/python3.6/dist-packages/Tigger (in path)
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Successful readonly open of default-locked table /stimela_mount/msdir/1643392282_sdp_l0-cal.ms: 23 columns, 330720 rows
# Loading /stimela_mount/input/pks0407-65.lsm: ASCII table
# ('### ', 'TDL script successfully compiled. 8649 node definitions\n (of which 2 are root nodes) sent to meqserver.')
# ### Running TDL job "_simulate_MS"
# /usr/local/lib/python3.6/dist-packages/matplotlib/backends/qt_editor/figureoptions.py:11: MatplotlibDeprecationWarning:
# The support for Qt4 was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
# from matplotlib.backends.qt_compat import QtGui
# ('### Job result:', None)
# ### Stopping the meqserver
# ### All your batch are belong to us. Bye!
2022-02-03 11:22:37 CARACal.Stimela.set_model_cal-0 INFO: job complete at 2022-02-03 11:22:37.397696 after 0:10:01.433561
Hmmm, no, the error I got was tile-size agnostic.
ok, so why am I not getting it?!?
I can't reproduce the error either, even on ilifu.
Stimela also crashes when meqtrees does
# ### Running TDL job "_simulate_MS"
# /usr/local/lib/python3.6/dist-packages/matplotlib/backends/qt_editor/figureoptions.py:11: MatplotlibDeprecationWarning:
# The support for Qt4 was deprecated in Matplotlib 3.3 and will be removed two minor releases later.
# from matplotlib.backends.qt_compat import QtGui
# ('### Job result:', None)
# ### meqserver reported 1 error(s) during the run:
# ### 000: node 'VisDataMux': execute() failed: input dataset missing a footer (return code 0x810021)
# ### Stopping the meqserver
# ### All your batch are not belong to us, returning with error code
# 2022-02-08 17:01:43: meqtree-pipeliner.py exited with code 1
# Traceback (most recent call last):
# File "/stimela_mount/code/run.py", line 168, in <module>
# raise RuntimeError(f"Meqtrees failed with exit code {exitcode}. See logs for more details")
# RuntimeError: Meqtrees failed with exit code Command '['meqtree-pipeliner.py', '--mt', '4', '-c', '/stimela_mount/code/tdlconf.profiles', '[sim]', 'ms_sel.ddid_index=0', 'ms_sel.field_index=0', 'me.sky.tiggerskymodel=1', 'tiggerlsm.filename=/stimela_mount/input/nvss1deg.lsm.html', 'ms_sel.msname=/stimela_mount/msdir/meerkat_simulation_example.ms', 'sim_mode=sim only', 'ms_sel.input_column=DATA', 'ms_sel.output_column=DATA', 'me.use_smearing=0', 'ms_sel.tile_size=64', 'me.g_enable=1', 'oms_gain_models.err-gain.error_model=SineError', 'oms_gain_models.err-gain.max_period=30', 'oms_gain_models.err-gain.maxval=1.2', 'oms_gain_models.err-gain.min_period=5', 'oms_gain_models.err-gain.minval=0.8', 'oms_gain_models.err-phase.error_model=SineError', 'oms_gain_models.err-phase.max_period=30', 'oms_gain_models.err-phase.maxerr=1.2', 'oms_gain_models.err-phase.min_period=5', 'oms_gain_models.err-phase.minval=0.8', '/usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py', '=_simulate_MS']' returned non-zero exit status 1.. See logs for more details
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: cd /home/sphemakh/work/softdev/Stimela/examples/.stimela_workdir-16443324842710488 && singularity returns error code 1
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: job failed at 2022-02-08 17:01:43.654532 after 0:00:19.019623
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: Traceback (most recent call last):
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/recipe.py", line 713, in run
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: job.run_job()
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/recipe.py", line 425, in run_job
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: self.job.run(output_wrangler=self.apply_output_wranglers)
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/singularity.py", line 124, in run
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: utils.xrun(f"cd {self.execdir} && singularity", ["run", "--workdir", self.execdir, "--containall"] \
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: File "/home/sphemakh/work/softdev/Stimela/examples/tenv/lib/python3.8/site-packages/stimela/utils/xrun_poll.py", line 227, in xrun
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: raise StimelaCabRuntimeError("{} returns error code {}".format(command_name, status))
2022-02-08 17:01:43 STIMELA.simulator_example ERROR: stimela.utils.StimelaCabRuntimeError: cd /home/sphemakh/work/softdev/Stimela/examples/.stimela_workdir-16443324842710488 && singularity returns error code 1
2022-02-08 17:01:43 STIMELA.Simulation Example INFO: Completed jobs : []
2022-02-08 17:01:43 STIMELA.Simulation Example INFO: Remaining jobs : ['cal_example', 'imager_example_robust_0', 'imager_example_robust_1', 'imager_example_robust_2', 'delete_ms']
Hmmm, then by consensus, let's call this a "feature" of my setup on Ilifu and close the issue. :)
While running CARACal on data from two different observations (for the same source) one of them had a silent crash of meqserver while predicting the BPCAL (0408) custom model. As you can see below, the pipeline continues pretty nonchalantly to delaycal :
Cue imaging of the target - and several days of debugging (to my shame) - when I find out that the two images have a completely different flux scale for the same sources (@o-smirnov , it reminds me of the issue you had with the Open time observations).
Can we have a hard crash of the pipeline when this happens? Any suggestions on how to implement it?