Selfcal worker trips over empty source list

gigjozsa commented 4 years ago

This is using the standard suite of test data sets. Using the carateConfig.yml: carateConfig-docker.yml.txt I get the following report: carateConfig-docker-log-caracal.txt The selfcal worker tries to transfer the model and suddenly fails, likely because of an empty source list.

This has been run with the following configuration: carateConfig-docker-sysinfo.txt

And these are the relevant lines:

# running docker start -a transfer_model_field0_ms0-14062048178590415886920228526018
# 2020-05-05 15:20:29 | INFO     | crystalball:_predict | Crystalball version 0.3.0
# 2020-05-05 15:20:29 | INFO     | wsclean:import_from_wsclean | /stimela_mount/output/continuum/image_4/mypipelinerun_circinus_p1_4-sources.txt contains 0 components
# 2020-05-05 15:20:29 | INFO     | wsclean:import_from_wsclean | Total flux of 0 selected components is 0.000000 Jy
# 2020-05-05 15:20:29 | INFO     | ms:ms_preprocess | inserting new column MODEL_DATA
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | --------------------------------------------------
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | Budgeting
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | --------------------------------------------------
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | system RAM = 15.58 GB
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | nr of logical CPUs = 12
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | nr sources = 0
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | nr rows    = 47616
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | nr chans   = 50
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | nr corrs   = 2
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | sources per chunk = 0 (auto settings)
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | rows per chunk    = 47616 (auto settings)
# 2020-05-05 15:20:29 | INFO     | budget:get_budget | expected memory usage = 0.53 GB
# Successful read/write open of default-locked table /stimela_mount/msdir/1524929477-circinus_p1_corr.ms: 26 columns, 47616 rows
# Traceback (most recent call last):
#   File "/usr/local/bin/crystalball", line 8, in <module>
#     sys.exit(predict())
#   File "/usr/local/lib/python3.6/dist-packages/crystalball/crystalball.py", line 155, in predict
#     return _predict(args)
#   File "/usr/local/lib/python3.6/dist-packages/crystalball/crystalball.py", line 232, in _predict
#     frequency)
#   File "/usr/local/lib/python3.6/dist-packages/africanus/rime/dask_predict.py", line 482, in wsclean_predict
#     dtype=spectrum_dtype)
#   File "/usr/local/lib/python3.6/dist-packages/dask/array/blockwise.py", line 145, in blockwise
#     chunkss, arrays = unify_chunks(*args)
#   File "/usr/local/lib/python3.6/dist-packages/dask/array/core.py", line 3217, in unify_chunks
#     for n, j in enumerate(i)
#   File "/usr/local/lib/python3.6/dist-packages/dask/array/core.py", line 3217, in <genexpr>
#     for n, j in enumerate(i)
# IndexError: tuple index out of range
# Traceback (most recent call last):
#   File "/stimela_mount/code/run.py", line 40, in <module>
#     subprocess.check_call(shlex.split(_runc))
#   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
#     raise CalledProcessError(retcode, cmd)
# subprocess.CalledProcessError: Command '['crystalball', '--sky-model', '/stimela_mount/output/continuum/image_4/mypipelinerun_circinus_p1_4-sources.txt', '--output-column', 'MODEL_DATA', '--row-chunks', '0', '--model-chunks', '0', '--num-sources', '0', '--num-workers', '12', '--memory-fraction', '0.45', '/stimela_mount/msdir/1524929477-circinus_p1_corr.ms']' returned non-zero exit status 1.
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR: docker returns error code 1
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR: job failed at 2020-05-05 17:20:29.861607 after 0:00:03.963596
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR: Traceback (most recent call last):
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:   File "/home/jozsa/software/caracal_tests/master/caracal_venv/lib/python3.7/site-packages/stimela/recipe.py", line 630, in run
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:     job.run_job()
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:   File "/home/jozsa/software/caracal_tests/master/caracal_venv/lib/python3.7/site-packages/stimela/recipe.py", line 365, in run_job
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:     self.job.start(output_wrangler=self.apply_output_wranglers)
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:   File "/home/jozsa/software/caracal_tests/master/caracal_venv/lib/python3.7/site-packages/stimela/docker.py", line 171, in start
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:     kill_callback=lambda: utils.xrun("docker", ["kill", self.name]))
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:   File "/home/jozsa/software/caracal_tests/master/caracal_venv/lib/python3.7/site-packages/stimela/utils/xrun_poll.py", line 189, in xrun
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR:     raise StimelaCabRuntimeError("{} returns error code {}".format(command_name, status))
2020-05-05 17:20:29 CARACal.Stimela.transfer_model-field0-ms0 ERROR: stimela.utils.StimelaCabRuntimeError: docker returns error code 1

gigjozsa commented 4 years ago

PPS: To reproduce please run a master caracal together with carateConfig and the "standard" input files measurement sets 1524929477.ms 1524947605.ms 1532022061.ms.

PeterKamphuis commented 4 years ago

@gigjozsa So do you have a model in image 4?

paoloserra commented 4 years ago

@PeterKamphuis sorry we were discussing this offline. Brief recap:

Up to image3 all is good. Then cubical3 flags >99% of the data, and image4 is empty (even the dirty image). NO wonder the source list is empty. This is all in the log attached by Josh above.

PeterKamphuis commented 4 years ago

@paoloserra Ah, can you try adding

  cal_cubical:
    madmax_flagging: false

to the config and see what happens? Looks like the last timescale is too short.

paoloserra commented 4 years ago

well, @gigjozsa would have to try that, I'm fine! :)

PeterKamphuis commented 4 years ago

@gigjozsa This seems to run fine with the following settings for selfcal:

selfcal:
  enable: true
  overwrite_flag_versions: true
  label: cont
  ncpu: 12
  img_npix: 3600
  img_cell: 2
  img_niter: 1000000
  img_nchans: 6
  img_robust: -1.0
  cal_niter: 3
  cal_cubical:
    channel_chunk: 0
    shared_memory: 250GB
    max_post_error: 0.25
  cal_timeslots_chunk: 360
  image:
    enable: true
    clean_mask_threshold: [15,10,5,5]
    clean_threshold: [0.5,0.5,0.5,0.5]
  calibrate:
    enable: true
    model: ['1','2','3']
    gain_matrix_type: ['GainDiagPhase', 'GainDiagPhase', 'GainDiagPhase']
    Gsols_channel: [0, 0, 0]
    Gsols_timeslots: [120, 60, 20]
  transfer_apply_gains:
    enable: true
    interpolate:
      timeslots_int: 20
      timeslots_chunk: 360
      channel_int: 0
      channel_chunk: 0
    transfer_to_label: corr
  transfer_model:
    enable: true
    num_workers: 12
    memory_fraction: 0.45
    transfer_to_label: corr

I'm not sure what you want to do about this as it is basically a bad data set with bad config settings. We have discussed it before but I'm not sure we arrived to a conclusion. If you use pybdsm the pipeline will already gracefully exit if it finds no sources. Obviously we need to adjust the carateConfig.yml to something sensible but whether we do anything else when the model field is empty is another question.

BTW In my original test the trip up was already in the first iteration, however it then recovered in the second and went flagging mental again in the third (i.e. images 1 and 3 were fine, 2 and 4 were empty).

I don't understand why this has changed recently and can only think of a change in the inner workings of cubical or that it has always been a close call and the new crosscal has changed things slightly in the input image. I don't think the default in the selfcal worker have changed recently. But as I said I'm just guessing on that.

gigjozsa commented 4 years ago

Thank you, yes, I was playing around with that, too. Let us for now push an adjusted carateConfig file. I will sneak it in and retest. But the discussion to have representative config files together with data sets to test is of course open.

PeterKamphuis commented 4 years ago

Well I was more wondering about the question whether we should exit the pipeline when a image is empty.

gigjozsa commented 4 years ago

We should if a data set gets fully flagged with a note.

caracal-pipeline / caracal

Selfcal worker trips over empty source list #1044