caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0
29 stars 6 forks source link

selfcal worker crashes at cubical step when using singularity #850

Closed SpheMakh closed 4 years ago

SpheMakh commented 4 years ago

The container fails before cubical is invoked, so this either a problem with the singularity installation or stimela.

Running: singularity instance.start  --bind /data/sphe/OT-DATA/Stimela/stimela:/scratch/stimela:ro --bind /data/sphe/OT-DATA/stimela_parameter_files/calibrate_cubical_1_0-140588404261744158479576409737.json:/scratch/configfile:ro --bind /data/sphe/OT-DATA/Stimela/stimela/cargo/cab/cubical/src:/scratch/code:ro --bind /data/sphe/OT-DATA/Stimela/stimela/cargo/cab/singularity_run:/singularity:rw --bind /data/sphe/OT-DATA/msdir:/scratch/msdir:rw --bind /data/sphe/OT-DATA/input:/scratch/input:ro --bind /data/sphe/OT-DATA/output/logs/.current_job.log:/scratch/logfile:rw --bind /data/sphe/OT-DATA/output:/scratch/output:rw --bind /data/sphe/OT-DATA/output/tmp:/scratch/output/tmp:rw -c IMAGES/stimela_cubical_1.3.1.img 98e
Starting container [98e]. Timeout set to -1. The container ID is printed below.
Running: singularity run instance://98e
INFO:STIMELA-11:Recipe execution failed while running job calibrate_cubical_1_0

OS Description: Ubuntu 16.04.5 LTS Release: 16.04 Codename: xenial

Container tech Singularity 2.4-dist

bennahugo commented 4 years ago

Running singularity version 2.6.0-dist on stills and it works for me Possible cause for the issue. I've seen an API change from @KshitijT's trace

bennahugo commented 4 years ago

Double checking with my pipeline just now

bennahugo commented 4 years ago

Pulling

stimela pull --force -s -im stimela/cubical:1.3.1  

Works, Python 2.7:

2020-03-23 10:19:27,793 - STIMELA - INFO - STEP 1 :: de_calibrate_decal_3_0
INFO:STIMELA:STEP 1 :: de_calibrate_decal_3_0
INFO:root:Validating parameters...       CAB = cubical
INFO:root:Parameters validated and saved. Parameter file is: stimela_parameter_files/de_calibrate_decal_3_0-140445341571856158495155783.json
2020-03-23 10:19:27,808 de_calibrate_decal_3_0  - Instantiating container [643]. Timeout set to -1. The container ID is printed below.
Running: singularity instance.start  --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela:/scratch/stimela:ro --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/stimela_parameter_files/de_calibrate_decal_3_0-140445341571856158495155783.json:/scratch/configfile:ro --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela/cargo/cab/cubical/src:/scratch/code:ro --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela/cargo/cab/singularity_run:/singularity:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/msdir:/scratch/msdir:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/input:/scratch/input:ro --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output/.currentjob.log:/scratch/logfile:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output:/scratch/output:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output/tmp:/scratch/output/tmp:rw -c /home/hugo/.stimela_images/stimela_cubical_1.3.1.img 6432020-03-23 10:19:29,817 de_calibrate_decal_3_0  - Starting container [643]. Timeout set to -1. The container ID is printed below.
2020-03-23 10:19:29,818 de_calibrate_decal_3_0  - Running: singularity run instance://643 

Starting virtual X frame buffer: Xvfb.
2020-03-23 10:19:30,429 de_calibrate_decal_3_0  - tch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg:/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --bbc-save-to {out[name]}-BBC-field_{sel[field]}-ddid_{sel[ddid]}.parmdb --dist-max-chunks 20 --dist-nthread 20 --out-name /scratch/output/J1131-19551584951557.83 --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-model-column MODEL_OUT --out-subtract-dirs 1: --out-casa-gaintables 0 --out-plots 1 --dist-nworker 20 --log-verbose solver=2 --g-time-int 32 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd-time-int 32 --dd-freq-int 64 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag --dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1
2020-03-23 10:19:31,432 de_calibrate_decal_3_0  - /usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:935: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please install IPython inside the virtualenv.
2020-03-23 10:19:31,432 de_calibrate_decal_3_0  -   warn("Attempting to work in a virtualenv. If you encounter problems, please "
2020-03-23 10:19:35,045 de_calibrate_decal_3_0  - INFO       - 08:19:35 - solver             [0.2/0.2 5.2/5.2 0.0Gb] set console verbosity level solver=2
2020-03-23 10:19:35,045 de_calibrate_decal_3_0  - INFO       - 08:19:35 - main               [0.2/0.2 5.2/5.2 0.0Gb] started /usr/local/bin/gocubical --sol-jones g,dd --data-ms /scratch/msdir/A1300.1GC.dbg.ms --data-column CORRECTED_DATA --data-time-chunk 32 --data-freq-chunk 0 --model-list MODEL_DATA+-/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg:/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --bbc-save-to {out[name]}-BBC-field_{sel[field]}-ddid_{sel[ddid]}.parmdb --dist-max-chunks 20 --dist-nthread 20 --out-name /scratch/output/J1131-19551584951557.83 --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-model-column MODEL_OUT --out-subtract-dirs 1: --out-casa-gaintables 0 --out-plots 1 --dist-nworker 20 --log-verbose solver=2 --g-time-int 32 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd-time-int 32 --dd-freq-int 64 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag --dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1                                                                                                                                  
2020-03-23 10:19:35,045 de_calibrate_decal_3_0  - INFO       - 08:19:35 - main               [0.2/0.2 5.2/5.2 0.0Gb] output basename explicitly set to /scratch/output/J1131-19551584951557.83, --out-dir setting ignored
2020-03-23 10:19:35,045 de_calibrate_decal_3_0  - INFO       - 08:19:35 - main               [0.2/0.2 5.2/5.2 0.0Gb]  Selected Options:
2020-03-23 10:19:35,046 de_calibrate_decal_3_0  - INFO       - 08:19:35 - main               [0.2/0.2 5.2/5.2 0.0Gb] [data] Visibility data options
2020-03-23 10:19:35,046 de_calibrate_decal_3_0  - INFO       - 08:19:35 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - ms ................................................ = /scratch/msdir/A1300.1GC.dbg.ms
2020-03-23 10:19:35,046 de_calibrate_decal_3_0  - INFO       - 08:19:35 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - column ............................................ = CORRECTED_DATA
bennahugo commented 4 years ago

Works Python 3.6.8 (after unrelated residual python 3 fixes to the npyscreen package):

INFO:STIMELA:STEP 1 :: de_calibrate_decal_3_0
INFO:root:Validating parameters...       CAB = cubical
INFO:root:Parameters validated and saved. Parameter file is: stimela_parameter_files/de_calibrate_decal_3_0-1401676048785121584952204901566.json
Instantiating container [685]. Timeout set to -1. The container ID is printed below.
Running: singularity instance.start  --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela:/scratch/stimela:ro --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/stimela_parameter_files/de_calibrate_decal_3_0-1401676048785121584952204901566.json:/scratch/configfile:ro --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela/cargo/cab/cubical/src:/scratch/code:ro --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela/cargo/cab/singularity_run:/singularity:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/msdir:/scratch/msdir:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/input:/scratch/input:ro --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output/.currentjob.log:/scratch/logfile:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output:/scratch/output:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output/tmp:/scratch/output/tmp:rw -c /home/hugo/.stimela_images/stimela_cubical_1.3.1.img 685
Starting container [685]. Timeout set to -1. The container ID is printed below.
Running: singularity run instance://685 

Starting virtual X frame buffer: Xvfb.
tch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg:/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --bbc-save-to {out[name]}-BBC-field_{sel[field]}-ddid_{sel[ddid]}.parmdb --dist-max-chunks 20 --dist-nthread 20 --out-name /scratch/output/J1131-19551584952204.9014888 --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-model-column MODEL_OUT --out-subtract-dirs 1: --out-casa-gaintables 0 --out-plots 1 --dist-nworker 20 --log-verbose solver=2 --g-time-int 32 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd-time-int 32 --dd-freq-int 64 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag --dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:935: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please install IPython inside the virtualenv.
  warn("Attempting to work in a virtualenv. If you encounter problems, please "
INFO       - 08:30:17 - solver             [0.2/0.2 5.2/5.2 0.0Gb] set console verbosity level solver=2
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb] started /usr/local/bin/gocubical --sol-jones g,dd --data-ms /scratch/msdir/A1300.1GC.dbg.ms --data-column CORRECTED_DATA --data-time-chunk 32 --data-freq-chunk 0 --model-list MODEL_DATA+-/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg:/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --bbc-save-to {out[name]}-BBC-field_{sel[field]}-ddid_{sel[ddid]}.parmdb --dist-max-chunks 20 --dist-nthread 20 --out-name /scratch/output/J1131-19551584952204.9014888 --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-model-column MODEL_OUT --out-subtract-dirs 1: --out-casa-gaintables 0 --out-plots 1 --dist-nworker 20 --log-verbose solver=2 --g-time-int 32 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd-time-int 32 --dd-freq-int 64 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag --dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb] output basename explicitly set to /scratch/output/J1131-19551584952204.9014888, --out-dir setting ignored
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  Selected Options:
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb] [data] Visibility data options
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - ms ................................................ = /scratch/msdir/A1300.1GC.dbg.ms
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - column ............................................ = CORRECTED_DATA
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - time-chunk ........................................ = 32
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - freq-chunk ........................................ = 0
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - rebin-time ........................................ = 1
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - rebin-freq ........................................ = 1
INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb]  - chunk-by .......................................... = SCAN_NUMBER
bennahugo commented 4 years ago

Parameters used:

INFO       - 08:30:17 - main               [0.2/0.2 5.2/5.2 0.0Gb] started /usr/local/bin/gocubical --sol-jones g,dd --data-ms /scratch/msdir/A1300.1GC.dbg.ms --data-column CORRECTED_DATA --data-time-chunk 32 --data-freq-chunk 0 --model-list MODEL_DATA+-/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg:/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --bbc-save-to {out[name]}-BBC-field_{sel[field]}-ddid_{sel[ddid]}.parmdb --dist-max-chunks 20 --dist-nthread 20 --out-name /scratch/output/J1131-19551584952204.9014888 --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-model-column MODEL_OUT --out-subtract-dirs 1: --out-casa-gaintables 0 --out-plots 1 --dist-nworker 20 --log-verbose solver=2 --g-time-int 32 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd-time-int 32 --dd-freq-int 64 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag --dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1

Can't reproduce the problem above.

paoloserra commented 4 years ago

@bennahugo the initial post of this issue is that the container fails before cubical is invoked. I'm happy to hear that you don't have that problem, but I'm not sure we've found a solution that will work for everybody yet. Can we keep this issue open until it is clear what is causing the failure, and how all users can prevent it from happening?

gigjozsa commented 4 years ago

Again, is it possible to get a diagnose for future reference? I am not sure I understand what the core of the issue was.

bennahugo commented 4 years ago

The problem is it works straight out of the box for me on master using singularity 2.6.0-dist. Can you upgrade your version of singularity because as @paoloserra says it is failing before it even starts indicating that the container is never started.

bennahugo commented 4 years ago

Nothing fundamentally wrong in the python as far as I can tell. Can you please read output/.current_log after failure to see if nothing is printed there? Thanks!

SpheMakh commented 4 years ago

After much digging, I finally got the error from the cubical run:

meerkathi - 2020-03-23 21:57:24,866 CRITICAL - model-list MODEL_DATA --model-ddes auto --montblanc-dtype float --weight-column WEIGHT --madmax-enable 1 --madmax-estimate corr --madmax-plot 1 --madmax-threshold 0,10 --sol-term-iters 5,0 --bbc-save-to /scratch/output/continuum/selfcal_products/bbc-gains-1-1557766852_32chan-CGCG021_063_corr.parmdb --dist-ncpu 5 --dist-max-chunks 4 --out-name /scratch/output/continuum/selfcal_products/mypipelinerun_1557766852_32chan-CGCG021_063_corr_1_cubical --out-overwrite 1 --out-mode sc --out-casa-gaintables 1 --out-plots 1 --dist-nworker 0 --log-boring 1 --g-time-int 120 --g-freq-int 0 --g-clip-low 0.5 --g-clip-high 2.0 --g-solvable 1 --g-type phase-diag --g-save-to /scratch/output/continuum/selfcal_products/g-gains-1-1557766852_32chan-CGCG021_063_corr.parmdb --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 16 --degridding-MaxFacetSize 0.25 --degridding-MinNFacetPerAxis 1
meerkathi - 2020-03-23 21:57:25,971 CRITICAL - /usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:935: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please install IPython inside the virtualenv.
meerkathi - 2020-03-23 21:57:25,971 CRITICAL -   warn("Attempting to work in a virtualenv. If you encounter problems, please "
meerkathi - 2020-03-23 21:57:27,076 CRITICAL - Traceback (most recent call last):
meerkathi - 2020-03-23 21:57:27,077 CRITICAL -   File "/usr/local/bin/gocubical", line 5, in <module>
meerkathi - 2020-03-23 21:57:27,077 CRITICAL -     from cubical.main import main
meerkathi - 2020-03-23 21:57:27,077 CRITICAL -   File "/usr/local/lib/python3.6/dist-packages/cubical/main.py", line 93, in <module>
meerkathi - 2020-03-23 21:57:27,078 CRITICAL -     from cubical.data_handler.ms_data_handler import MSDataHandler
meerkathi - 2020-03-23 21:57:27,078 CRITICAL -   File "/usr/local/lib/python3.6/dist-packages/cubical/data_handler/ms_data_handler.py", line 24, in <module>
meerkathi - 2020-03-23 21:57:27,078 CRITICAL -     from cubical.data_handler.ms_tile import RowChunk, MSTile
meerkathi - 2020-03-23 21:57:27,078 CRITICAL -   File "/usr/local/lib/python3.6/dist-packages/cubical/data_handler/ms_tile.py", line 22, in <module>
meerkathi - 2020-03-23 21:57:27,079 CRITICAL -     from .DicoSourceProvider import DicoSourceProvider
meerkathi - 2020-03-23 21:57:27,079 CRITICAL -   File "/usr/local/lib/python3.6/dist-packages/cubical/data_handler/DicoSourceProvider.py", line 3, in <module>
meerkathi - 2020-03-23 21:57:27,079 CRITICAL -     from DDFacet.Imager import ModModelMachine as MMM
meerkathi - 2020-03-23 21:57:27,079 CRITICAL -   File "/usr/local/lib/python3.6/dist-packages/DDFacet/__init__.py", line 23, in <module>
meerkathi - 2020-03-23 21:57:27,080 CRITICAL -     __version__ = pkg_resources.require("DDFacet")[0].version
meerkathi - 2020-03-23 21:57:27,080 CRITICAL -   File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 901, in require
meerkathi - 2020-03-23 21:57:27,080 CRITICAL -     needed = self.resolve(parse_requirements(requirements))
meerkathi - 2020-03-23 21:57:27,080 CRITICAL -   File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 792, in resolve
meerkathi - 2020-03-23 21:57:27,081 CRITICAL -     raise VersionConflict(dist, req).with_context(dependent_req)
meerkathi - 2020-03-23 21:57:27,081 CRITICAL - pkg_resources.ContextualVersionConflict: (scipy 1.4.1 (/usr/local/lib/python3.6/dist-packages), Requirement.parse('scipy<=1.3.3'), {'DDFacet'})
meerkathi - 2020-03-23 21:57:27,081 CRITICAL - If you suspect this is an IPython 7.12.0 bug, please report it at:
meerkathi - 2020-03-23 21:57:27,081 CRITICAL -     https://github.com/ipython/ipython/issues
meerkathi - 2020-03-23 21:57:27,082 CRITICAL - or send an email to the mailing list at ipython-dev@python.org
meerkathi - 2020-03-23 21:57:27,082 CRITICAL - You can print a more detailed traceback right now with "%tb", or use "%debug"
meerkathi - 2020-03-23 21:57:27,082 CRITICAL - to interactively debug it.
meerkathi - 2020-03-23 21:57:27,082 CRITICAL - Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
meerkathi - 2020-03-23 21:57:27,083 CRITICAL -     %config Application.verbose_crash=True
meerkathi - 2020-03-23 21:57:28,891 CRITICAL - Traceback (most recent call last):
meerkathi - 2020-03-23 21:57:28,891 CRITICAL -   File "/scratch/code/run.py", line 55, in <module>
meerkathi - 2020-03-23 21:57:28,891 CRITICAL -     utils.xrun(cab['binary'], parset+opts)
meerkathi - 2020-03-23 21:57:28,892 CRITICAL -   File "/scratch/stimela/utils/__init__.py", line 181, in xrun
meerkathi - 2020-03-23 21:57:28,892 CRITICAL -     '%s: returns errr code %d' % (command, process.returncode))
meerkathi - 2020-03-23 21:57:28,892 CRITICAL - utils.StimelaCabRuntimeError: gocubical: returns errr code 1

Looks like a broken DDF dependency

SpheMakh commented 4 years ago

May need to teak the DDF Dockerfile a little bit

KshitijT commented 4 years ago

This is in the stimela master branch?

SpheMakh commented 4 years ago

Yup, but only for singularity. Docker works just fine

bennahugo commented 4 years ago

Please do not modify the images. I think I know what is going on: I just tested cubical in Python2.7 and 3 with singularity and the very latest image in master as I've shown - both on stills.ru and my desktop.

Three things:

bennahugo commented 4 years ago

I find singularity a very poor substitute for docker. Maybe we should drop support in favour of podman

bennahugo commented 4 years ago

I'm running the same test again myself with the steps outlined above. Give me a few minutes.

bennahugo commented 4 years ago

Confirmed problem is your setup. Please check:

WARNING: pull for Docker Hub is not guaranteed to produce the
WARNING: same image on repeated pull. Use Singularity Registry
WARNING: (shub://) to pull exactly equivalent images.
Docker image path: index.docker.io/stimela/cubical:1.3.1
Cache folder set to /home/hugo/.singularity/docker

Remove the cache folder, remove the image and remove .local. Repull the image

INFO:STIMELA:STEP 1 :: de_calibrate_decal_3_0
INFO:root:Validating parameters...       CAB = cubical
INFO:root:Parameters validated and saved. Parameter file is: stimela_parameter_files/de_calibrate_decal_3_0-1397541405695201585030427110587.json
Instantiating container [85e]. Timeout set to -1. The container ID is printed below.
Running: singularity instance.start  --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela:/scratch/stimela:ro --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/stimela_parameter
_files/de_calibrate_decal_3_0-1397541405695201585030427110587.json:/scratch/configfile:ro --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela/cargo/cab/cubical/src:/scra
tch/code:ro --bind /home/hugo/CLUSTER_SURVEY/A1300/stimela/stimela/cargo/cab/singularity_run:/singularity:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/msdir:/scratch/msdir
:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/input:/scratch/input:ro --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output/.currentjob.log:/scratch/logfile:rw --bind /home/hu
go/CLUSTER_SURVEY/A1300.dbg/output:/scratch/output:rw --bind /home/hugo/CLUSTER_SURVEY/A1300.dbg/output/tmp:/scratch/output/tmp:rw -c /home/hugo/.stimela_images/stimela_cub
ical_1.3.1.img 85e
Starting container [85e]. Timeout set to -1. The container ID is printed below.
Running: singularity run instance://85e

Starting virtual X frame buffer: Xvfb.
tch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg:/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg --model-ddes auto --w
eight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term
-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --bbc-save-to {out[name]}-BBC-field_{sel[field]}-ddid_{sel[ddid]}.parmdb --dist-max-chunks 20 --dist-nthread 20 --out-n
ame /scratch/output/J1131-19551585030427.1105072 --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-model-column MODEL_OUT --out-subtract-dirs 1: --out-casa-gaintab
les 0 --out-plots 1 --dist-nworker 20 --log-verbose solver=2 --g-time-int 32 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag 
--g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd-time-int 32 --dd-freq-int 64 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag -
-dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --d
egridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1
^[[B^[[B^[[B/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:935: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please 
install IPython inside the virtualenv.
  warn("Attempting to work in a virtualenv. If you encounter problems, please "
INFO       - 06:13:57 - solver             [0.2/0.2 5.2/5.2 0.0Gb] set console verbosity level solver=2
INFO       - 06:13:57 - main               [0.2/0.2 5.2/5.2 0.0Gb] started /usr/local/bin/gocubical --sol-jones g,dd --data-ms /scratch/msdir/A1300.1GC.dbg.ms --data-column
 CORRECTED_DATA --data-time-chunk 32 --data-freq-chunk 0 --model-list MODEL_DATA+-/scratch/output/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg:/scratch/out
put/J1131-1955_decal_3.DicoModel@/scratch/output/decal_3.0.dE.reg --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --m
admax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --bbc-save-to {out[name]}-BBC-fi
eld_{sel[field]}-ddid_{sel[ddid]}.parmdb --dist-max-chunks 20 --dist-nthread 20 --out-name /scratch/output/J1131-19551585030427.1105072 --out-overwrite 1 --out-mode sr --ou
t-column DE_DATA --out-model-column MODEL_OUT --out-subtract-dirs 1: --out-casa-gaintables 0 --out-plots 1 --dist-nworker 20 --log-verbose solver=2 --g-time-int 32 --g-freq
-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd
-time-int 32 --dd-freq-int 64 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag --dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 -
-degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --de
gridding-MinNFacetPerAxis 1   
INFO       - 06:13:57 - main               [0.2/0.2 5.2/5.2 0.0Gb] output basename explicitly set to /scratch/output/J1131-19551585030427.1105072, --out-dir setting ignored
INFO       - 06:13:57 - main               [0.2/0.2 5.2/5.2 0.0Gb]  Selected Options:

I am sitting on Python 3.6.8:

commit a62cfbf839b96fc97e770259b49fb925ec373e73 (HEAD -> master, origin/master, origin/HEAD)
Merge: accc93d d901c59
Author: Sphesihle Makhathini <sphemakh@gmail.com>
Date:   Sun Mar 22 20:56:24 2020 +0200

    Merge pull request #533 from ratt-ru/log2file

    revive step file logging
SpheMakh commented 4 years ago

Fixed in https://github.com/ratt-ru/Stimela/commit/250080c8a9b3608e15a6214f1a1f032cd66bc4c1

gigjozsa commented 4 years ago

Testing this now...

gigjozsa commented 4 years ago

I used MeerKATHI master with Stimela master, crashed when executing cubical in singularity, all info below.

@SpheMakh please see the test results: System information: master_sysinfo.txt Script executed: master.sh.txt Config file: carateConfig_singularity.yml.txt Resulting log: log-meerkathi_test_carateConfig_singularity.txt

The first run using docker ran through all right.

bennahugo commented 4 years ago

@SpheMakh I asked you not to install dependencies into the image. I do not think it is the image itself is at fault. Have you tested removing the singularity cache? Somewhere there is is a SHA mismatch coming in because this image works correctly in Docker (I'm using that on stevie). Please do not merge this yet.

bennahugo commented 4 years ago

I request that we revert this, because I want to get to the root cause and not just put in a temporary kludge plate to fix this. If it is a core bug in sigularity we need to try raise the issue upstream

gigjozsa commented 4 years ago

@bennahugo , I think this is a Stimela issue. If true, could you discuss in Stimela github, such that we have a clear separation (not that @SpheMakh would not listen at both sites)?

SpheMakh commented 4 years ago

@gigjozsa did you follow these instructions https://github.com/ska-sa/meerkathi/issues/850#issuecomment-603038752 ? If you did it should work

SpheMakh commented 4 years ago

@gigjozsa did you follow these instructions #850 (comment) ? If you did it should work

Ignore this, I confused this with a different issue.

The issue here is not cubical or stimela, it's that you machine uses an outdated version of singularity. If you upgrade to at least version 2.6.0-dist the error will go away because it's been fixed in that version of singularity.