caracal-pipeline / caracal

Containerized Automated Radio Astronomy Calibration (CARACal) pipeline
GNU General Public License v2.0

CARACal INFO: exiting with error code 1 #1533

Closed: kmhess closed this issue 1 month ago

kmhess commented 10 months ago

Hi,

I'm trying to run CARACal on a Slurm system that is not ilifu. As far as I know it is "correctly" installed (although I did not install it myself). Nonetheless, something is clearly wrong, because I get the following error and I don't know what to do with it. The .stimela_workdir-16938348029524467 directory is present at some point while caracal is running, but has disappeared by the time it crashes out with this error message. I tried the export SINGULARITY_PULLFOLDER=/scratch/users/putyourusernamehere/STIMELA_IMAGES_NEW recommendation from https://github.com/caracal-pipeline/caracal/issues/1087, but it made no difference.

2023-09-04 15:45:20 CARACal.Stimela.summary_json-ms0-0 ERROR: cd /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 && singularity run --userns --workdir /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 --containall returns error code 1
2023-09-04 15:45:20 CARACal.Stimela.summary_json-ms0-0 ERROR: job failed at 2023-09-04 15:45:20.739173 after 0:00:28.085690
2023-09-04 15:45:21 CARACal ERROR: Job 'summary_json-ms0-0:: Get observation information as a json file ms=1613847072_sdp_l0_HI-cal.ms' failed: cd /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 && singularity run --userns --workdir /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 --containall returns error code 1 [PipelineException]
2023-09-04 15:45:21 CARACal INFO:   More information can be found in the logfile at output_1613847072/logs-20230904-153954/log-caracal.txt
2023-09-04 15:45:21 CARACal INFO: exiting with error code 1

Thanks in advance for your help.

KshitijT commented 10 months ago

@kmhess, could you please share the log and the config file?

pharaofranz commented 10 months ago

Just a bit more info about the issue we're facing (I did the install, broadly following what's on GitHub). This is a system that runs apptainer; all stimela images live in a directory pointed at via ${CARACAL_IMAGES}. The pipeline runs in an environment with Python 3.9.6 and is invoked as

caracal -c config.yml -ct singularity -sid ${CARACAL_IMAGES}

As far as I can tell, the singularity images are found and individual tasks run, but the return code from apptainer is (mis?)interpreted as an error. If one re-runs the same caracal command, it finds the previously generated output, skips that step and moves on to the next task. This run also completes its tasks but still seems to return an error. In the attached log files one can see that it takes three caracal runs for the pipeline to finish. The output of listobs, summary_json-ms0, and elevation-plots-ms0 all looks fine, despite the 'exit code 1' messages.

I found a vaguely related issue #1361 but in our case it's not the ctrl-c'ed singularity images.

The config looks like this:

```yaml
schema_version: 1.0.4

general:
  title: ''
  rawdatadir: '/cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/0_HI_raw'
  msdir: msdir
  input: input
  output: output
  prefix: '1613847072'  # !!! rename for each data set
  final_report: false

getdata:
  dataid: ['1613847072_sdp_l0_HI']

obsconf:
  obsinfo:
    enable: true
  target:
```

logs-20230904-171450.txt logs-20230904-172054.txt logs-20230904-171822.txt

paoloserra commented 10 months ago

Have you tried with CARACal's latest stable release?

pharaofranz commented 10 months ago

That's v1.0.7, is it? -- not yet.

paoloserra commented 10 months ago

Yep that's right -- with an unfortunate error here

https://github.com/caracal-pipeline/caracal/blob/84299e30a6197f59721146e2c8063d61be694432/setup.py#L30

pharaofranz commented 10 months ago

It's actually not entirely clear to me which version we're running. pip tells me it's 1.1.1, caracal --version tells me it's 1.0.6 as also seen in the logs.

paoloserra commented 10 months ago

mmmm... I'll let others comment on this; I'm not the best in the team with software versions, pip, etc. (as the above error shows)

pharaofranz commented 10 months ago

No joy with v1.0.7 -- exactly the same issue. logs-20230905-103639.txt logs-20230905-103933.txt logs-20230905-104124.txt

paoloserra commented 10 months ago

Which Stimela version are you running? I would suggest making one last try with Stimela 1.7.6, which is the stable release I use successfully with CARACal 1.0.7.

Sorry for the pain ...

pharaofranz commented 10 months ago

I was running 1.7.8, downgraded to 1.7.6. No joy :-( Same issue.

pharaofranz commented 10 months ago

Is it possible to get more info from the singularity images than just 'finished with exit code 1'? The temporary work directory .stimela_workdir-<random_number_string> disappears right after the pipeline runs, so one cannot re-run the failing command by hand. Using the --debug flag didn't really help much either yet.

pharaofranz commented 10 months ago

I now reverse-engineered what caracal is doing in terms of running singularity images.

I ran caracal again with --debug enabled, as I realized that one then gets to see the actual singularity command that is run. Once the script drops into pdb, I copy/rsync the content of .stimela_workdir-<random-number> somewhere else before exiting pdb. I then copy that directory back to where it was, with the same name as before. Next I define the environment variables that are reported as set (e.g. export SINGULARITYENV_STIMELA_MOUNT=/stimela_mount; export SINGULARITYENV_OUTPUT=${SINGULARITYENV_STIMELA_MOUNT}/output, and so on). Then I rerun the singularity command as reported by the pipeline, i.e.

cd /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493 && singularity run --workdir /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493 --containall --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493/stimela_parameter_files/elevation_plots_ms0-2304369160654416939197497773852.json:/stimela_mount/configfile:ro --bind /cephyr/NOBACKUP/groups/hess/franz/software/caracal-1.0.7/lib/python3.9/site-packages/stimela/cargo/cab/owlcat_plotelev/src:/stimela_mount/code:ro --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493/passwd:/etc/passwd:rw --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493/group:/etc/group:rw --bind /cephyr/NOBACKUP/groups/hess/franz/software/caracal-1.0.7/bin/stimela_runscript:/singularity:ro --bind /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/0_HI_raw:/stimela_mount/msdir:rw --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/input:/stimela_mount/input:ro --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/msdir:/stimela_mount/output:rw --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/msdir/tmp:/stimela_mount/output/tmp:rw /cephyr/NOBACKUP/groups/hess/franz/software/caracal-images2/stimela_owlcat_1.6.6.img /singularity

As expected, this finishes without errors and I get a nice new elevation plot.

IMHO this has nothing to do with singularity or apptainer. But I have a hard time figuring out where else to look now. I am attaching the latest log I have, with --debug enabled: [log-caracal.txt](https://github.com/caracal-pipeline/caracal/files/12525157/log-caracal.txt)
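For the record, the environment-variable part of a manual replay like the one above can be condensed to a small sketch. The two variable names are the ones reported in the debug output quoted earlier; any further SINGULARITYENV_* variables would follow the same pattern.

```shell
# Recreate the container environment stimela reports as set before re-running
# the singularity command by hand (names from the pipeline's --debug output;
# other SINGULARITYENV_* variables follow the same pattern).
export SINGULARITYENV_STIMELA_MOUNT=/stimela_mount
export SINGULARITYENV_OUTPUT=${SINGULARITYENV_STIMELA_MOUNT}/output
echo "output inside the container maps to: ${SINGULARITYENV_OUTPUT}"
```

With these set, the `cd ... && singularity run ...` line from the debug output can be pasted back into the shell as-is.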

paoloserra commented 10 months ago

@SpheMakh I think we need your input here

pharaofranz commented 10 months ago

We might have solved the issue. Until now, I had first loaded the python 3.9.6 module via module load python ... and then created the virtual environment. This puts a bunch of things into $LD_LIBRARY_PATH and ${PATH}.

If I do not load the module but instead use the system python (3.6.8) to create the virtual env and then build caracal (v1.0.7, stimela version 1.7.9), everything runs just fine. I am attaching the output of env for both cases, with python loaded as a module and with the system python. Maybe that helps to find the problem.

module_load_python_3.9.6_env.txt system_python_3.6.8_env.txt
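A minimal sketch of the working setup described above, for anyone hitting the same wall. The venv path is a placeholder, and `--without-pip` is only there to keep the sketch offline-friendly; a real install would omit it and run the pip install inside the venv.

```shell
# Create the venv from the system python WITHOUT `module load python` first,
# so that $LD_LIBRARY_PATH and $PATH stay clean.
# (venv path is a placeholder; --without-pip only keeps this sketch offline)
VENV="${TMPDIR:-/tmp}/caracal-venv"
python3 -m venv --without-pip "$VENV"
. "$VENV/bin/activate"
# with network access one would now run:
#   pip install caracal==1.0.7   # pulled in stimela 1.7.9 at the time of this thread
python -c 'import sys; print(sys.prefix)'
```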

pharaofranz commented 10 months ago

Haha, fun fact: it would appear somebody updated stimela to version 1.7.9 since yesterday. This comes with a bunch of new singularity images. It just happened to coincide with my system-python test, leading me to false conclusions, as the system-python build of caracal pulled in stimela 1.7.9... Everything works fine with stimela 1.7.9 -- both with the system python and with the module-load python.

paoloserra commented 10 months ago

OK @pharaofranz , many thanks for the detailed reporting. Maybe one of the Stimela folks could comment on this before we close the issue?

Athanaseus commented 10 months ago

Hi @pharaofranz, thank you for the issue. I can confirm that the latest stimela-classic version 1.7.9 added support for apptainer/singularity. I will also attempt to reproduce this error to see why it occurred. @francescaLoi may have reported a related issue due to a different environment setup. I'm preparing a pre-release of caracal with the latest updates, and any apptainer-related issues can be tested against this.

Athanaseus commented 1 month ago

Please re-open if you are still experiencing the issue.