ACES-CMZ / reduction_ACES

Reduction scripts and tools for ACES
https://worldwidetelescope.org/webclient/?wtml=https://data.rc.ufl.edu/pub/adamginsburg/ACES/mosaics/mosaics.wtml

Reimaging of A001_X15a0_X190: Segmentation fault #80

Closed: keflavich closed this issue 2 years ago

keflavich commented 2 years ago

In attempting to re-image 190, I get the following logs:

run_inplace_mpi_19876601.log casa-20220302-161732.log

The run failed with:

srun: error: c0707a-s8: task 0: Segmentation fault (core dumped)

(srun is being used in place of mpirun following recommendations from the HPC team here)

The failure happens in the imageprecheck step, which I'll try skipping.

The casa log is from an earlier run, I think, and complains about bad pickling.

keflavich commented 2 years ago

This segfault is persistent:

2022-03-02 21:16:43 INFO: casatools.imager.selectvis(vis='uid___A002_Xf53eeb_X323e.ms', field='Sgr_A_star', spw=33, scan='10,7,16,13', taql='ANTENNA1==0||
...
8||ANTENNA2==39||ANTENNA2==40', usescratch=False, writeaccess=False) CASA tool call took 11.562737s
srun: error: c0704a-s21: task 0: Segmentation fault (core dumped)

Looks like this is happening after the selectvis call.

keflavich commented 2 years ago

Ah, actually, it didn't need that large a selection:

2022-03-02 21:19:17 INFO: casatools.imager.selectvis(vis='uid___A002_Xf531c1_X16b2.ms', field='Sgr_A_star', spw=31, scan='10,7,16,13', taql='ANTENNA1==0||ANTENNA1==1||ANTENNA1==2||ANTENNA1==3||ANTENNA1==4||ANTENNA1==5||ANTENNA1==6||ANTENNA1==7||ANTENNA1==8||ANTENNA1==9||ANTENNA1==10||ANTENNA1==11||ANTENNA1==12||ANTENNA1==13||ANTENNA1==14||ANTENNA1==15||ANTENNA1==16||ANTENNA1==17||ANTENNA1==18||ANTENNA1==19||ANTENNA1==20||ANTENNA1==21||ANTENNA1==22||ANTENNA1==23||ANTENNA1==24||ANTENNA1==25||ANTENNA1==26||ANTENNA1==27||ANTENNA1==28||ANTENNA1==29||ANTENNA1==30||ANTENNA1==31||ANTENNA1==32||ANTENNA1==33||ANTENNA1==34||ANTENNA1==35||ANTENNA1==36||ANTENNA1==37||ANTENNA1==38||ANTENNA1==39||ANTENNA1==40&&ANTENNA2==0||ANTENNA2==1||ANTENNA2==2||ANTENNA2==3||ANTENNA2==4||ANTENNA2==5||ANTENNA2==6||ANTENNA2==7||ANTENNA2==8||ANTENNA2==9||ANTENNA2==10||ANTENNA2==11||ANTENNA2==12||ANTENNA2==13||ANTENNA2==14||ANTENNA2==15||ANTENNA2==16||ANTENNA2==17||ANTENNA2==18||ANTENNA2==19||ANTENNA2==20||ANTENNA2==21||ANTENNA2==22||ANTENNA2==23||ANTENNA2==24||ANTENNA2==25||ANTENNA2==26||ANTENNA2==27||ANTENNA2==28||ANTENNA2==29||ANTENNA2==30||ANTENNA2==31||ANTENNA2==32||ANTENNA2==33||ANTENNA2==34||ANTENNA2==35||ANTENNA2==36||ANTENNA2==37||ANTENNA2==38||ANTENNA2==39||ANTENNA2==40', usescratch=False, writeaccess=False) CASA tool call took 11.562737s
srun: error: c0704a-s21: task 0: Segmentation fault (core dumped)

This occurs in the file pipeline/hif/heuristics/imageparams_base.py, in the synthesized_beams function.

afaict, the next thing CASA does is:

rtn = casa_tools.imager.advise(takeadvice=False, amplitudeloss=0.5, fieldofview=aipsfieldofview)
casa_tools.imager.done()

but there's a lot that gets done without any log messages, so I can't tell for sure whether the failure is in that call or somewhere earlier or later.

keflavich commented 2 years ago

I've narrowed it down to the advise line.
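
For reference, a minimal standalone sketch that could be used to check whether imager.advise alone reproduces the crash outside the pipeline (the selectvis arguments and the advise parameters mirror the log excerpts in this thread; the rest is an assumption about how the pipeline drives the tool, not the pipeline's actual code):

from casatools import imager

# Hypothetical reproduction sketch: repeat the selectvis/advise sequence from
# the pipeline logs interactively and see whether advise() segfaults.
im = imager()
im.selectvis(vis='uid___A002_Xf53eeb_X323e.ms', field='Sgr_A_star',
             spw=33, scan='10,7,16,13')
# advise() estimates imaging parameters for a given field of view;
# takeadvice=False only returns the recommendation without applying it.
rtn = im.advise(takeadvice=False, amplitudeloss=0.5,
                fieldofview='126.3arcsec')
im.done()
print(rtn)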

keflavich commented 2 years ago

I managed to get more information out by running with mpirun:

2022-03-08 15:48:46 INFO: Running casa_tools.imager.advise with fieldofview=126.3arcsec
[c0706a-s15:124498] *** Process received signal ***
[c0706a-s15:124498] Signal: Segmentation fault (11)
[c0706a-s15:124498] Signal code: Invalid permissions (2)
[c0706a-s15:124498] Failing at address: 0x92e3040
[c0706a-s15:124498] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aedfab14630]
[c0706a-s15:124498] [ 1] [0x92e3040]
[c0706a-s15:124498] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 124498 on node c0706a-s15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
piposona commented 2 years ago

I've contacted the data processing team at JAO and am waiting for a reply, in case this is obvious to them.

keflavich commented 2 years ago

There's no obvious sign of corrupted data; the downloaded files all have the right file size and the ASDM tarballs extracted without issue. I'm deleting the calibrated/ directory and re-running the pipeline anyway, when I get to it.

piposona commented 2 years ago

I got an answer from them, which I translate here (the original was in Spanish).

A segmentation fault is too obscure, as it could be anything. I have been checking the CASA log and I found the following:

2022-03-02 16:17:47  INFO    hifa_imageprecheck::::casa            ##########################################

2022-03-02 16:17:47  INFO    hifa_imageprecheck::::casa    ##### Begin Task: hifa_imageprecheck #####

2022-03-02 16:17:47  INFO    hifa_imageprecheck::::casa    hifa_imageprecheck( vis=[], calcsb=False, parallel='automatic', pipelinemode='automatic', dryrun=False, acceptresults=True )

2022-03-02 16:17:47  INFO    hifa_imageprecheck::pipeline.infrastructure.basetask::casa Starting execution for stage 2

2022-03-02 16:17:47  INFO    hifa_imageprecheck::pipeline.infrastructure.basetask::casa Equivalent CASA call: hifa_imageprecheck(parallel='automatic', calcsb=False)

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa Error executing pipeline task hifa_imageprecheck.

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa Traceback (most recent call last):

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa+            File "/blue/adamginsburg/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/Pipeline-2021.2.0.128-py3.6.egg/pipeline/infrastructure/basetask.py", line 660, in execute

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa+              result = self.prepare(**prepare_parameters)

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa+            File "/blue/adamginsburg/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/Pipeline-2021.2.0.128-py3.6.egg/pipeline/hifa/tasks/imageprecheck/imageprecheck.py", line 150, in prepare

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa+              repr_target, repr_source, repr_spw, repr_freq, reprBW_mode, real_repr_target, minAcceptableAngResolution, maxAcceptableAngResolution, maxAllowedBeamAxialRatio, sensitivityGoal = image_heuristics.representative_target()

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa+            File "/blue/adamginsburg/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/Pipeline-2021.2.0.128-py3.6.egg/pipeline/hif/heuristics/imageparams_base.py", line 911, in representative_target

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa+              repr_target = repr_ms.representative_target

2022-03-02 16:17:47  SEVERE            hifa_imageprecheck::pipeline.infrastructure.basetask::casa+            AttributeError: 'NoneType' object has no attribute 'representative_target'

2022-03-02 16:17:48  SEVERE            hifa_imageprecheck::pipeline.h.cli.utils::casa            Warning: Check merge to context for FailedTaskResults

2022-03-02 16:17:48  SEVERE            hifa_imageprecheck::::casa    Task hifa_imageprecheck raised an exception of class TypeError with the following message: unsupported operand type(s) for -: 'NoneType' and 'datetime.datetime'

2022-03-02 16:17:48  INFO    hifa_imageprecheck::::casa    Task hifa_imageprecheck complete. Start time: 2022-03-02 11:17:47.217459 End time: 2022-03-02 11:17:47.959564

2022-03-02 16:17:48  INFO    hifa_imageprecheck::::casa    ##### End Task: hifa_imageprecheck   #####

2022-03-02 16:17:48  INFO    hifa_imageprecheck::::casa            ##########################################

No idea what you are trying to do, but it seems that you are not including the representative target in the pipeline execution, and therefore the imageprecheck task fails. Maybe you split the MS and removed that target? Further down the CASA log it complains about the same thing in the hif_makeimlist task. Looking at the other log (not sure where it comes from), the following appears:

2022-03-02 16:50:49 SEVERE mstransform::::casa Task mstransform raised an exception of class RuntimeError with the following message: Desired column (CORRECTED_DATA) not found in the input MS (/orange/adamginsburg/ACES/rawdata/2021.1.00172.L/science_goal.uid___A001_X1590_X30a8/group.uid___A001_X1590_X30a9/member.uid___A001_X15a0_X190/calibrated/working/uid___A002_Xf53eeb_X323e_target.ms).

That is, it seems that you are not using a calibrated MS (i.e., one containing a CORRECTED_DATA column)?

She is happy to help further, but at this point we probably need to explain to her exactly what we did with the data after downloading them from the archive. I guess there was no major change or split involved, right?

keflavich commented 2 years ago

Thanks, this is helpful! No, I did nothing except run the pipeline.

keflavich commented 2 years ago

I re-ran things from scratch and am now encountering a different error (which could perhaps be related to MPI):

2022-03-10 19:52:31 INFO: Executing plotms(vis='uid___A002_Xf531c1_X16b2.ms', xaxis='azimuth', yaxis='elevation', spw='25:0~0,27:0~0,29:0~0,31:0~0,33:0~0,35:0~0', antenna='0&&*', avgchannel='9000', avgtime='10', coloraxis='field', customflaggedsymbol=True, flaggedsymbolshape='autoscaling', title='Elevation vs Azimuth for uid___A002_Xf531c1_X16b2.ms', plotfile='azel.png', showgui=False, clearplots=True)
fuse: failed to exec fusermount: No such file or directory

Cannot mount AppImage, please check your FUSE setup.
You might still be able to extract the contents of this AppImage
if you run it with the --appimage-extract option.
See https://github.com/AppImage/AppImageKit/wiki/FUSE
for more information
open dir error: No such file or directory
piposona commented 2 years ago

Do you want me to download the data from the archive (not from Globus) and run the scriptForPI from scratch, to see if I can reproduce the error?

keflavich commented 2 years ago

This isn't the scriptForPI, this is reimaging:

from casarecipes.almahelpers import fixsyscaltimes # SACM/JAO - Fixes
from casatasks import fixplanets
context = h_init()
context.set_state('ProjectSummary', 'proposal_code', '2021.1.00172.L')
context.set_state('ProjectSummary', 'proposal_title', 'unknown')
context.set_state('ProjectSummary', 'piname', 'unknown')

import os
import sys
cwd = os.getcwd()
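# Derive the MOUS uid (e.g. uid://A001/X15a0/X190) from the working directory,
# which is assumed to look like .../member.uid___A001_X15a0_X190/calibrated/working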
mous = cwd.split("/")[-3].split(".")[1].split("_")
ous_entity_id = f"{mous[0]}://{mous[3]}/{mous[4]}/{mous[5]}"

context.set_state('ProjectStructure', 'ous_entity_id', ous_entity_id)
context.set_state('ProjectStructure', 'ous_part_id', 'Undefined')
context.set_state('ProjectStructure', 'ous_title', 'Undefined')
context.set_state('ProjectStructure', 'ousstatus_entity_id', ous_entity_id)
context.set_state('ProjectStructure', 'recipe_name', 'hifa_calimage_renorm')

import glob
mses = glob.glob("*.ms")
if len(mses) == 0:
    raise ValueError("No MSes found")

print(f"Starting imaging pipeline rerun for mses {mses} and for mouse {mous} in {cwd}")

try:
    hifa_importdata(vis=mses, dbservice=False)
    hif_mstransform(pipelinemode="automatic")
    #hifa_flagtargets(pipelinemode="automatic")
    #hifa_imageprecheck(pipelinemode="automatic")
    hif_makeimlist(specmode='mfs')
    hif_findcont(pipelinemode="automatic")
    hif_uvcontfit(pipelinemode="automatic", fitorder=0)
    hif_uvcontsub(pipelinemode="automatic")
    hif_makeimages(pipelinemode="automatic")
    hif_makeimlist(specmode='cont')
    hif_makeimages(pipelinemode="automatic")
    hif_makeimlist(specmode='cube')
    hif_makeimages(pipelinemode="automatic")
finally:
    h_save()

But, yes, that would be very helpful!

piposona commented 2 years ago

I will try, but I may need to be given access in the ALMA portal to download from the archive, unless you want me to try from Globus. If I manage to reproduce the error here, that would be much easier to share with the data managers.

keflavich commented 2 years ago

The current error doesn't look like data corruption, so downloading from Globus is fine.

piposona commented 2 years ago

Ok, I will try. Give me a few days...

keflavich commented 2 years ago

The latest issue, about fusermount failing, has a very hacky workaround.

There's supposed to be an easy, well-supported workaround, which is to do:

export APPIMAGE_EXTRACT_AND_RUN=1
${CASAPATH}/lib/py/lib/python3.6/site-packages/casaplotms/__bin__/casaplotms-x86_64.AppImage

which is what CASA calls internally when you call plotms. However, that environment variable had no effect when I set it:

APPIMAGE_EXTRACT_AND_RUN=1 /orange/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/casaplotms/__bin__/casaplotms-x86_64.AppImage
fuse: failed to exec fusermount: No such file or directory

Cannot mount AppImage, please check your FUSE setup.
You might still be able to extract the contents of this AppImage
if you run it with the --appimage-extract option.
See https://github.com/AppImage/AppImageKit/wiki/FUSE
for more information
open dir error: No such file or directory

I have verified that this exact syntax works with CARTA, so it's something unique about CASA's tools - possibly the AppImage version.

So, the hack is to change the CASA source:

                    __proc = Popen( [ app_path, '--appimage-extract',
                                     '--nopopups', '--logfilename=%s' % casalog.logfile( ),
                                    '--server=%s' % ctsys.registry( )['uri'] ] + data_path,
#                                    stdin=DEVNULL, stdout=DEVNULL, stderr=STDOUT,
#                                   stdin=DEVNULL, stdout=STDOUT, stderr=STDOUT,
                                    close_fds=True )
                         __proc = Popen( [ app_path, '--appimage-extract', '--nopopups', '--logfilename=%s' % casalog.logfile( ), '--server=%s' % np_path ] + data_path )

to add the --appimage-extract argument as the first argument after the app name (if it is not the first argument, this doesn't work).

This is a terrible and unsustainable solution, but for now it works.

keflavich commented 2 years ago

The hack above changed the behavior of plotms, but didn't solve any problems.

keflavich commented 2 years ago

I've gotten it working by making __launch() return immediately in lib/py/lib/python3.6/site-packages/casaplotms/private/plotmstool.py. Again, this is an atrocious hack that guarantees no plots will be made, but at least it's running the pipeline now.
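
For the record, a minimal sketch of that second hack (the real __launch() in plotmstool.py takes arguments and starts the casaplotms AppImage; this hypothetical stub just skips it, which is why no plots get made):

# In casaplotms/private/plotmstool.py: hypothetical stub standing in for the
# original __launch(), so the plotms AppImage is never started and FUSE is
# never needed. Plot generation is silently skipped.
def __launch(*args, **kwargs):
    return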

piposona commented 2 years ago

I was about to run the script today, since the data were just downloaded. Does this mean it is not needed?

keflavich commented 2 years ago

No, no, I had to hack CASA to death (it is really dead now) to get it to run. I think it works now, but not in a nice and reproducible way!

piposona commented 2 years ago

Believe it or not, the script is still running. Just started the first hif_makeimages(pipelinemode="automatic").

keflavich commented 2 years ago

You're running with MPI, right? It is quite clear, on my end, that the segfaults etc. are caused by MPI.

piposona commented 2 years ago

Oh! That is a good question! I think I am not, or at least I did not force it in any way. How do you make it work with MPI? After this thing is done (or even in parallel) I could try with MPI.

keflavich commented 2 years ago

The command I use looks like this:

xvfb-run -d ${MPICASA} -n $SLURM_NTASKS ${CASA} --logfile=${LOGFILENAME} --pipeline --nogui --nologger --rcdir=/tmp -c "execfile('$(basename ${script})')"
keflavich commented 2 years ago

Useful reading for my reference: https://almascience.eso.org/euarcdata/itrain01/I-TRAIN-01_Imaging_Pipeline_Tafoya.pdf

Also, the issue noted above about 'missing corrected data' arose because I was re-running the pipeline in the same place and it was trying to re-split data that were already split. I now catch this error (a sketch of such a guard is below).
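
For reference, a hedged sketch of the kind of guard that catches this; the check and filenames are illustrative, not the exact code used:

import glob

# Hypothetical guard: if *_target.ms splits already exist in this directory,
# the pipeline has been run here before, so bail out rather than letting
# hif_mstransform try to re-split the already-split data.
already_split = glob.glob("*_target.ms")
if already_split:
    raise ValueError(f"Found existing target MSes {already_split}; "
                     "refusing to re-run the split step in place.")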

keflavich commented 2 years ago

Additional reading: https://casaguides.nrao.edu/index.php?title=ALMA_Cycle_8_Imaging_Pipeline_Reprocessing

keflavich commented 2 years ago

Ah, so, I had a partial success once, but I ran into the classic 'failed in degridding' error: https://data.rc.ufl.edu/secure/adamginsburg/ACES/weblogs-reimaging/failures/pipeline-20220313T155510/html/t2-4m.html?sidebar=sidebar_stage4&ms=all&subpage=t2-4m_details.html

I bet that is this issue: "tclean: In order to run across a cluster, parallel=True should be specified in tclean. However, if savemodel='modelcolumn' is also specified, it triggers some kind of a race condition between the different nodes where they are competing for write access, and the task crashes. So setting savemodel='virtual' or savemodel='none' are the only options that work. Both the makePSF step and the major cycles of deconvolution are openMP aware, and can exploit additional resources specified via --cpus-per-task in the SLURM sbatch file." EDIT: no, it's not that; savemodel='none' was already set in the failed stage.
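
To illustrate the quoted constraint (even though it turned out not to be the culprit here), a hedged sketch of a tclean call following that advice; every value other than parallel and savemodel is a placeholder:

from casatasks import tclean

# Hypothetical illustration: when running across a cluster with parallel=True,
# use savemodel='none' (or 'virtual') rather than 'modelcolumn' to avoid the
# write-access race condition described in the quote above.
tclean(vis='uid___A002_Xf531c1_X16b2_target.ms',   # placeholder MS
       imagename='Sgr_A_star_sci.spw31.cube',      # placeholder image name
       specmode='cube', imsize=[500, 500], cell='0.3arcsec',
       niter=1000,
       parallel=True,
       savemodel='none')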

keflavich commented 2 years ago

From communication with Kumar Golap, some of the issues I reported above boil down to this:

Exception for chan range [1404, 1413] ---   FilebufIO::readBlock - incorrect number of bytes read for file /orange/adamginsburg/ACES/rawdata/2021.1.00172.L/science_goal.uid___A001_X1590_X30a8/group.uid___A001_X1590_X30a9/member.uid___A001_X15a0_X130/calibrated/working/uid___A001_X15a0_X130.s11_0.Sgr_A_star_sci.spw27.cube.I.iter1.sumwt/table.f0

which is a known issue solved in 6.4.1 (https://open-jira.nrao.edu/browse/CAS-13694)

keflavich commented 2 years ago

The CASA pipeline prerelease is here: https://casa.nrao.edu/download/distro/casa/releaseprep/casa-6.4.3-2-pipeline-2021.3.0.17-py3.6.tar.xz

piposona commented 2 years ago

As reported in today's meeting agenda, when trying to reproduce the issue on the JAO cluster:

To be discussed. I assume the error should be obvious and processing should have been stopped.

keflavich commented 2 years ago

Yes, it should have. I think this is an architecture issue, unfortunately. Kumar seems to be on it, but there is still no open ticketing system for us to keep track of it.

keflavich commented 2 years ago

Closing as "unfixable": We've given up using MPI.