keflavich closed this issue 2 years ago.
This segfault is persistent:
2022-03-02 21:16:43 INFO: casatools.imager.selectvis(vis='uid___A002_Xf53eeb_X323e.ms', field='Sgr_A_star', spw=33, scan='10,7,16,13', taql='ANTENNA1==0||
...
8||ANTENNA2==39||ANTENNA2==40', usescratch=False, writeaccess=False) CASA tool call took 11.562737s
srun: error: c0704a-s21: task 0: Segmentation fault (core dumped)
Looks like this is happening after the selectvis call.
ah, actually, didn't need that large a selection:
2022-03-02 21:19:17 INFO: casatools.imager.selectvis(vis='uid___A002_Xf531c1_X16b2.ms', field='Sgr_A_star', spw=31, scan='10,7,16,13', taql='ANTENNA1==0||ANTENNA1==1||ANTENNA1==2||ANTENNA1==3||ANTENNA1==4||ANTENNA1==5||ANTENNA1==6||ANTENNA1==7||ANTENNA1==8||ANTENNA1==9||ANTENNA1==10||ANTENNA1==11||ANTENNA1==12||ANTENNA1==13||ANTENNA1==14||ANTENNA1==15||ANTENNA1==16||ANTENNA1==17||ANTENNA1==18||ANTENNA1==19||ANTENNA1==20||ANTENNA1==21||ANTENNA1==22||ANTENNA1==23||ANTENNA1==24||ANTENNA1==25||ANTENNA1==26||ANTENNA1==27||ANTENNA1==28||ANTENNA1==29||ANTENNA1==30||ANTENNA1==31||ANTENNA1==32||ANTENNA1==33||ANTENNA1==34||ANTENNA1==35||ANTENNA1==36||ANTENNA1==37||ANTENNA1==38||ANTENNA1==39||ANTENNA1==40&&ANTENNA2==0||ANTENNA2==1||ANTENNA2==2||ANTENNA2==3||ANTENNA2==4||ANTENNA2==5||ANTENNA2==6||ANTENNA2==7||ANTENNA2==8||ANTENNA2==9||ANTENNA2==10||ANTENNA2==11||ANTENNA2==12||ANTENNA2==13||ANTENNA2==14||ANTENNA2==15||ANTENNA2==16||ANTENNA2==17||ANTENNA2==18||ANTENNA2==19||ANTENNA2==20||ANTENNA2==21||ANTENNA2==22||ANTENNA2==23||ANTENNA2==24||ANTENNA2==25||ANTENNA2==26||ANTENNA2==27||ANTENNA2==28||ANTENNA2==29||ANTENNA2==30||ANTENNA2==31||ANTENNA2==32||ANTENNA2==33||ANTENNA2==34||ANTENNA2==35||ANTENNA2==36||ANTENNA2==37||ANTENNA2==38||ANTENNA2==39||ANTENNA2==40', usescratch=False, writeaccess=False) CASA tool call took 11.562737s
srun: error: c0704a-s21: task 0: Segmentation fault (core dumped)
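As an aside, the giant taql string above is just every antenna index from 0 to 40 OR'd together for both ANTENNA1 and ANTENNA2. A sketch of how such a selection string can be generated (build_antenna_taql is a hypothetical helper, not pipeline code):

```python
# Sketch (not the pipeline's actual code): build the exhaustive
# ANTENNA1/ANTENNA2 TaQL selection string seen in the log above.
def build_antenna_taql(n_antennas=41):
    ant1 = "||".join(f"ANTENNA1=={i}" for i in range(n_antennas))
    ant2 = "||".join(f"ANTENNA2=={i}" for i in range(n_antennas))
    # the '&&' joins the last ANTENNA1 term to the first ANTENNA2 term,
    # exactly as in the logged call
    return f"{ant1}&&{ant2}"

taql = build_antenna_taql()
```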
This occurs in the file pipeline/hif/heuristics/imageparams_base.py, in the synthesized_beams function.
afaict, the next thing CASA does is:
rtn = casa_tools.imager.advise(takeadvice=False, amplitudeloss=0.5, fieldofview=aipsfieldofview)
casa_tools.imager.done()
but there's a lot of stuff that gets done w/o any log messages, so I can't tell for sure if the failure is on that or earlier or later.
I've narrowed it down to the advise line.
I managed to get more information out by running with mpirun:
2022-03-08 15:48:46 INFO: Running casa_tools.imager.advise with fieldofview=126.3arcsec
[c0706a-s15:124498] *** Process received signal ***
[c0706a-s15:124498] Signal: Segmentation fault (11)
[c0706a-s15:124498] Signal code: Invalid permissions (2)
[c0706a-s15:124498] Failing at address: 0x92e3040
[c0706a-s15:124498] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aedfab14630]
[c0706a-s15:124498] [ 1] [0x92e3040]
[c0706a-s15:124498] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 124498 on node c0706a-s15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Data processing team at JAO contacted, and waiting for a reply, in case this is obvious to them.
There's no obvious sign of corrupted data; the downloaded files all have the right file size and the ASDM tarballs extracted without issue. I'm deleting the calibrated/ directory and re-running the pipeline anyway, when I get to it.
I got an answer from them which I translate here (since that was in Spanish).
A segmentation fault is too obscure, as it could be anything. I have been checking the CASA log, and I found the following:
2022-03-02 16:17:47 INFO hifa_imageprecheck::::casa ##########################################
2022-03-02 16:17:47 INFO hifa_imageprecheck::::casa ##### Begin Task: hifa_imageprecheck #####
2022-03-02 16:17:47 INFO hifa_imageprecheck::::casa hifa_imageprecheck( vis=[], calcsb=False, parallel='automatic', pipelinemode='automatic', dryrun=False, acceptresults=True )
2022-03-02 16:17:47 INFO hifa_imageprecheck::pipeline.infrastructure.basetask::casa Starting execution for stage 2
2022-03-02 16:17:47 INFO hifa_imageprecheck::pipeline.infrastructure.basetask::casa Equivalent CASA call: hifa_imageprecheck(parallel='automatic', calcsb=False)
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa Error executing pipeline task hifa_imageprecheck.
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa Traceback (most recent call last):
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa+ File "/blue/adamginsburg/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/Pipeline-2021.2.0.128-py3.6.egg/pipeline/infrastructure/basetask.py", line 660, in execute
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa+ result = self.prepare(**prepare_parameters)
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa+ File "/blue/adamginsburg/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/Pipeline-2021.2.0.128-py3.6.egg/pipeline/hifa/tasks/imageprecheck/imageprecheck.py", line 150, in prepare
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa+ repr_target, repr_source, repr_spw, repr_freq, reprBW_mode, real_repr_target, minAcceptableAngResolution, maxAcceptableAngResolution, maxAllowedBeamAxialRatio, sensitivityGoal = image_heuristics.representative_target()
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa+ File "/blue/adamginsburg/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/Pipeline-2021.2.0.128-py3.6.egg/pipeline/hif/heuristics/imageparams_base.py", line 911, in representative_target
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa+ repr_target = repr_ms.representative_target
2022-03-02 16:17:47 SEVERE hifa_imageprecheck::pipeline.infrastructure.basetask::casa+ AttributeError: 'NoneType' object has no attribute 'representative_target'
2022-03-02 16:17:48 SEVERE hifa_imageprecheck::pipeline.h.cli.utils::casa Warning: Check merge to context for FailedTaskResults
2022-03-02 16:17:48 SEVERE hifa_imageprecheck::::casa Task hifa_imageprecheck raised an exception of class TypeError with the following message: unsupported operand type(s) for -: 'NoneType' and 'datetime.datetime'
2022-03-02 16:17:48 INFO hifa_imageprecheck::::casa Task hifa_imageprecheck complete. Start time: 2022-03-02 11:17:47.217459 End time: 2022-03-02 11:17:47.959564
2022-03-02 16:17:48 INFO hifa_imageprecheck::::casa ##### End Task: hifa_imageprecheck #####
2022-03-02 16:17:48 INFO hifa_imageprecheck::::casa ##########################################
No idea what you are trying to do, but it seems that you are not including the representative target in the pipeline execution, and therefore the imageprecheck task fails. Maybe you split the MS and removed that target? Further down the CASA log, it complains about the same thing in the hif_makeimlist task. Looking at the other log (not sure where it comes from), the following appears:
2022-03-02 16:50:49 SEVERE mstransform::::casa Task mstransform raised an exception of class RuntimeError with the following message: Desired column (CORRECTED_DATA) not found in the input MS (/orange/adamginsburg/ACES/rawdata/2021.1.00172.L/science_goal.uid___A001_X1590_X30a8/group.uid___A001_X1590_X30a9/member.uid___A001_X15a0_X190/calibrated/working/uid___A002_Xf53eeb_X323e_target.ms).
That is, it seems that you are not using a calibrated MS (i.e. containing CORRECTED_DATA column)?
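For reference, a quick way to test the calibrated-MS hypothesis is to look for the CORRECTED_DATA column. A sketch with the check factored out as a pure function, assuming the column list comes from casatools' table tool (tb.colnames()):

```python
# Sketch: check whether an MS looks calibrated, i.e. whether it
# carries a CORRECTED_DATA column. The casatools usage is commented
# out so the pure-Python check stands alone.
def is_calibrated(colnames):
    return "CORRECTED_DATA" in colnames

# with casatools (assumed API):
# from casatools import table
# tb = table()
# tb.open("uid___A002_Xf53eeb_X323e_target.ms")
# print(is_calibrated(tb.colnames()))
# tb.close()
```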
She is happy to help further, but at this point we probably need to explain to her exactly what we did with the data after downloading them from the archive. I guess there was not any major change or any split involved, right?
Thanks, this is helpful! No, I did nothing except run the pipeline.
I re-ran things from scratch, now encountering a different error (which could be related to mpi, perhaps):
2022-03-10 19:52:31 INFO: Executing plotms(vis='uid___A002_Xf531c1_X16b2.ms', xaxis='azimuth', yaxis='elevation', spw='25:0~0,27:0~0,29:0~0,31:0~0,33:0~0,35:0~0', antenna='0&&*', avgchannel='9000', avgtime='10', coloraxis='field', customflaggedsymbol=True, flaggedsymbolshape='autoscaling', title='Elevation vs Azimuth for uid___A002_Xf531c1_X16b2.ms', plotfile='azel.png', showgui=False, clearplots=True)
fuse: failed to exec fusermount: No such file or directory
Cannot mount AppImage, please check your FUSE setup.
You might still be able to extract the contents of this AppImage
if you run it with the --appimage-extract option.
See https://github.com/AppImage/AppImageKit/wiki/FUSE
for more information
open dir error: No such file or directory
Do you want me to download the data from the archive (not from Globus) and run the scriptForPI from scratch, to see if I reproduce the error?
This isn't the scriptForPI, this is reimaging:
from casarecipes.almahelpers import fixsyscaltimes # SACM/JAO - Fixes
from casatasks import fixplanets
context = h_init()
context.set_state('ProjectSummary', 'proposal_code', '2021.1.00172.L')
context.set_state('ProjectSummary', 'proposal_title', 'unknown')
context.set_state('ProjectSummary', 'piname', 'unknown')
import os
import sys
cwd = os.getcwd()
mous = cwd.split("/")[-3].split(".")[1].split("_")
ous_entity_id = f"{mous[0]}://{mous[3]}/{mous[4]}/{mous[5]}"
context.set_state('ProjectStructure', 'ous_entity_id', ous_entity_id)
context.set_state('ProjectStructure', 'ous_part_id', 'Undefined')
context.set_state('ProjectStructure', 'ous_title', 'Undefined')
context.set_state('ProjectStructure', 'ousstatus_entity_id', ous_entity_id)
context.set_state('ProjectStructure', 'recipe_name', 'hifa_calimage_renorm')
import glob
mses = glob.glob("*.ms")
if len(mses) == 0:
    raise ValueError("No MSes found")
print(f"Starting imaging pipeline rerun for mses {mses} and for mous {mous} in {cwd}")
try:
    hifa_importdata(vis=mses, dbservice=False)
    hif_mstransform(pipelinemode="automatic")
    #hifa_flagtargets(pipelinemode="automatic")
    #hifa_imageprecheck(pipelinemode="automatic")
    hif_makeimlist(specmode='mfs')
    hif_findcont(pipelinemode="automatic")
    hif_uvcontfit(pipelinemode="automatic", fitorder=0)
    hif_uvcontsub(pipelinemode="automatic")
    hif_makeimages(pipelinemode="automatic")
    hif_makeimlist(specmode='cont')
    hif_makeimages(pipelinemode="automatic")
    hif_makeimlist(specmode='cube')
    hif_makeimages(pipelinemode="automatic")
finally:
    h_save()
But, yes, that would be very helpful!
I will try, but I may need to be given access in the ALMA portal to download from the archive, unless you want me to try from Globus. If I manage to reproduce the error here, that would be much easier to share with the data managers.
The current error doesn't look like data corruption, so downloading from Globus is fine.
Ok, I will try. Give me a few days...
The latest issue, about fusermount failing, has a very hacky workaround.
There's supposed to be an easy, well-supported workaround, which is to do:
export APPIMAGE_EXTRACT_AND_RUN=1
${CASAPATH}/lib/py/lib/python3.6/site-packages/casaplotms/__bin__/casaplotms-x86_64.AppImage
which is what CASA calls internally when you call plotms. However, that environment variable had no effect when I set it:
APPIMAGE_EXTRACT_AND_RUN=1 /orange/adamginsburg/casa/casa-6.2.1-7-pipeline-2021.2.0.128/lib/py/lib/python3.6/site-packages/casaplotms/__bin__/casaplotms-x86_64.AppImage
fuse: failed to exec fusermount: No such file or directory
Cannot mount AppImage, please check your FUSE setup.
You might still be able to extract the contents of this AppImage
if you run it with the --appimage-extract option.
See https://github.com/AppImage/AppImageKit/wiki/FUSE
for more information
open dir error: No such file or directory
I have verified that this exact syntax works with CARTA, so it's something unique about CASA's tools - possibly the AppImage version.
So, the hack is to change the CASA source:
__proc = Popen( [ app_path, '--appimage-extract',
'--nopopups', '--logfilename=%s' % casalog.logfile( ),
'--server=%s' % ctsys.registry( )['uri'] ] + data_path,
# stdin=DEVNULL, stdout=DEVNULL, stderr=STDOUT,
# stdin=DEVNULL, stdout=STDOUT, stderr=STDOUT,
close_fds=True )
__proc = Popen( [ app_path, '--appimage-extract', '--nopopups', '--logfilename=%s' % casalog.logfile( ), '--server=%s' % np_path ] + data_path )
to add the --appimage-extract argument as the first argument after the app name (if it is not the first argument, this doesn't work).
This is a terrible and unsustainable solution, but for now it works.
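The ordering constraint can be made explicit with a tiny helper (hypothetical; CASA's actual code builds the list inline, as shown above):

```python
# Sketch: build the Popen argument list with --appimage-extract
# guaranteed to be the first argument after the AppImage path,
# since it only works in that position.
def appimage_argv(app_path, extra_args):
    return [app_path, "--appimage-extract"] + list(extra_args)

# illustrative paths/flags:
argv = appimage_argv(
    "/path/to/casaplotms-x86_64.AppImage",
    ["--nopopups", "--logfilename=casa.log"])
```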
the hack above changed the behavior of plotms, but didn't solve any problems
I've gotten it working by making __launch() return immediately in lib/py/lib/python3.6/site-packages/casaplotms/private/plotmstool.py. Again, this is an atrocious hack that guarantees no plots will be made, but at least the pipeline is running now.
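The same effect can be had without editing the installed file by monkeypatching the function at runtime. A sketch, with a stand-in object in place of the real plotmstool module:

```python
# Sketch: achieve the "__launch returns immediately" hack by
# monkeypatching at runtime instead of editing the installed file.
# SimpleNamespace stands in for casaplotms.private.plotmstool here;
# in CASA you would import the real module and patch it the same way.
from types import SimpleNamespace

plotmstool = SimpleNamespace()          # stand-in module

def noop_launch(*args, **kwargs):
    # never spawn the AppImage, so the FUSE mount is never attempted
    return None

setattr(plotmstool, "__launch", noop_launch)
```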
I was about to run the script today, since the data were just downloaded. Does this mean it is not needed?
no no, I had to hack CASA to death (it is really dead now) to get it to run. I think it works now, but not in a nice and reproducible way!
Believe it or not, the script is still running. Just started the first hif_makeimages(pipelinemode="automatic").
you're running with MPI, right? It is quite clear, on my end, that the segfaults, etc. are caused by MPI
Oh! That is a good question! I think I am not, or at least I did not force it in any way. How do you make it work with MPI? After this thing is done (or even in parallel) I could try with MPI.
The command I use looks like this:
xvfb-run -d ${MPICASA} -n $SLURM_NTASKS ${CASA} --logfile=${LOGFILENAME} --pipeline --nogui --nologger --rcdir=/tmp -c "execfile('$(basename ${script})')"
Useful reading for my reference: https://almascience.eso.org/euarcdata/itrain01/I-TRAIN-01_Imaging_Pipeline_Tafoya.pdf
Also the issue noted above about 'missing corrected data' is because I was re-running the pipeline in the same place and it was trying to re-split data that were already split. I've caught this error now.
Ah, so, I had a partial success once, but I ran into the classic 'failed in degridding' error: https://data.rc.ufl.edu/secure/adamginsburg/ACES/weblogs-reimaging/failures/pipeline-20220313T155510/html/t2-4m.html?sidebar=sidebar_stage4&ms=all&subpage=t2-4m_details.html
I bet that is this issue:
"tclean: In order to run across a cluster, parallel=True should be specified in tclean. However, if savemodel='modelcolumn' is also specified, it triggers some kind of a race condition between the different nodes where they are competing for write access, and the task crashes. So setting savemodel='virtual' or savemodel='none' are the only options that work. Both the makePSF step and the major cycles of deconvolution are openMP aware, and can exploit additional resources specified via --cpus-per-task in the SLURM sbatch file."
EDIT: no, it's not that; savemodel='none' in the failed stage.
From communication with Kumar Golap: some of the issues I reported above come down to this:
Exception for chan range [1404, 1413] --- FilebufIO::readBlock - incorrect number of bytes read for file /orange/adamginsburg/ACES/rawdata/2021.1.00172.L/science_goal.uid___A001_X1590_X30a8/group.uid___A001_X1590_X30a9/member.uid___A001_X15a0_X130/calibrated/working/uid___A001_X15a0_X130.s11_0.Sgr_A_star_sci.spw27.cube.I.iter1.sumwt/table.f0
which is a known issue solved in 6.4.1 (https://open-jira.nrao.edu/browse/CAS-13694)
The CASA pipeline prerelease is here: https://casa.nrao.edu/download/distro/casa/releaseprep/casa-6.4.3-2-pipeline-2021.3.0.17-py3.6.tar.xz
As reported in the meeting agenda today when trying to reproduce the issue in the JAO cluster:
To be discussed. I assume the error should be obvious and processing should have been stopped.
Yes, it should have. I think this is an architecture issue, unfortunately. Kumar seems to be on it, but there is still no open ticketing system for us to keep track of it.
Closing as "unfixable": We've given up using MPI.
In attempting to re-image 190, I get the following logs:
run_inplace_mpi_19876601.log casa-20220302-161732.log
The run failed with:
(srun is being used in place of mpirun, following recommendations from the HPC team here.) The failure happens in the imageprecheck step, which I'll try skipping.
The casa log is from an earlier run, I think, and complains about bad pickling.