ATNF / yandasoft

Astronomical Calibration and Imaging Software

Selavy within Singularity failing #12

Closed davepallot closed 4 years ago

davepallot commented 4 years ago

I am attempting to run Selavy under OpenMPI within Singularity:

mpirun.openmpi --use-hwthread-cpus -np 3 singularity exec yandasoft-openmpi-4.0.2.simg selavy -c ./emu_data/selavy.conf

It is failing to start with the following error:

Debug: registered context Global=0
Debug: registered context Global=0
Debug: registered context Global=0
INFO analysis.askapparallel (1, emu-01) [2020-05-13 03:53:19,206] - ASKAP selavy (parallel) running on 3 nodes (worker 1)
INFO analysis.askapparallel (1, emu-01) [2020-05-13 03:53:19,206] - 0.0.0
INFO analysis.askapparallel (1, emu-01) [2020-05-13 03:53:19,206] - Compiled without OpenMP support
INFO selavy.log (1, emu-01) [2020-05-13 03:53:19,207] - ASKAP source finder ASKAPANALYSIS_VERSION_MAJOR:ASKAPANALYSIS_VERSION_MINOR:ASKAPANALYSIS_VERSION_PATCH
INFO analysis.askapparallel (2, emu-01) [2020-05-13 03:53:19,206] - ASKAP selavy (parallel) running on 3 nodes (worker 2)
INFO analysis.askapparallel (2, emu-01) [2020-05-13 03:53:19,206] - 0.0.0
INFO analysis.askapparallel (2, emu-01) [2020-05-13 03:53:19,206] - Compiled without OpenMP support
INFO selavy.log (2, emu-01) [2020-05-13 03:53:19,206] - ASKAP source finder ASKAPANALYSIS_VERSION_MAJOR:ASKAPANALYSIS_VERSION_MINOR:ASKAPANALYSIS_VERSION_PATCH
INFO analysis.askapparallel (0, emu-01) [2020-05-13 03:53:19,206] - ASKAP selavy (parallel) running on 3 nodes (master/master)
INFO analysis.askapparallel (0, emu-01) [2020-05-13 03:53:19,207] - 0.0.0
INFO analysis.askapparallel (0, emu-01) [2020-05-13 03:53:19,207] - Compiled without OpenMP support
INFO selavy.log (0, emu-01) [2020-05-13 03:53:19,207] - ASKAP source finder ASKAPANALYSIS_VERSION_MAJOR:ASKAPANALYSIS_VERSION_MINOR:ASKAPANALYSIS_VERSION_PATCH
INFO selavy.log (0, emu-01) [2020-05-13 03:53:19,207] - Parset file contents:
Selavy.findSpectralTerms=[true, false]
Selavy.Fitter.doFit=true
Selavy.Fitter.fitTypes=[full]
Selavy.Fitter.maxReducedChisq=10.
Selavy.flagGrowth=true
Selavy.growthThreshold=3
Selavy.image=/data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
Selavy.imagetype=fits
Selavy.minChannels=1
Selavy.minPix=3
Selavy.minVoxels=3
Selavy.nsubx=5
Selavy.nsuby=3
Selavy.snrCut=5
Selavy.sortingParam=-pflux
Selavy.spectralTermsFromTaylor=true
Selavy.threshSpatial=5
Selavy.VariableThreshold=false
Selavy.VariableThreshold.boxSize=500
Selavy.Weights.weightsCutoff=0.09
Selavy.Weights.weightsImage=/data/emu_data/weights.i.pilot10.cont.linmos.taylor.0.fits

INFO analysis.weighter (0, emu-01) [2020-05-13 03:53:19,207] - Using weights image: /data/emu_data/weights.i.pilot10.cont.linmos.taylor.0.fits
WARN analysis.varthresh (2, emu-01) [2020-05-13 03:53:19,215] - Variable Thresholder: reuse=true, but no SNR image name given. Turning reuse off.
INFO analysis.duchampinterface (2, emu-01) [2020-05-13 03:53:19,215] - Changing the mask output file from /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.MASK.fits to image.i.pilot10.cont.linmos.taylor.0.MASK.fits
WARN analysis.varthresh (1, emu-01) [2020-05-13 03:53:19,216] - Variable Thresholder: reuse=true, but no SNR image name given. Turning reuse off.
WARN analysis.varthresh (0, emu-01) [2020-05-13 03:53:19,216] - Variable Thresholder: reuse=true, but no SNR image name given. Turning reuse off.
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,216] - Initialising parallel finder, based on Duchamp v1.6.2
INFO analysis.duchampinterface (1, emu-01) [2020-05-13 03:53:19,216] - Changing the mask output file from /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.MASK.fits to image.i.pilot10.cont.linmos.taylor.0.MASK.fits
INFO analysis.duchampinterface (0, emu-01) [2020-05-13 03:53:19,216] - Changing the mask output file from /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.MASK.fits to image.i.pilot10.cont.linmos.taylor.0.MASK.fits
INFO analysis.CASA (1, emu-01) [2020-05-13 03:53:19,255] - FITSCoordinateUtil::fromFITSHeader::MPIServer-1: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.CASA (2, emu-01) [2020-05-13 03:53:19,255] - FITSCoordinateUtil::fromFITSHeader::MPIServer-2: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.CASA (0, emu-01) [2020-05-13 03:53:19,256] - FITSCoordinateUtil::fromFITSHeader: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,262] - Changed Subimage overlaps to 3,3,0
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,262] - Changed Subimage overlaps to 3,3,0
INFO analysis.CASA (1, emu-01) [2020-05-13 03:53:19,263] - FITSCoordinateUtil::fromFITSHeader::MPIServer-1: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,263] - Changed Subimage overlaps to 3,3,0
INFO selavy.log (0, emu-01) [2020-05-13 03:53:19,263] - Parset file as used:
binaryCatalogue=selavy-catalogue.dpc
casaFile=selavy-results.crf
ds9File=selavy-results.reg
findSpectralTerms=[true, false]
fitAnnotationFile=selavy-fitResults.ann
fitBoxAnnotationFile=selavy-fitResults.boxes.ann
fitResultsFile=selavy-fitResults.txt
Fitter.doFit=true
Fitter.fitTypes=[full]
Fitter.maxReducedChisq=10.
flagGrowth=true
growthThreshold=3
headerFile=selavy-results.hdr
image=/data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
imagetype=fits
karmaFile=selavy-results.ann
logFile=selavy-Logfile.txt
minChannels=1
minPix=3
minVoxels=3
nsubx=5
nsuby=3
overlapx=3
overlapy=3
overlapz=0
resultsFile=selavy-results.txt
snrCut=5
sortingParam=-pflux
spectralTermsFromTaylor=true
spectraTextFile=selavy-spectra.txt
subimageAnnotationFile=selavy-SubimageLocations.ann
threshSpatial=5
VariableThreshold=false
VariableThreshold.boxSize=500
votFile=selavy-results.xml
Weights.weightsCutoff=0.09
Weights.weightsImage=/data/emu_data/weights.i.pilot10.cont.linmos.taylor.0.fits

INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,263] - About to read metadata from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.CASA (2, emu-01) [2020-05-13 03:53:19,264] - FITSCoordinateUtil::fromFITSHeader::MPIServer-2: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,264] - Dimensions of input image = 44911 x 33569 x 1 x 1
INFO analysis.CASA (0, emu-01) [2020-05-13 03:53:19,264] - FITSCoordinateUtil::fromFITSHeader: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,264] - Using subsection [1:8983,1:11190,1:1,1:1]
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,265] - Dimensions of input image = 44911 x 33569 x 1 x 1
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,266] - Using subsection [8982:17965,1:11190,1:1,1:1]
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,266] - Dimensions of input image = 44911 x 33569 x 1 x 1
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,266] - Using subsection [1:44911,1:33569,1:1,1:1]
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,267] - Reading data from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,268] - Reading data from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.casainterface (0, emu-01) [2020-05-13 03:53:19,269] - Read beam from casa image: [13.9585 arcsec, 10.8968 arcsec, -58.1545 deg]
INFO analysis.subimagedef (0, emu-01) [2020-05-13 03:53:19,269] - Input subsection to be used is [,,,] with dimensions 44911x33569x1x1
INFO analysis.subimagedef (0, emu-01) [2020-05-13 03:53:19,270] - Writing annotation file showing subimages to selavy-SubimageLocations.ann
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,270] - Read metadata from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,270] - Dimensions are 44911 33569 1
[emu-01:274595] Process received signal
[emu-01:274595] Signal: Segmentation fault (11)
[emu-01:274595] Signal code: Address not mapped (1)
[emu-01:274595] Failing at address: 0xffffffffffffff18
[emu-01:274595] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f205c2f8f20]
[emu-01:274595] [ 1] /usr/local/lib/libmpi.so.40(MPI_Recv+0x20c)[0x7f2056dae7fc]
[emu-01:274595] [ 2] /usr/local/lib/libaskap_askapparallel.so(_ZN5askap13askapparallel8MPIComms11receiveImplEPvmiim+0xab)[0x7f205cc6efb5]
[emu-01:274595] [ 3] /usr/local/lib/libaskap_askapparallel.so(_ZN5askap13askapparallel8MPIComms7receiveEPvmiim+0x43)[0x7f205cc6eec3]
[emu-01:274595] [ 4] /usr/local/lib/libaskap_askapparallel.so(_ZN5askap13askapparallel13AskapParallel11receiveBlobERN5LOFAR10BlobStringEi+0x64)[0x7f205cc66f24]
[emu-01:274595] [ 5] /usr/local/lib/libaskap_analysis.so(_ZN5askap8analysis8Weighter8findNormEv+0x6b4)[0x7f205e40e00c]
[emu-01:274595] [ 6] /usr/local/lib/libaskap_analysis.so(_ZN5askap8analysis8Weighter10initialiseERN7duchamp4CubeEb+0x5f)[0x7f205e40d367]
[emu-01:274595] [ 7] /usr/local/lib/libaskap_analysis.so(_ZN5askap8analysis15DuchampParallel10preprocessEv+0x185)[0x7f205e3ecc45]
[emu-01:274595] [ 8] selavy(_ZN9SelavyApp3runEiPPc+0x5d8)[0x556d1c7f29a0]
[emu-01:274595] [ 9] /usr/local/lib/libaskap_askap.so(_ZN5askap11Application4mainEiPPc+0xfb)[0x7f205db2ff59]
[emu-01:274595] [10] selavy(main+0x55)[0x556d1c7ef936]
[emu-01:274595] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f205c2dbb97]
[emu-01:274595] [12] selavy(_start+0x2a)[0x556d1c7ee30a]
[emu-01:274595] End of error message

steve-ord commented 4 years ago

OpenMPI is notoriously version-specific, and with Singularity we have to be ABI-compatible with the MPI inside the container - that's why there are so many containers.

Was your Singularity built with the same version of OpenMPI as the one in the container?
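
For reference, a quick way to compare the two sides is to print the OpenMPI version on the host and inside the image (this assumes mpirun is on the PATH inside the container, which I haven't verified):

# host-side OpenMPI version
mpirun --version

# OpenMPI version baked into the image
singularity exec yandasoft-openmpi-4.0.2.simg mpirun --version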

davepallot commented 4 years ago

I'm running the yandasoft-openmpi-4.0.2.simg container with mpirun (Open MPI) 4.0.3 on my system.

steve-ord commented 4 years ago

But was Singularity compiled with OpenMPI 4? That is actually the important dependency. My understanding is that the MPI calls are replaced by Singularity's, and the versions have to match.

steve-ord commented 4 years ago

Oh, and do any other tasks in the container work, in case it is just a Selavy issue?

davepallot commented 4 years ago

I'm not sure; I pulled it from here: https://github.com/ATNF/yandasoft/wiki/How-to-install-using-container

Using: singularity pull docker://csirocass/yandasoft-openmpi-4.0.2

Linmos works within the same container and scales fine.

prlahur commented 4 years ago

Looks like the parallel processing is already working: from the top of the log file, the master and 2 workers are running.

prlahur commented 4 years ago

Hi @davepallot, I wonder if you ran into the same problem with a different dataset?

davepallot commented 4 years ago

@prlahur I haven't attempted to run a different dataset. I'm going to try with a native yandasoft build.

mattwhiting commented 4 years ago

It's possible it is related to the size of the image, although getting a segfault is not necessarily what you'd expect. However, as far as I can tell the communication where it is failing is not sending large amounts of data, so I suspect it is more likely to be a library issue.

davepallot commented 4 years ago

@mattwhiting Would you like me to provide you with the linmos input data cube and selavy configuration script so you can possibly debug the issue?

jmarvil commented 4 years ago

I put together a small test data set and parset that may be useful for testing: https://data.pawsey.org.au/download/emushare/test_selavy/test_selavy.tar

This should run in less than a minute and find about 30 sources.

I can run this with my native build like such: /usr/lib64/mpich-3.2/bin/mpirun -n 10 selavy -c selavy.in

but when I try running with the container like this: /usr/lib64/mpich-3.2/bin/mpirun -n 10 singularity run yandasoft-mpich_latest.sif selavy -c selavy.in

I get either a send or receive error (one or the other, it changes each time I run it):

...
INFO analysis.parallelanalysis (5, pylos) [2020-07-24 13:24:57,588] - Dimensions of input image = 418 x 346
INFO analysis.parallelanalysis (5, pylos) [2020-07-24 13:24:57,588] - Using subsection [90:328,66:280]
INFO analysis.CASA (3, pylos) [2020-07-24 13:24:57,588] - FITSCoordinateUtil::fromFITSHeader: passing empty or nonexistant spectral Coordinate axis
INFO analysis.parallelanalysis (2, pylos) [2020-07-24 13:24:57,588] - Reading data from image image.taylor.0.fits
Fatal error in MPI_Send: Invalid communicator, error stack:
MPI_Send(174): MPI_Send(buf=0x7ffe675c00a0, count=1, MPI_UNSIGNED_LONG, dest=0, tag=0, comm=0x0) failed
MPI_Send(82).: Invalid communicator

or:

...
INFO analysis.parallelanalysis (0, pylos) [2020-07-24 13:32:44,594] - Read metadata from image image.taylor.0.fits
INFO analysis.parallelanalysis (0, pylos) [2020-07-24 13:32:44,594] - Dimensions are 418 346 1
INFO analysis.CASA (1, pylos) [2020-07-24 13:32:44,594] - FITSCoordinateUtil::fromFITSHeader: passing empty or nonexistant spectral Coordinate axis
INFO analysis.parallelanalysis (9, pylos) [2020-07-24 13:32:44,594] - Using subsection [229:418,181:346]
Fatal error in MPI_Recv: Invalid communicator, error stack:
MPI_Recv(200): MPI_Recv(buf=0x7ffc3433a460, count=1, MPI_UNSIGNED_LONG, src=1, tag=0, comm=0x5559, status=0x7ffc3433a480) failed
MPI_Recv(86).: Invalid communicator
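
In case it helps narrow this down, comparing the host and container MPICH builds might be worth it. A minimal check, assuming mpichversion sits next to the host mpirun and is also installed inside the image (I haven't confirmed either):

# host-side MPICH version
/usr/lib64/mpich-3.2/bin/mpichversion

# MPICH version inside the image
singularity exec yandasoft-mpich_latest.sif mpichversion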

I also tried with a different container but that fails to start:

mpirun -np 2 singularity run yandasoft-openmpi-4.0.2_latest.sif selavy

selavy: error while loading shared libraries: libopen-rte.so.40: cannot open shared object file: No such file or directory
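
To see whether that library is genuinely absent from the image or just not on the loader path, something like the following sketch should tell (it assumes bash, which and find are available inside the container):

# list selavy's shared-library dependencies that fail to resolve inside the image
singularity exec yandasoft-openmpi-4.0.2_latest.sif bash -c 'ldd "$(which selavy)" | grep "not found"'

# look for the OpenMPI runtime library anywhere in the image
singularity exec yandasoft-openmpi-4.0.2_latest.sif find / -name 'libopen-rte.so*' 2>/dev/null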

steve-ord commented 4 years ago

That is a very odd error - looks more like an MPI error than a Selavy error. I'll try this myself sometime over the w/e and see if I can replicate it.

davepallot commented 4 years ago

@jmarvil I managed to build my own yandasoft container with OpenMPI v4.0.3. It appears to work on your test data. I will rerun with the full data set.

prlahur commented 4 years ago

@davepallot, @jmarvil, sorry for taking so long. I made some modifications to the Docker image. Following Pawsey's Docker image, the MPICH implementation is now built from source instead of installed via a simple apt-get. I tested it on Galaxy at Pawsey and it worked. I pushed the unofficial Docker image here: https://hub.docker.com/r/lahur/yandasoft-mpich Warning: at the end of the run it will print a lot of complaints about profiling. Please ignore these for the moment; as the message says, they only concern profiling and do not affect the result.
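
For reference, the build-from-source step follows the usual MPICH pattern, roughly like this (the MPICH version and install prefix here are illustrative, not necessarily what the image actually uses):

# fetch, build and install MPICH from source
wget https://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz
tar xzf mpich-3.3.2.tar.gz
cd mpich-3.3.2
./configure --prefix=/usr/local
make -j4 && make install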

prlahur commented 4 years ago

Checked on another HPC (Pearcey), with the MPICH-3.3 module loaded. The small test case works there too. BTW, the MPICH on Galaxy (at Pawsey) is Cray MPICH 7.7 (Cray has its own numbering). MPICH is ABI-compatible across versions and vendors, so the image linked above should work on other machines with MPICH too. This link is for my reference: https://jira.csiro.au/browse/AXA-615
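
In practice, the invocation on such a machine would look something like the following sketch (the module name and the resulting image filename are assumptions; adjust to the local setup):

# load the host MPICH and pull the unofficial image
module load mpich/3.3
singularity pull docker://lahur/yandasoft-mpich

# launch with the host mpirun, executing selavy inside the container
mpirun -n 10 singularity exec yandasoft-mpich_latest.sif selavy -c selavy.in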

davepallot commented 4 years ago

Thanks @prlahur, I'll close this.