Closed · davepallot closed this 4 years ago
openmpi is notoriously version-specific. And with singularity we have to be ABI-compatible with what is inside the container - that's why there are so many containers.
Was your singularity built with the same version of openmpi as is in the container?
I'm running yandasoft-openmpi-4.0.2.simg with mpirun (Open MPI) 4.0.3 on my system.
But was singularity compiled with openmpi 4? That is actually the important dependency. My understanding is that the MPI calls are relayed by singularity, and they have to match.
Oh, and do any other tasks in the container work - in case it is just a selavy issue?
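As a rough check of the ABI point above, one can compare the host's MPI version with the one inside the container. This is only a sketch: the commented commands assume OpenMPI's `mpirun` on the host and the image name used in this thread, and the `same_series` helper is a hypothetical major.minor comparison (OpenMPI only guarantees ABI compatibility within a release series).

```shell
# Print the host MPI version (the mpirun that launches the container):
#   mpirun --version | head -n 1
# Print the MPI version inside the container:
#   singularity exec yandasoft-openmpi-4.0.2.simg mpirun --version | head -n 1

# Hypothetical helper: do two version strings share the same major.minor series?
same_series() {
    [ "${1%.*}" = "${2%.*}" ]
}

same_series 4.0.3 4.0.2 && echo "4.0.3 vs 4.0.2: same series"
same_series 3.2.1 4.0.2 || echo "3.2.1 vs 4.0.2: different series"
```

A host 4.0.3 launching a 4.0.2 container (as above) passes this check; a host MPICH 3.2 launching an OpenMPI 4 container does not, and mixing MPICH with OpenMPI never works regardless of version.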
I'm not sure. I pulled it from here:
https://github.com/ATNF/yandasoft/wiki/How-to-install-using-container
Using:
singularity pull docker://csirocass/yandasoft-openmpi-4.0.2
Linmos works within the same container and scales fine.
Looks like the parallel process is already working. From the top of the log file, the master and 2 workers are running.
Hi @davepallot , I wonder if you ran into the same problem with different dataset?
@prlahur I haven't attempted to run a different dataset. I'm going to try with a native yandasoft build.
It's possible it is related to the size of the image, although getting a segfault is not necessarily what you'd expect. However, as far as I can tell the communication where it is failing is not sending large amounts of data, so I suspect it is more likely to be a library issue.
@mattwhiting Would you like me to provide you with the linmos input data cube and selavy configuration script so you can possibly debug the issue?
I put together a small test data set and parset that may be useful for testing: https://data.pawsey.org.au/download/emushare/test_selavy/test_selavy.tar
This should run in less than a minute and find about 30 sources.
I can run this with my native build like so:
/usr/lib64/mpich-3.2/bin/mpirun -n 10 selavy -c selavy.in
but when I try running with the container like this:
/usr/lib64/mpich-3.2/bin/mpirun -n 10 singularity run yandasoft-mpich_latest.sif selavy -c selavy.in
I get either a send or receive error (one or the other, it changes each time I run it):

...
INFO analysis.parallelanalysis (5, pylos) [2020-07-24 13:24:57,588] - Dimensions of input image = 418 x 346
INFO analysis.parallelanalysis (5, pylos) [2020-07-24 13:24:57,588] - Using subsection [90:328,66:280]
INFO analysis.CASA (3, pylos) [2020-07-24 13:24:57,588] - FITSCoordinateUtil::fromFITSHeader: passing empty or nonexistant spectral Coordinate axis
INFO analysis.parallelanalysis (2, pylos) [2020-07-24 13:24:57,588] - Reading data from image image.taylor.0.fits
Fatal error in MPI_Send: Invalid communicator, error stack:
MPI_Send(174): MPI_Send(buf=0x7ffe675c00a0, count=1, MPI_UNSIGNED_LONG, dest=0, tag=0, comm=0x0) failed
MPI_Send(82).: Invalid communicator
or:

...
INFO analysis.parallelanalysis (0, pylos) [2020-07-24 13:32:44,594] - Read metadata from image image.taylor.0.fits
INFO analysis.parallelanalysis (0, pylos) [2020-07-24 13:32:44,594] - Dimensions are 418 346 1
INFO analysis.CASA (1, pylos) [2020-07-24 13:32:44,594] - FITSCoordinateUtil::fromFITSHeader: passing empty or nonexistant spectral Coordinate axis
INFO analysis.parallelanalysis (9, pylos) [2020-07-24 13:32:44,594] - Using subsection [229:418,181:346]
Fatal error in MPI_Recv: Invalid communicator, error stack:
MPI_Recv(200): MPI_Recv(buf=0x7ffc3433a460, count=1, MPI_UNSIGNED_LONG, src=1, tag=0, comm=0x5559, status=0x7ffc3433a480) failed
MPI_Recv(86).: Invalid communicator
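Before suspecting selavy itself, it can help to confirm that the hybrid launch works at all for that image. This is only a sketch using the host launcher and image name from the commands above; it assumes the `mpichversion` tool (which ships with MPICH installs) is present in the container.

```shell
# Which MPI stack is actually inside the container? If mpichversion is
# absent, the image was probably not built against MPICH at all.
singularity exec yandasoft-mpich_latest.sif mpichversion

# Launch something trivial through the same hybrid model used for selavy.
# This only confirms the launcher can start ranks inside the container;
# it does not exercise MPI communication itself.
/usr/lib64/mpich-3.2/bin/mpirun -n 2 singularity exec yandasoft-mpich_latest.sif hostname
```

If even this trivial launch misbehaves, the "Invalid communicator" errors point at a host/container MPI mismatch rather than a selavy bug.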
I also tried with a different container, but that fails to start:
mpirun -np 2 singularity run yandasoft-openmpi-4.0.2_latest.sif selavy
selavy: error while loading shared libraries: libopen-rte.so.40: cannot open shared object file: No such file or directory
That is a very odd error - looks more like an MPI error than a Selavy error. I'll try this myself sometime over the w/e and see if I can replicate it.
@jmarvil I managed to build my own yandasoft container with MPI v4.0.3. It appears to work on your test data. I will rerun with the full data set.
@davepallot, @jmarvil, sorry for taking so long. I made some modifications to the docker image. Following Pawsey's docker image, the MPICH implementation is now built from source code instead of installed with a simple apt-get. I tested that on Galaxy at Pawsey and it worked. I pushed the unofficial docker image here: https://hub.docker.com/r/lahur/yandasoft-mpich
Warning: at the end of the run it will give a lot of complaints about profiling. Please ignore this for the moment. As the message says, it's about profiling and it does not affect the result.
Checked on another HPC (Pearcey), with module MPICH-3.3 loaded. The small test case works there too. BTW, the MPICH in Galaxy (in Pawsey) is Cray MPICH 7.7 (Cray has its own numbering). MPICH is compatible across versions and vendors, so the image linked above should work on other machines with MPICH too. This link is for my reference: https://jira.csiro.au/browse/AXA-615
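For reference, the rebuilt image can be tried the same way as the earlier ones; the pulled filename below is an assumption based on singularity's default naming for Docker Hub pulls, and `selavy.in` is the parset from the small test case above.

```shell
# Pull the unofficial MPICH-based image mentioned above
# (filename yandasoft-mpich_latest.sif is singularity's default naming):
singularity pull docker://lahur/yandasoft-mpich

# Run the small test case through it with the host's MPICH launcher;
# MPICH's cross-version ABI compatibility is what makes this hybrid
# launch work even when host and container versions differ.
mpirun -n 10 singularity exec yandasoft-mpich_latest.sif selavy -c selavy.in
```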
Thanks @prlahur I'll close this.
I am attempting to run selavy within openmpi:
mpirun.openmpi --use-hwthread-cpus -np 3 singularity exec yandasoft-openmpi-4.0.2.simg selavy -c ./emu_data/selavy.conf
It is failing to start with the following error:
Debug: registered context Global=0
Debug: registered context Global=0
Debug: registered context Global=0
INFO analysis.askapparallel (1, emu-01) [2020-05-13 03:53:19,206] - ASKAP selavy (parallel) running on 3 nodes (worker 1)
INFO analysis.askapparallel (1, emu-01) [2020-05-13 03:53:19,206] - 0.0.0
INFO analysis.askapparallel (1, emu-01) [2020-05-13 03:53:19,206] - Compiled without OpenMP support
INFO selavy.log (1, emu-01) [2020-05-13 03:53:19,207] - ASKAP source finder ASKAPANALYSIS_VERSION_MAJOR:ASKAPANALYSIS_VERSION_MINOR:ASKAPANALYSIS_VERSION_PATCH
INFO analysis.askapparallel (2, emu-01) [2020-05-13 03:53:19,206] - ASKAP selavy (parallel) running on 3 nodes (worker 2)
INFO analysis.askapparallel (2, emu-01) [2020-05-13 03:53:19,206] - 0.0.0
INFO analysis.askapparallel (2, emu-01) [2020-05-13 03:53:19,206] - Compiled without OpenMP support
INFO selavy.log (2, emu-01) [2020-05-13 03:53:19,206] - ASKAP source finder ASKAPANALYSIS_VERSION_MAJOR:ASKAPANALYSIS_VERSION_MINOR:ASKAPANALYSIS_VERSION_PATCH
INFO analysis.askapparallel (0, emu-01) [2020-05-13 03:53:19,206] - ASKAP selavy (parallel) running on 3 nodes (master/master)
INFO analysis.askapparallel (0, emu-01) [2020-05-13 03:53:19,207] - 0.0.0
INFO analysis.askapparallel (0, emu-01) [2020-05-13 03:53:19,207] - Compiled without OpenMP support
INFO selavy.log (0, emu-01) [2020-05-13 03:53:19,207] - ASKAP source finder ASKAPANALYSIS_VERSION_MAJOR:ASKAPANALYSIS_VERSION_MINOR:ASKAPANALYSIS_VERSION_PATCH
INFO selavy.log (0, emu-01) [2020-05-13 03:53:19,207] - Parset file contents:
Selavy.findSpectralTerms=[true, false]
Selavy.Fitter.doFit=true
Selavy.Fitter.fitTypes=[full]
Selavy.Fitter.maxReducedChisq=10.
Selavy.flagGrowth=true
Selavy.growthThreshold=3
Selavy.image=/data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
Selavy.imagetype=fits
Selavy.minChannels=1
Selavy.minPix=3
Selavy.minVoxels=3
Selavy.nsubx=5
Selavy.nsuby=3
Selavy.snrCut=5
Selavy.sortingParam=-pflux
Selavy.spectralTermsFromTaylor=true
Selavy.threshSpatial=5
Selavy.VariableThreshold=false
Selavy.VariableThreshold.boxSize=500
Selavy.Weights.weightsCutoff=0.09
Selavy.Weights.weightsImage=/data/emu_data/weights.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.weighter (0, emu-01) [2020-05-13 03:53:19,207] - Using weights image: /data/emu_data/weights.i.pilot10.cont.linmos.taylor.0.fits
WARN analysis.varthresh (2, emu-01) [2020-05-13 03:53:19,215] - Variable Thresholder: reuse=true, but no SNR image name given. Turning reuse off.
INFO analysis.duchampinterface (2, emu-01) [2020-05-13 03:53:19,215] - Changing the mask output file from /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.MASK.fits to image.i.pilot10.cont.linmos.taylor.0.MASK.fits
WARN analysis.varthresh (1, emu-01) [2020-05-13 03:53:19,216] - Variable Thresholder: reuse=true, but no SNR image name given. Turning reuse off.
WARN analysis.varthresh (0, emu-01) [2020-05-13 03:53:19,216] - Variable Thresholder: reuse=true, but no SNR image name given. Turning reuse off.
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,216] - Initialising parallel finder, based on Duchamp v1.6.2
INFO analysis.duchampinterface (1, emu-01) [2020-05-13 03:53:19,216] - Changing the mask output file from /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.MASK.fits to image.i.pilot10.cont.linmos.taylor.0.MASK.fits
INFO analysis.duchampinterface (0, emu-01) [2020-05-13 03:53:19,216] - Changing the mask output file from /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.MASK.fits to image.i.pilot10.cont.linmos.taylor.0.MASK.fits
INFO analysis.CASA (1, emu-01) [2020-05-13 03:53:19,255] - FITSCoordinateUtil::fromFITSHeader::MPIServer-1: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.CASA (2, emu-01) [2020-05-13 03:53:19,255] - FITSCoordinateUtil::fromFITSHeader::MPIServer-2: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.CASA (0, emu-01) [2020-05-13 03:53:19,256] - FITSCoordinateUtil::fromFITSHeader: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,262] - Changed Subimage overlaps to 3,3,0
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,262] - Changed Subimage overlaps to 3,3,0
INFO analysis.CASA (1, emu-01) [2020-05-13 03:53:19,263] - FITSCoordinateUtil::fromFITSHeader::MPIServer-1: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,263] - Changed Subimage overlaps to 3,3,0
INFO selavy.log (0, emu-01) [2020-05-13 03:53:19,263] - Parset file as used:
binaryCatalogue=selavy-catalogue.dpc
casaFile=selavy-results.crf
ds9File=selavy-results.reg
findSpectralTerms=[true, false]
fitAnnotationFile=selavy-fitResults.ann
fitBoxAnnotationFile=selavy-fitResults.boxes.ann
fitResultsFile=selavy-fitResults.txt
Fitter.doFit=true
Fitter.fitTypes=[full]
Fitter.maxReducedChisq=10.
flagGrowth=true
growthThreshold=3
headerFile=selavy-results.hdr
image=/data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
imagetype=fits
karmaFile=selavy-results.ann
logFile=selavy-Logfile.txt
minChannels=1
minPix=3
minVoxels=3
nsubx=5
nsuby=3
overlapx=3
overlapy=3
overlapz=0
resultsFile=selavy-results.txt
snrCut=5
sortingParam=-pflux
spectralTermsFromTaylor=true
spectraTextFile=selavy-spectra.txt
subimageAnnotationFile=selavy-SubimageLocations.ann
threshSpatial=5
VariableThreshold=false
VariableThreshold.boxSize=500
votFile=selavy-results.xml
Weights.weightsCutoff=0.09
Weights.weightsImage=/data/emu_data/weights.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,263] - About to read metadata from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.CASA (2, emu-01) [2020-05-13 03:53:19,264] - FITSCoordinateUtil::fromFITSHeader::MPIServer-2: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,264] - Dimensions of input image = 44911 x 33569 x 1 x 1
INFO analysis.CASA (0, emu-01) [2020-05-13 03:53:19,264] - FITSCoordinateUtil::fromFITSHeader: Neither SPECSYS nor VELREF keyword given, spectral reference frame not defined ...
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,264] - Using subsection [1:8983,1:11190,1:1,1:1]
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,265] - Dimensions of input image = 44911 x 33569 x 1 x 1
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,266] - Using subsection [8982:17965,1:11190,1:1,1:1]
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,266] - Dimensions of input image = 44911 x 33569 x 1 x 1
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,266] - Using subsection [1:44911,1:33569,1:1,1:1]
INFO analysis.parallelanalysis (1, emu-01) [2020-05-13 03:53:19,267] - Reading data from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.parallelanalysis (2, emu-01) [2020-05-13 03:53:19,268] - Reading data from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.casainterface (0, emu-01) [2020-05-13 03:53:19,269] - Read beam from casa image: [13.9585 arcsec, 10.8968 arcsec, -58.1545 deg]
INFO analysis.subimagedef (0, emu-01) [2020-05-13 03:53:19,269] - Input subsection to be used is [,,,] with dimensions 44911x33569x1x1
INFO analysis.subimagedef (0, emu-01) [2020-05-13 03:53:19,270] - Writing annotation file showing subimages to selavy-SubimageLocations.ann
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,270] - Read metadata from image /data/emu_data/image.i.pilot10.cont.linmos.taylor.0.fits
INFO analysis.parallelanalysis (0, emu-01) [2020-05-13 03:53:19,270] - Dimensions are 44911 33569 1
[emu-01:274595] Process received signal
[emu-01:274595] Signal: Segmentation fault (11)
[emu-01:274595] Signal code: Address not mapped (1)
[emu-01:274595] Failing at address: 0xffffffffffffff18
[emu-01:274595] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f205c2f8f20]
[emu-01:274595] [ 1] /usr/local/lib/libmpi.so.40(MPI_Recv+0x20c)[0x7f2056dae7fc]
[emu-01:274595] [ 2] /usr/local/lib/libaskap_askapparallel.so(_ZN5askap13askapparallel8MPIComms11receiveImplEPvmiim+0xab)[0x7f205cc6efb5]
[emu-01:274595] [ 3] /usr/local/lib/libaskap_askapparallel.so(_ZN5askap13askapparallel8MPIComms7receiveEPvmiim+0x43)[0x7f205cc6eec3]
[emu-01:274595] [ 4] /usr/local/lib/libaskap_askapparallel.so(_ZN5askap13askapparallel13AskapParallel11receiveBlobERN5LOFAR10BlobStringEi+0x64)[0x7f205cc66f24]
[emu-01:274595] [ 5] /usr/local/lib/libaskap_analysis.so(_ZN5askap8analysis8Weighter8findNormEv+0x6b4)[0x7f205e40e00c]
[emu-01:274595] [ 6] /usr/local/lib/libaskap_analysis.so(_ZN5askap8analysis8Weighter10initialiseERN7duchamp4CubeEb+0x5f)[0x7f205e40d367]
[emu-01:274595] [ 7] /usr/local/lib/libaskap_analysis.so(_ZN5askap8analysis15DuchampParallel10preprocessEv+0x185)[0x7f205e3ecc45]
[emu-01:274595] [ 8] selavy(_ZN9SelavyApp3runEiPPc+0x5d8)[0x556d1c7f29a0]
[emu-01:274595] [ 9] /usr/local/lib/libaskap_askap.so(_ZN5askap11Application4mainEiPPc+0xfb)[0x7f205db2ff59]
[emu-01:274595] [10] selavy(main+0x55)[0x556d1c7ef936]
[emu-01:274595] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f205c2dbb97]
[emu-01:274595] [12] selavy(_start+0x2a)[0x556d1c7ee30a]
[emu-01:274595] End of error message