Closed rainwoodman closed 6 years ago
Yes, that's what I find, too. But do recall that I did establish that the code gives consistent answers on cori and edison, which is what I set out to investigate.
On Tue, Mar 29, 2016 at 12:59 PM, Yu Feng notifications@github.com wrote:
The main error message is
--15909:0: aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low.
--15909:0: aspacem   Increase it and rebuild.  Exiting now.
The full log is here:
valgrind ../src/fiberassign params_fiberassign.txt
==15909== Memcheck, a memory error detector
==15909== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==15909== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==15909== Command: ../src/fiberassign params_fiberassign.txt
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x433873: printFile(char const*) (misc.cpp:999)
==15909==    by 0x40380C: main (fiberassign.cpp:32)
==15909==
Targfile mtl-lite.fits
SStarsfile stdstars-lite.fits
SkyFfile sky-lite.fits
Secretfile truth-lite.fits
surveyFile default_survey_list.txt
tileFile 0.3.1/data/footprint/desi-tiles.par
fibFile 0.3.1/data/focalplane/fiberpos.txt
outDir /home/yfeng1/source/fiberassign/test/output/
PrintAscii true PrintFits false diagnose true
kind QSOLy-a QSOTracer LRG ELG FakeQSO FakeLRG SS SF
type QSO QSO LRG ELG QSO LRG SS SF
prio 3400 3400 3200 3000 3400 3200 0 0
priopost 3500 0 3200 0 0 0 0 0
goal 5 5 2 1 5 2 5 5
goalpost 5 1 2 1 1 1 5 5
lastpass 0 0 0 1 0 0 1 1
SS 0 0 0 0 0 0 1 0
SF 0 0 0 0 0 0 0 1
pass_intervals 0 50 100 150 200
Randomize false Pacman false Npass 5 MaxSS 10 MaxSF 40 PlateRadius 1.65 InterPlate 0 Analysis 0 InfDens false
TotalArea 15789.0 invFibArea 700 moduloGal 1 moduloFiber 1
Collision false Exact true AvCollide 3.2 Collide 1.98 NoCollide 7.0 PatrolRad 5.8 NeighborRad 14.05
PlotObsTime false PlotHistLya false PlotDistLya false PlotFreeFibHist false PlotFreeFibTime false PlotSeenDens false PrintGalObs false
MinDec -10. MaxDec 10. MinRa 0. MaxRa 10.
Verif false
read target, SS, SF files at 0.141 s
reading MTL file mtl-lite.fits HDU #2 Binary Table:
Keeping 593965 targets within ra/dec ranges
reading MTL file stdstars-lite.fits HDU #2 Binary Table:
NUMOBS_MORE not found ... setting to 0
PRIORITY not found ... setting to 0
GRAYLAYER not found ... setting to 0
Keeping 27715 targets within ra/dec ranges
reading MTL file sky-lite.fits HDU #2 Binary Table:
NUMOBS_MORE not found ... setting to 0
PRIORITY not found ... setting to 0
GRAYLAYER not found ... setting to 0
Keeping 277634 targets within ra/dec ranges
... took : 14.9 s
Target size 593965
Standard Star size 621680
Sky Fiber size 899314
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x403B89: main (fiberassign.cpp:50)
==15909==
==15909== Use of uninitialised value of size 8
==15909==    at 0x403BA7: main (fiberassign.cpp:51)
==15909==
==15909== Use of uninitialised value of size 8
==15909==    at 0x403BC6: main (fiberassign.cpp:51)
==15909==
==15909== Use of uninitialised value of size 8
==15909==    at 0x403BF5: main (fiberassign.cpp:52)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x403C3E: main (fiberassign.cpp:55)
==15909==
==15909== Use of uninitialised value of size 8
==15909==    at 0x403C58: main (fiberassign.cpp:56)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x5AF20CB: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
==15909== Use of uninitialised value of size 8
==15909==    at 0x5AEE0CB: _itoa_word (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF2610: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x5AEE0D5: _itoa_word (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF2610: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x5AF268E: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x5AF21A1: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x5AF2741: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x5AF21F3: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x5AF222A: vfprintf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x5AF8D28: printf (in /usr/lib64/libc-2.22.so)
==15909==    by 0x403C6D: main (fiberassign.cpp:56)
==15909==
class 0 number 474347
class 1 number 68576
class 2 number 51042
... took : 21.3 s
getting file list
number of tiles 10666
==15909== Warning: set address range perms: large range [0x395db040, 0x747e6840) (undefined)
size of P 10666
--15909:0: aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low.
--15909:0: aspacem   Increase it and rebuild.  Exiting now.
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/desihub/fiberassign/issues/35
Yes. I think this (valgrind crashing) is a separate issue from #31.
It may take a huge rewrite to get through valgrind cleanly, and the effort is likely not worth it: after all, most (if not all) memory access is protected by std::vector and looks pretty safe.
But I do think we should leave a record of this incompatibility with valgrind on the bug tracker.
Is the script in test/test_fiberassign.py still the main "functional test"? Or is there some better test case to use?
My recollection is that a long time ago, though fiberassign ran fine, valgrind found problems of the sort "missing constructor." I thought this had been fixed, but what is needed is to run valgrind on it again.
On Fri, May 25, 2018 at 6:03 PM, Theodore Kisner notifications@github.com wrote:
Is the script in test/test_fiberassign.py still the main "functional test"? Or is there some better test case to use?
test/test_fiberassign.py is probably broken after the interface change to use command-line arguments instead of a config file. Example command from the minitest notebook that can be used for testing with valgrind:
basedir=/project/projectdirs/desi/datachallenge/reference_runs/18.3
fiberassign \
--mtl $basedir/targets/mtl.fits \
--stdstar $basedir/targets/standards-dark.fits \
--sky $basedir/targets/sky.fits \
--surveytiles $basedir/fiberassign/dark-tiles.txt \
--footprint $basedir/targets/test-tiles.fits \
--positioners $DESIMODEL/data/focalplane/fiberpos.txt \
--fibstatusfile $basedir/fiberassign/fiberstatus.ecsv \
--outdir $SCRATCH/temp
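A rough sketch of how a replacement functional test might assemble the same command line, in Python since the existing test is a Python script. The function name `fiberassign_argv` and its parameters are hypothetical; only the flags and relative paths are taken from the example command above. Nothing is executed here, the list is just built and printed.

```python
import os
import shlex


def fiberassign_argv(basedir, desimodel, outdir, exe="fiberassign"):
    """Build the argument list for the example run (hypothetical helper).

    The flags mirror the example command above; nothing is executed.
    """
    options = [
        ("--mtl", f"{basedir}/targets/mtl.fits"),
        ("--stdstar", f"{basedir}/targets/standards-dark.fits"),
        ("--sky", f"{basedir}/targets/sky.fits"),
        ("--surveytiles", f"{basedir}/fiberassign/dark-tiles.txt"),
        ("--footprint", f"{basedir}/targets/test-tiles.fits"),
        ("--positioners", f"{desimodel}/data/focalplane/fiberpos.txt"),
        ("--fibstatusfile", f"{basedir}/fiberassign/fiberstatus.ecsv"),
        ("--outdir", outdir),
    ]
    argv = [exe]
    for flag, value in options:
        argv += [flag, value]
    return argv


if __name__ == "__main__":
    argv = fiberassign_argv(
        basedir="/project/projectdirs/desi/datachallenge/reference_runs/18.3",
        # Fall back to the literal variable names if the environment is not set.
        desimodel=os.environ.get("DESIMODEL", "$DESIMODEL"),
        outdir=os.environ.get("SCRATCH", "$SCRATCH") + "/temp",
    )
    print(" ".join(shlex.quote(a) for a in argv))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later, or to prepend a valgrind invocation to it.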
I cannot reproduce this on edison. Steps to verify:
Load your favorite desiconda environment
Go into your fiberassign checkout, master branch, and install to (for example) someplace in scratch:
$> PLATFORM=harpconfig INSTALL_DIR=$SCRATCH/software/fiberassign make clean
$> PLATFORM=harpconfig INSTALL_DIR=$SCRATCH/software/fiberassign make install
Note that I always build using the harpconfig platform file, which allows for using the same compile options as HARP (installed in desiconda) and SPECEX (which also uses harpconfig). This builds with the Intel compilers at NERSC, the same ones used to build the compiled packages in desiconda.
Make sure that this fiberassign is first in your path:
export PATH=$SCRATCH/software/fiberassign/bin:$PATH
Load the Intel-compatible version of valgrind, and run it.
$> module load valgrind
$> basedir=/project/projectdirs/desi/datachallenge/reference_runs/18.3
$> valgrind --leak-check=full --track-origins=yes fiberassign \
--mtl $basedir/targets/mtl.fits \
--stdstar $basedir/targets/standards-dark.fits \
--sky $basedir/targets/sky.fits \
--surveytiles $basedir/fiberassign/dark-tiles.txt \
--footprint $basedir/targets/test-tiles.fits \
--positioners $DESIMODEL/data/focalplane/fiberpos.txt \
--fibstatusfile $basedir/fiberassign/fiberstatus.ecsv \
--outdir ./out
Output is
==6505== Memcheck, a memory error detector
==6505== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==6505== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==6505== Command: /scratch2/scratchdirs/kisner/software/fiberassign/bin/fiberassign --mtl /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/mtl.fits --stdstar /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/standards-dark.fits --sky /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/sky.fits --surveytiles /project/projectdirs/desi/datachallenge/reference_runs/18.3/fiberassign/dark-tiles.txt --footprint /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/test-tiles.fits --positioners /global/common/software/desi/users/kisner/edison/20180130-1.2.4-spec/desimodel/0.9.1/data/focalplane/fiberpos.txt --fibstatusfile /project/projectdirs/desi/datachallenge/reference_runs/18.3/fiberassign/fiberstatus.ecsv --outdir ./out
==6505==
fiberassign_exec --mtl /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/mtl.fits --sky /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/sky.fits --stdstar /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/standards-dark.fits --fibstatusfile /project/projectdirs/desi/datachallenge/reference_runs/18.3/fiberassign/fiberstatus.ecsv --outdir ./out --surveytiles /project/projectdirs/desi/datachallenge/reference_runs/18.3/fiberassign/dark-tiles.txt --footprint /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/test-tiles.fits --positioners /global/common/software/desi/users/kisner/edison/20180130-1.2.4-spec/desimodel/0.9.1/data/focalplane/fiberpos.txt --starmask 60129542144 --rundate 2018-06-12
# Read target, SS, SF files at 4.7e-05 s
star mask 60129542144
Finding file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/standards-dark.fits
Found MTL input file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/standards-dark.fits
Reading MTL input file /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/standards-dark.fits
NUMOBS_MORE not found ... setting to 0
PRIORITY not found ... setting to 0
Keeping 1217 targets within ra/dec ranges
star mask 0
Finding file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/sky.fits
Found MTL input file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/sky.fits
Reading MTL input file /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/sky.fits
NUMOBS_MORE not found ... setting to 0
PRIORITY not found ... setting to 0
Keeping 48128 targets within ra/dec ranges
star mask 0
Finding file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/mtl.fits
Found MTL input file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/mtl.fits
Reading MTL input file /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/mtl.fits
Keeping 240902 targets within ra/dec ranges
# ...read targets took : 0.473 s
Target size 240902
Standard Star size 242119
Sky Fiber size 290247
# map position in target list to immutable targetid at 0.633 s
# assign priority classes at 0.633 s
class 0 number 1221
class 1 number 10452
class 2 number 43629
class 3 number 625
class 4 number 42659
class 5 number 37339
class 6 number 83651
class 7 number 15443
class 8 number 7100
# ...priority list took : 0.0108 s
# Start positioners at 0.644 s
before reading positioners
read the positioner file
sorted by fiber number
i 0 FibPos[i].fib_num 0
i 1 FibPos[i].fib_num 1
i 2 FibPos[i].fib_num 2
i 3 FibPos[i].fib_num 3
i 4 FibPos[i].fib_num 4
i 5 FibPos[i].fib_num 5
i 6 FibPos[i].fib_num 6
i 7 FibPos[i].fib_num 7
i 8 FibPos[i].fib_num 8
i 9 FibPos[i].fib_num 9
made neighbors
Input TimeSun Jun 12 00:00:00 2018
Current TimeSun Jun 12 00:00:00 2018
before reading status
Read from fiber status: Fiber_pos 0 Location 95 Broken 1 Stuck 0 dates 2018-02-21T09:23:51 2100-02-21T09:24:24
Init Time for FiberSun Feb 21 09:23:51 2018
End Time for FiberSun Feb 21 09:24:24 2100
Changing fiberastatus entry: Fiber 0 Location 95
BROKEN
Read from fiber status: Fiber_pos 1 Location 62 Broken 1 Stuck 0 dates 2018-02-21T09:23:51 2100-02-21T09:24:24
Init Time for FiberSun Feb 21 09:23:51 2018
End Time for FiberSun Feb 21 09:24:24 2100
Changing fiberastatus entry: Fiber 1 Location 62
BROKEN
Read from fiber status: Fiber_pos 2 Location 102 Broken 0 Stuck 1 dates 2018-02-21T09:23:51 2100-02-21T09:24:24
Init Time for FiberSun Feb 21 09:23:51 2018
End Time for FiberSun Feb 21 09:24:24 2100
Changing fiberastatus entry: Fiber 2 Location 102
STUCK
Read from fiber status: Fiber_pos 3 Location 82 Broken 0 Stuck 1 dates 2018-02-21T09:23:51 2100-02-21T09:24:24
Init Time for FiberSun Feb 21 09:23:51 2018
End Time for FiberSun Feb 21 09:24:24 2100
Changing fiberastatus entry: Fiber 3 Location 82
STUCK
Read from fiber status: Fiber_pos 4 Location 131 Broken 0 Stuck 1 dates 2018-02-21T09:23:51 2100-02-21T09:24:24
Init Time for FiberSun Feb 21 09:23:51 2018
End Time for FiberSun Feb 21 09:24:24 2100
Changing fiberastatus entry: Fiber 4 Location 131
STUCK
read status file
# ..posiioners took : 0.208 s
# Start plates at 0.853 s
number of tiles 7
Finding file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/test-tiles.fits
Found input tile centers file: /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/test-tiles.fits
Reading input tile centers file /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/test-tiles.fits
size of P 10
# Start invert tiles at 0.0147 s
# ..inversion took : 2.31e-05 s
# do inversion of used plates at 0.0147 s
# .. sued plates inversion took : 0.00207 s
# Read 7 plate centers from /project/projectdirs/desi/datachallenge/reference_runs/18.3/targets/test-tiles.fits and 5000 fibers from /global/common/software/desi/users/kisner/edison/20180130-1.2.4-spec/desimodel/0.9.1/data/focalplane/fiberpos.txt
# ..plates took : 0.0172 s
# Start building HTM tree at 0.87 s
# Doing kd-tree... took : 0.0747 s
# collect galaxies at at 0.945 s
# Begin collecting available galaxies
# ... took : 0.195 s
# ... took : 0.196 s
# collect available tile-fibers at at 1.14 s
# Begin computing available tilefibers
# ... took : 0.0629 s
galaxies outside footprint 20299
Nplate 7 Ngal 290247 Nfiber 5000
# Start assignment at : 1.22 s
# Begin simple assignment :
# ... took : 0.45 s
countme 35000
Plates actually used 7
start redistribute
# Begin redistribute TF :
46 redistributions of tile-fibers
# ... took : 0.00648 s
# Begin improve :
improvements 48
# ... took : 0.00188 s
start redistribute
# Begin redistribute TF :
4 redistributions of tile-fibers
# ... took : 0.00375 s
# assign SS and SF at 1.68 s
# count SS and SF at 1.82 s
Totals SS 0 SF 2800 class 0 0 class 1 0 class 2 12 class 3 0 class 4 586 class 5 0 class 6 17806 class 7 7935 class 8 5826
# print fits files at 1.82 s
# Finished !... in 1.91 s
Looking at the platform files that are labeled "nersc_*", they seem to be using GNU compilers, but linking to cfitsio from desiconda (built with Intel). In principle those should be ABI compatible, but it is probably safer to use the harpconfig platform and get the same compilers and options used for compiled code everywhere else in the desi stack.
Ah, interesting: it looks like the final executable is now "fiberassign_exec" rather than "fiberassign". My tests above were with an older version of the executable. Ignore my previous results.
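For anyone repeating this: if "fiberassign" is now a wrapper that launches "fiberassign_exec" as a separate program (an assumption on my part, based on the rename), valgrind has to be pointed at "fiberassign_exec" directly or be told to follow exec'd children with `--trace-children=yes`. A small Python sketch of building either command; `valgrind_argv` is a hypothetical helper, and the other two valgrind flags are the ones used in the run above.

```python
def valgrind_argv(target_argv, trace_children=False):
    """Prefix a command with the valgrind options used in this thread.

    target_argv    -- the command to check, as a list of strings
    trace_children -- add --trace-children=yes so valgrind follows
                      programs started via exec (off by default)
    """
    argv = ["valgrind", "--leak-check=full", "--track-origins=yes"]
    if trace_children:
        argv.append("--trace-children=yes")
    return argv + list(target_argv)


if __name__ == "__main__":
    # Option 1: check the compiled executable directly.
    print(valgrind_argv(["fiberassign_exec", "--mtl", "mtl.fits"]))
    # Option 2: go through the wrapper and follow its children.
    print(valgrind_argv(["fiberassign", "--mtl", "mtl.fits"], trace_children=True))
```

Pointing valgrind at the compiled executable is usually the cleaner choice, since tracing children also instruments every other program the wrapper starts.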
Ok, some more information. Using Intel-compiled fiberassign with the Intel-compiled libcfitsio from desiconda causes valgrind to die with an unhandled instruction error. This is due to SSE4 instructions in the Intel math library, which is linked in with "-lm" when building cfitsio. I built my own cfitsio (and valgrind) on edison with gcc-7.1, and then built fiberassign with the same gcc and ran it. This produces fairly clean output: there is one place to dig deeper to double check that memory is initialized, and then there are several places we need to check to ensure memory is being freed.
The conclusion here is: don't use valgrind with Intel-compiled code. Fortunately we can test this with valgrind using gcc, and could also run in vtune if we needed to check the Intel built version.
Here is the valgrind output: fiberassign_valgrind.log
I'll leave this ticket open until I investigate those areas flagged in the output.
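To triage a memcheck log like the ones in this thread, a short Python script can tally the uninitialised-value reports and pick out fatal valgrind messages. This is only a sketch: `summarize` is a hypothetical helper, it assumes the log has one message per line (as valgrind writes it), and it deliberately ignores leak records.

```python
import re
from collections import Counter

# Memcheck report headers look like "==PID== <message>"; stack-frame lines
# have extra indentation after the prefix and are skipped by this pattern.
_HEADER = re.compile(r"^==\d+== (?P<msg>\S.*)$")

_KINDS = (
    "Conditional jump or move depends on uninitialised value(s)",
    "Use of uninitialised value",
)


def summarize(log_text):
    """Return (counts per report kind, list of fatal valgrind lines)."""
    counts = Counter()
    fatal = []
    for line in log_text.splitlines():
        if "Valgrind: FATAL:" in line:
            fatal.append(line.strip())
            continue
        match = _HEADER.match(line)
        if not match:
            continue
        for kind in _KINDS:
            if match.group("msg").startswith(kind):
                counts[kind] += 1
    return counts, fatal


if __name__ == "__main__":
    sample = """\
==15909== Conditional jump or move depends on uninitialised value(s)
==15909==    at 0x403B89: main (fiberassign.cpp:50)
==15909==
==15909== Use of uninitialised value of size 8
==15909==    at 0x403BA7: main (fiberassign.cpp:51)
--15909:0: aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low.
"""
    counts, fatal = summarize(sample)
    for kind, n in counts.items():
        print(n, kind)
    print("fatal:", fatal)
```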