Closed andreufont closed 4 years ago
OK, thanks @andreufont. Check out branch #50: I've added the missing files, but I can't reproduce the error above. What command did you use to run CoLoRe?
I used the suggested command from INSTALL_Cori.md:
mpirun -n 1 ./CoLoRe examples/simple/param.cfg
@fjaviersanchez - can you reproduce the error? Here is my "git diff Makefile":
 ########## User-definable stuff ##########
 #
-COMP_SER = gcc
-COMP_MPI = mpicc
+COMP_SER = cc
+COMP_MPI = cc
 OPTIONS = -Wall -O3 -std=c99
 #
@@ -19,7 +19,7 @@ USE_SINGLE_PRECISION = yes
 ADD_EXTRA_KAPPA = yes
-USE_HDF5 = yes
+USE_HDF5 = no
 USE_OMP = yes
@@ -29,29 +29,27 @@ USE_MPI = yes
-#GSL_INC = -I/add/path
-#GSL_LIB = -L/add/path
-GSL_INC = -I/home/alonso/include
-GSL_LIB = -L/home/alonso/lib
+GSL_INC = -I${GSL_DIR}/include
+GSL_LIB = -L${GSL_DIR}/lib -lgsl -lgslcblas
-FFTW_INC =
-FFTW_LIB =
-#cfitsio
-FITS_INC =
-FITS_LIB =
+FFTW_INC = -I${FFTW_DIR}/include
+FFTW_LIB = -L${FFTW_DIR}/lib
+FITS_INC = -I${CFITSIO_DIR}/include
+FITS_LIB = -L${CFITSIO_DIR}/lib -lcfitsio
+#HDF5
 HDF5_INC =
 HDF5_LIB =
-CONF_INC =
-CONF_LIB =
+CONF_INC = -I${COLORE_PATH}/Install.cori/include
+CONF_LIB = -L${COLORE_PATH}/Install.cori/lib
-HPIX_INC =
-HPIX_LIB =
+HPIX_INC = -I${HEALPIX_PATH}/include
+HPIX_LIB = -L${HEALPIX_PATH}/lib
-SHT_INC =
-SHT_LIB =
-#
+SHT_INC = -I${COLORE_PATH}/libsharp/auto/include
+SHT_LIB = -L${COLORE_PATH}/libsharp/auto/lib
+
Note: I did the above from an interactive run. Not sure if I should have sourced any file before running... I'll try again tomorrow.
OK, thanks. I can't reproduce this on my laptop, so it may be a NERSC thing. I'll try to reproduce it this week. Just to check: did you try again with the latest modifications in the cleanup branch?
Yes, same error in both branches. I'm recompiling from scratch now after chatting with Javi on the Slack channel.
I got a fresh copy of the repo and recompiled the new branch using the instructions. I get the same error, but the code finishes and I can make all the pretty plots:
salloc -N 1 -C haswell -q interactive -t 01:00:00
source ~/setup_CoLoRe.sh # load relevant modules
mpirun -n 1 ./CoLoRe examples/simple/param.cfg > examples/simple/info &
I get the same error as @andreufont. The code runs to the end, but then it dies. I think there's a reused memory allocation somewhere that is messing things up. I'll run a backtrace.
This is the backtrace:
*** Error in `/global/u2/j/jsanch87/CoLoRe_edison/CoLoRe': free(): invalid next size (normal): 0x000001008c8b88d0 ***
Thread 1 "CoLoRe" received signal SIGABRT, Aborted.
raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x0000000020987501 in abort () at abort.c:79
#2 0x00000000209a8617 in __libc_message (action=action@entry=(do_abort | do_backtrace), fmt=fmt@entry=0x20b9f160 "*** Error in `%s': %s: 0x%s ***\n")
at ../sysdeps/posix/libc_fatal.c:181
#3 0x00000000209aea43 in malloc_printerr (action=<optimized out>, str=0x20b9f498 "free(): invalid next size (normal)", ptr=<optimized out>, ar_ptr=<optimized out>)
at malloc.c:5428
#4 0x00000000209b02f1 in _int_free (av=0x40cd8e20 <main_arena>, p=<optimized out>, have_lock=<optimized out>) at malloc.c:4170
#5 0x000000002000358d in catalog_free ()
#6 0x00000000200171a8 in param_colore_free ()
#7 0x00000000200270c8 in main ()
More clues: it only happens for n_grid>=128 and when the store_skewers parameter for srcs2 is true. It works fine if you don't store the skewers for srcs2.
ok, I'll try to run valgrind asap on this.
I ran valgrind. I can send you the output if you want :) (It looks like the guilty part is line 531 of common.c.)
Ok, I can't really reproduce this on my laptop (and valgrind doesn't find anything either). I'm gonna try NERSC, but it'd be good if you can send me your valgrind output @fjaviersanchez
OK, no idea what's going on. I can reproduce the error at NERSC, but my valgrind doesn't really point to any memory leaks in the CoLoRe code (there's some stuff related to MPI, but that's it). Your log file would be useful @fjaviersanchez . In the meantime I'm trying some other stuff.
OK, this is where we stand: after running module unload craype-hugepages2M, the problem goes away completely. This also solves the issues I've seen with all my other codes, so I suspect this is something to do with the new NERSC compilers not liking something about libsharp (which is the common denominator in all of them). So, @andreufont, @fjaviersanchez, if you could check that the problem goes away for you after running module unload craype-hugepages2M, please let me know and I'll just add this to the Cori instructions and close this issue.
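For anyone checking the workaround, here is a minimal sketch of a wrapper that unloads craype-hugepages2M only if it is currently loaded and then launches the same test run used earlier in this thread. It assumes the module system exports the colon-separated LOADEDMODULES variable, as Environment Modules does; the helper name is_module_loaded is made up for this example and is not part of CoLoRe or the NERSC setup.

```shell
#!/bin/sh
# Sketch only: is_module_loaded is a hypothetical helper, and relying on
# LOADEDMODULES is an assumption about how the site's module system is set up.

# Return 0 if module "$1" appears in the colon-separated LOADEDMODULES list,
# either as a bare name or with a /version suffix.
is_module_loaded() {
  case ":${LOADEDMODULES:-}:" in
    *":$1:"*|*":$1/"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Unload hugepages only when it is loaded and the module command is available.
if is_module_loaded craype-hugepages2M; then
  if command -v module >/dev/null 2>&1; then
    module unload craype-hugepages2M
  fi
fi

# Launch the simple example only if the binary has been built in this directory.
if [ -x ./CoLoRe ]; then
  mpirun -n 1 ./CoLoRe examples/simple/param.cfg
fi
```

Running module list before and after is a quick manual way to confirm that the module is really gone from the session.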
I confirm this fixes it. Also, I saw this on another Cray-based system's help page about the craype-hugepages2M module (from http://www.archer.ac.uk/about-archer/software/modcatalogue/craype-hugepages2M/):
NOTE !!!: If executing on node other than compute, unload this module.
So I guess we have to update the instructions to say this. It probably works fine on the compute nodes without doing this, but it crashes on the interactive nodes, so better safe than sorry.
Also some more info here: https://docs.nersc.gov/performance/variability/
OK, but this is what confuses me: we need to unload this in the compute node. Anyway, I'm trying to track this down, but I think updating the instructions for now is good enough. I did so in #50 .
Thanks!
So, after talking to the LSST NERSC wizards, it may be that we take a hit by turning off hugepages, but it's not clear. So this means that we may have to reopen this if we realize that CoLoRe is now much slower. For now I'll just merge #50 and solve this.
The new folders with examples do not contain Pk_CAMB_test.dat or similar, and the code crashes soon after starting.
Also, after I copied the CAMB file from a previous run, and tried to run simple/param.cfg, I get the following error at the end of the program:
*** Error in `./CoLoRe': corrupted size vs. prev_size: 0x000001008c8cb1d0 ***
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 7948 on node nid00213 exited on signal 6 (Aborted).
On the other hand, cl_test runs fine.