andreufont commented 4 years ago

The new folders with examples do not contain Pk_CAMB_test.dat or similar, and the code crashes soon after starting.

Also, after I copied the CAMB file from a previous run and tried to run simple/param.cfg, I got the following error at the end of the program:

Error in `./CoLoRe': corrupted size vs. prev_size: 0x000001008c8cb1d0

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun noticed that process rank 0 with PID 7948 on node nid00213 exited on signal 6 (Aborted).

On the other hand, cl_test runs fine.

damonge commented 4 years ago

OK, thanks @andreufont. Check out branch #50: I've added the missing files, but I can't reproduce the error above. What command did you use to run CoLoRe?

andreufont commented 4 years ago

I used the suggested command from INSTALL_Cori.md:

mpirun -n 1 ./CoLoRe examples/simple/param.cfg

@fjaviersanchez - can you reproduce the error? Here is my "git diff Makefile":

########## User-definable stuff ##########
#
# Compiler and compilation options
-COMP_SER = gcc
-COMP_MPI = mpicc
+COMP_SER = cc
+COMP_MPI = cc
OPTIONS = -Wall -O3 -std=c99
#
# Behavioural flags
@@ -19,7 +19,7 @@ USE_SINGLE_PRECISION = yes
# Add random perturbations to kappa from redshifts outside the box
ADD_EXTRA_KAPPA = yes
# Compile with HDF5 capability? Set to "yes" or "no"
-USE_HDF5 = yes
+USE_HDF5 = no
# Use OMP parallelization? Set to "yes" or "no"
USE_OMP = yes
# Use MPI parallelization? Set to "yes" or "no"
@@ -29,29 +29,27 @@ USE_MPI = yes
# If two or more of the dependencies reside in the same paths, only
# one instance is necessary.
# GSL
-#GSL_INC = -I/add/path
-#GSL_LIB = -L/add/path
-GSL_INC = -I/home/alonso/include
-GSL_LIB = -L/home/alonso/lib
+GSL_INC = -I${GSL_DIR}/include
+GSL_LIB = -L${GSL_DIR}/lib -lgsl -lgslcblas
# FFTW
-FFTW_INC =
-FFTW_LIB =
-#cfitsio
-FITS_INC =
-FITS_LIB =
+FFTW_INC = -I${FFTW_DIR}/include
+FFTW_LIB = -L${FFTW_DIR}/lib
# cfitsio
+FITS_INC = -I${CFITSIO_DIR}/include
+FITS_LIB = -L${CFITSIO_DIR}/lib -lcfitsio
+#HDF5
HDF5_INC =
HDF5_LIB =
# libconfig
-CONF_INC =
-CONF_LIB =
+CONF_INC = -I${COLORE_PATH}/Install.cori/include
+CONF_LIB = -L${COLORE_PATH}/Install.cori/lib
# healpix
-HPIX_INC =
-HPIX_LIB =
+HPIX_INC = -I${HEALPIX_PATH}/include
+HPIX_LIB = -L${HEALPIX_PATH}/lib
# libsharp
-SHT_INC =
-SHT_LIB =
-#
+SHT_INC = -I${COLORE_PATH}/libsharp/auto/include
+SHT_LIB = -L${COLORE_PATH}/libsharp/auto/lib
+

andreufont commented 4 years ago

Note: I did the above from an interactive run. Not sure if I should have sourced any file before running... I'll try again tomorrow.

damonge commented 4 years ago

OK, thanks. I can't reproduce this on my laptop, so it may be a NERSC thing. I'll try to reproduce it this week. Just to check: did you try again with the latest modifications in the cleanup branch?

andreufont commented 4 years ago

Yes, same error in both branches. I'm recompiling from scratch now after chatting with Javi on the Slack channel.

andreufont commented 4 years ago

I got a fresh copy of the repo and recompiled the new branch following the instructions. I get the same error, but the code finishes and I can make all the pretty plots:

salloc -N 1 -C haswell -q interactive -t 01:00:00
source ~/setup_CoLoRe.sh # load relevant modules
mpirun -n 1 ./CoLoRe examples/simple/param.cfg > examples/simple/info &

Error in `./CoLoRe': corrupted size vs. prev_size: 0x0000010080411670

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun noticed that process rank 0 with PID 6674 on node nid00042 exited on signal 6 (Aborted).

fjaviersanchez commented 4 years ago

I get the same error as @andreufont. The code runs all the way to the end, but then it dies. I think there's a reused memory allocation somewhere that is messing things up. I'll run a backtrace.

fjaviersanchez commented 4 years ago

This is the backtrace:

*** Error in `/global/u2/j/jsanch87/CoLoRe_edison/CoLoRe': free(): invalid next size (normal): 0x000001008c8b88d0 ***

Thread 1 "CoLoRe" received signal SIGABRT, Aborted.
raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000000020987501 in abort () at abort.c:79
#2  0x00000000209a8617 in __libc_message (action=action@entry=(do_abort | do_backtrace), fmt=fmt@entry=0x20b9f160 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:181
#3  0x00000000209aea43 in malloc_printerr (action=<optimized out>, str=0x20b9f498 "free(): invalid next size (normal)", ptr=<optimized out>, ar_ptr=<optimized out>)
    at malloc.c:5428
#4  0x00000000209b02f1 in _int_free (av=0x40cd8e20 <main_arena>, p=<optimized out>, have_lock=<optimized out>) at malloc.c:4170
#5  0x000000002000358d in catalog_free ()
#6  0x00000000200171a8 in param_colore_free ()
#7  0x00000000200270c8 in main ()

fjaviersanchez commented 4 years ago

More clues: it only happens for n_grid>=128 and when the store_skewers parameter for srcs2 is true. It works fine if you don't store the skewers for srcs2.

damonge commented 4 years ago

ok, I'll try to run valgrind asap on this.

fjaviersanchez commented 4 years ago

I ran valgrind. I can send you the output if you want :) (It looks like the guilty part is line 531 of common.c.)

damonge commented 4 years ago

Ok, I can't really reproduce this on my laptop (and valgrind doesn't find anything either). I'm gonna try NERSC, but it'd be good if you can send me your valgrind output @fjaviersanchez

damonge commented 4 years ago

OK, no idea what's going on. I can reproduce the error at NERSC, but my valgrind doesn't really point to any memory leaks in the CoLoRe code (there's some stuff related to MPI, but that's it). Your log file would be useful @fjaviersanchez . In the meantime I'm trying some other stuff.

damonge commented 4 years ago

OK, this is where we stand:

I cannot reproduce this on my laptop, but I can reproduce it at NERSC.
I can't really tell where the problem is coming from. Valgrind seems to point towards one specific line of code, but if you move that line around the error goes away.
I've been having similar errors with other codes in NERSC that work absolutely fine everywhere else.
If I run module unload craype-hugepages2M, the problem goes away completely. This also solves the issues I've seen with all my other codes, so I suspect this is something to do with the new NERSC compilers not liking something about libsharp (which is the common denominator in all of them).

So, @andreufont , @fjaviersanchez if you could check that the problem goes away for you after running module unload craype-hugepages2M, please let me know and I'll just add this to the Cori instructions and close this issue.
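A minimal sketch of how one might check for the module before launching CoLoRe. It assumes the module system exports the loaded modules as a colon-separated list in the LOADEDMODULES environment variable (as Environment Modules does); the function name hugepages_module_loaded is made up for illustration and is not part of CoLoRe or the Cori instructions.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if any loaded module name starts with
# "craype-hugepages", judging only from the LOADEDMODULES variable.
hugepages_module_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:craype-hugepages*) return 0 ;;
        *) return 1 ;;
    esac
}

if hugepages_module_loaded; then
    # The workaround from this thread; it must run in the current shell.
    echo "hugepages module loaded, run: module unload craype-hugepages2M"
else
    echo "no craype-hugepages module loaded"
fi
```

The script only reports what it finds. The unload itself has to be typed (or sourced) in the shell that later calls mpirun, because a child script cannot change the parent shell's environment.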

fjaviersanchez commented 4 years ago

I confirm this fixes it. Also I saw this on another cray-based system's help page about the craype-hugepages2M module (from http://www.archer.ac.uk/about-archer/software/modcatalogue/craype-hugepages2M/):

NOTE !!!: If executing on node other than compute, unload this module.

So I guess we have to update the instructions to say this. It probably works fine on the compute nodes without doing this, but it crashes on the interactive nodes, so better safe than sorry.

Also some more info here: https://docs.nersc.gov/performance/variability/

damonge commented 4 years ago

OK, but this is what confuses me: we need to unload this in the compute node. Anyway, I'm trying to track this down, but I think updating the instructions for now is good enough. I did so in #50 .

Thanks!

damonge commented 4 years ago

So, after talking to the LSST NERSC wizards, it may be that we take a hit by turning off hugepages, but it's not clear. So this means that we may have to reopen this if we realize that CoLoRe is now much slower. For now I'll just merge #50 and solve this.

damonge / CoLoRe

Fix new example folders #49
