Closed wilsonmr closed 6 years ago
Will do
Yeah ok, 8 out of 10 replicas successfully ran and produced all files so it seems fast evolution is the origin of the problem
Great, thanks. I will have a look.
Could you please revert nnpdfcpp to master, then modify and recompile APFEL (fixstringlogics branch) with the diff below, and then rerun nnfit?
diff --git a/src/Evolution/odeintnsQCD.f b/src/Evolution/odeintnsQCD.f
index c75df28..7a74651 100644
--- a/src/Evolution/odeintnsQCD.f
+++ b/src/Evolution/odeintnsQCD.f
@@ -73,6 +73,7 @@
*
write(6,*) "In odeintns.f:"
write(6,*) "too many steps!"
+ write(6,*) i,mu21,mu22,ystart,y
call exit(-10)
*
return
ok so some replicas got the old error
In odeintns.f:
stepsize underflow in rkqsns
However, some replicas clearly printed that extra output. It's a bit long to post here, but there are some NaNs; is that what you were saying could be an issue? I can send you the full output via email if you want.
Yes, please send it to me by email. I guess we are close to finding some variable that is not initialized properly, or a numerical rounding issue due to your cluster setup.
Ok, could you please reset APFEL (from the PR) and apply this patch, then reset nnfit, recompile everything and rerun? (This will check whether the problem is cluster memory or not.) Could you please tell me the gfortran version you are using (gfortran --version)?
diff --git a/src/commons/grid.h b/src/commons/grid.h
index 1dff3cb..32c1f77 100644
--- a/src/commons/grid.h
+++ b/src/commons/grid.h
@@ -1,7 +1,7 @@
* -*-fortran-*-
integer nint_max
- parameter(nint_max=200)
+ parameter(nint_max=50)
integer nint_max_DIS
parameter(nint_max_DIS=120)
integer ngrid_max
Just to confirm, which branch of apfel should I recompile: fixstringleak, fixstringlogics or master?
I'm using the conda gfortran, but it relies on the environment variable $GFORTRAN, for example:
(apfeltest) [s1758208@login04(eddie) s1758208]$ gfortran --version
GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
Copyright (C) 2015 Free Software Foundation, Inc.
GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING
(apfeltest) [s1758208@login04(eddie) s1758208]$ $GFORTRAN --version
GNU Fortran (crosstool-NG fa8859cb) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
but the compilation, I'm pretty sure, is using the correct compilers
checking whether we are using the GNU C++ compiler... yes
checking whether /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/x86_64-conda_cos6-linux-gnu-c++ accepts -g... yes
checking for style of include used by make... GNU
checking dependency style of /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/x86_64-conda_cos6-linux-gnu-c++... gcc3
checking whether we are using the GNU Fortran compiler... yes
checking whether /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/x86_64-conda_cos6-linux-gnu-gfortran accepts -g... yes
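The two banners above differ in layout, so comparing them by eye is error-prone. As a minimal sketch (the helper name `parse_gfortran_version` is hypothetical and not part of any NNPDF or APFEL tooling), the version triple can be extracted from the first line of each `--version` output and compared:

```python
import re

def parse_gfortran_version(banner: str) -> tuple:
    """Extract (major, minor, patch) from the first line of `gfortran --version`."""
    first_line = banner.strip().splitlines()[0]
    # The version is the first dotted triple after a closing parenthesis,
    # e.g. "GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)".
    match = re.search(r"\)\s+(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"no version found in: {first_line!r}")
    return tuple(int(part) for part in match.groups())

# First lines of the two banners quoted in this thread.
system_banner = "GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)"
conda_banner = "GNU Fortran (crosstool-NG fa8859cb) 7.2.0"

system_version = parse_gfortran_version(system_banner)
conda_version = parse_gfortran_version(conda_banner)
print(system_version, conda_version)
# Tuples compare element by element, so this is a proper version comparison.
print("conda compiler is newer:", conda_version > system_version)
```

This only tells you which compiler a given command resolves to; whether the build actually used it is what the configure output above shows.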
Always fixstringleak, as PR #9. So are you sure your compiler is > 4.x.x?
That seems to be using gfortran 7.2
I'm fairly certain, yes. As Tommaso mentioned in the code call, we are using the method described here: http://pcteserver.mi.infn.it/~nnpdf/validphys-docs/guide.html#development-installs. However, for compiling apfel we take the additional steps of installing the conda gfortran compilers, then removing apfel and compiling it from source.
I guess it relies on conda properly setting the environment variables when I enter the environment, which I'm confident it does, and on the apfel compilation using them, which again I think it does.
I should say without debug flags the only warning I get when compiling apfel is
libtool: warning: library '/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.2.0/../../../../x86_64-conda_cos6-linux-gnu/lib/../lib/libstdc++.la' was moved.
and
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
Those warnings are fine.
@scarrazza do I need to edit something else?
In initGrid.f:
Number of grid points too large:
found = 83
Maximum value allowed = 50
You should reduce it.
Ok, please set nint_max=100 instead of 50.
Concerning gfortran, could you please post here the output of ldd <anaconda env>/libAPFEL.so? Thanks.
It says it's not a dynamic executable
does this help? I guess not.
readelf -d /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libAPFEL.so
Dynamic section at offset 0x17cd58 contains 28 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libgfortran.so.4]
0x0000000000000001 (NEEDED) Shared library: [libquadmath.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x000000000000000e (SONAME) Library soname: [libAPFEL.so.0]
0x000000000000000f (RPATH) Library rpath: [/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib:/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.2.0/../../../../x86_64-conda_cos6-linux-gnu/lib/../lib]
0x000000000000000c (INIT) 0x17750
0x000000000000000d (FINI) 0xf69e0
0x0000000000000019 (INIT_ARRAY) 0x37ccc8
0x000000000000001b (INIT_ARRAYSZ) 48 (bytes)
0x0000000000000004 (HASH) 0x190
0x0000000000000005 (STRTAB) 0xa1a0
0x0000000000000006 (SYMTAB) 0x25f8
0x000000000000000a (STRSZ) 26368 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x37cf58
0x0000000000000007 (RELA) 0x11420
0x0000000000000008 (RELASZ) 25392 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x0000000000000018 (BIND_NOW)
0x000000006ffffffb (FLAGS_1) Flags: NOW
0x000000006ffffffe (VERNEED) 0x112f0
0x000000006fffffff (VERNEEDNUM) 5
0x000000006ffffff0 (VERSYM) 0x108a0
0x000000006ffffff9 (RELACOUNT) 9
0x0000000000000000 (NULL) 0x0
Can you do strace ldd libAPFEL.so and paste the whole output? I guess it is running out of RAM somewhere.
yes probably, the master nodes have become really restrictive recently
ok I submitted a job that ran ldd libAPFEL.so
linux-vdso.so.1 => (0x00007ffc9cdbf000)
libgfortran.so.4 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libgfortran.so.4 (0x00002af779f69000)
libquadmath.so.0 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libquadmath.so.0 (0x00002af77a293000)
libstdc++.so.6 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libstdc++.so.6 (0x00002af77a4c4000)
libm.so.6 => /lib64/libm.so.6 (0x00002af77a7fe000)
libc.so.6 => /lib64/libc.so.6 (0x00002af77ab00000)
libgcc_s.so.1 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libgcc_s.so.1 (0x00002af77aec1000)
/lib64/ld-linux-x86-64.so.2 (0x00002af6c072d000)
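The point of the ldd check is to see which runtime libraries come from the conda environment and which from the system. A rough sketch of that split, using the output above as sample data (`split_ldd_by_prefix` is a hypothetical helper written for illustration only):

```python
def split_ldd_by_prefix(ldd_output: str, prefix: str):
    """Return (inside, outside): library names resolved under `prefix` or elsewhere."""
    inside, outside = [], []
    for line in ldd_output.splitlines():
        # Only lines of the form "name => /path (address)" carry a resolved path.
        if "=>" not in line:
            continue
        name, _, target = line.partition("=>")
        fields = target.split()
        if not fields or not fields[0].startswith("/"):
            continue  # e.g. linux-vdso has no file on disk
        bucket = inside if fields[0].startswith(prefix) else outside
        bucket.append(name.strip())
    return inside, outside

PREFIX = "/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest"
LDD_OUTPUT = f"""\
linux-vdso.so.1 => (0x00007ffc9cdbf000)
libgfortran.so.4 => {PREFIX}/lib/libgfortran.so.4 (0x00002af779f69000)
libquadmath.so.0 => {PREFIX}/lib/libquadmath.so.0 (0x00002af77a293000)
libstdc++.so.6 => {PREFIX}/lib/libstdc++.so.6 (0x00002af77a4c4000)
libm.so.6 => /lib64/libm.so.6 (0x00002af77a7fe000)
libc.so.6 => /lib64/libc.so.6 (0x00002af77ab00000)
libgcc_s.so.1 => {PREFIX}/lib/libgcc_s.so.1 (0x00002af77aec1000)
/lib64/ld-linux-x86-64.so.2 (0x00002af6c072d000)
"""

inside, outside = split_ldd_by_prefix(LDD_OUTPUT, PREFIX)
print("from conda env:", inside)
print("from system:", outside)
```

Here the Fortran and C++ runtimes resolve inside the environment, while libm and libc come from /lib64.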
sorry I'm trying to run that thing you sent but I'm just getting seg faults. Maybe I did something wrong
Good, are you getting the segfault before or after the point where the "too many steps" stop appears?
before (in fact before anything is being outputted):
PDFBasis:: initialised basis: NN31ICFitBasis
Selecting FitBasis: NN31IC
** New Log File Generated. Log 'GAMinimizer' at nmcapfel/nnfit/replica_1/GAMin.log
Minimiser: Genetic Algorithm w/ nodal mutations
PDF: NNPDF_Fit ErrorType: No Errors booked
/var/spool/gridscheduler/execd/node2b01/job_scripts/2529128: line 31: 23230 Segmentation fault (core dumped) nnfit $SGE_TASK_ID nmcapfel
My guess is that the cluster node cannot allocate the memory required by APFEL. Thanks for the information, I think I'm getting close to your installation setup (libgfortran.so.4, gcc/gfortran 7.2.0 from conda, etc.) I will let you know if I manage to reproduce your crash.
Meanwhile, could you please revert to the original APFEL (fixstringleak) and nnpdf (master), and then compile APFEL with the sanitizer enabled (apply the diff below and then run autoreconf -i):
diff --git a/configure.ac b/configure.ac
index 2509e11..488a799 100644
--- a/configure.ac
+++ b/configure.ac
@@ -13,7 +13,7 @@ AC_CONFIG_HEADERS([config/config.h include/APFEL/FortranWrappers.h])
## Set Fortran compiler behaviour
if test "x$FCFLAGS" == "x"; then
- FCFLAGS="-O3 -Wunused"
+ FCFLAGS="-g -O3 -Wunused -fstack-protector-strong -fsanitize=address"
fi
# Try to respect users' Fortran compiler variables
if test "x$FC" == "x"; then
@@ -129,7 +129,7 @@ fi
## Set final FCFLAGS, CXXFLAGS and CPPFLAGS
-AM_CPPFLAGS="$AM_CPPFLAGS -I\$(top_srcdir)/include -I\$(top_builddir)/include"
+AM_CPPFLAGS="$AM_CPPFLAGS -fstack-protector-strong -I\$(top_srcdir)/include -I\$(top_builddir)/include"
AM_CPPFLAGS=["$AM_CPPFLAGS -DDATA_PATH="$datadir" -DAPFEL_VERSION="$PACKAGE_VERSION" "]
AC_SUBST(AM_CPPFLAGS)
And then configure nnpdf in Debug mode (you can easily change that with ccmake), compile and rerun nnfit?
I did that and I can't seem to run filter or nnfit, getting
==53668==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==53668==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
On my side, I have set up exactly the same environment you have on my laptop and server; both of them work well (no leaks, no faults).
I think the ASAN error you quote is further evidence of low memory (less than 3.7 GB).
I would like to check your cluster submission instructions, but I do not have permission to view the wiki page you have linked. Could you please send it to me as a PDF?
I'd like to try ASAN on a system where the sysadmins enforce a hard limit. Is this possible?
No. ASAN requires 20 TB (plus a bit) of virtual memory to function properly. Talk to your sysadmins to relax their limits.
I guess this isn't so relevant but just in case you want to play with ulimit, I am in an interactive session on one of the job nodes and this is the output of ulimit:
[s1758208@node1h19 ~]$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514047
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) 4194304
file locks (-x) unlimited
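To put the numbers side by side: the virtual memory limit above is 4194304 kB, while the AddressSanitizer error earlier in the thread reports a failed mapping of 0xdfff0001000 bytes. A small sketch of the comparison, using Python's standard `resource` module (Unix only); `fits_in_limit` is a hypothetical helper name:

```python
import resource

# Size AddressSanitizer tried to reserve, taken from the error message in this thread.
ASAN_SHADOW_BYTES = 0xDFFF0001000

def fits_in_limit(required_bytes: int, limit_bytes: int) -> bool:
    """True if a mapping of `required_bytes` fits under an address-space limit."""
    return limit_bytes == resource.RLIM_INFINITY or required_bytes <= limit_bytes

# `ulimit -v` reports units of 1024 bytes, so 4194304 corresponds to 4 GiB.
node_limit_bytes = 4194304 * 1024
print("ASAN reservation in TiB:", ASAN_SHADOW_BYTES / 1024**4)
print("node limit in GiB:", node_limit_bytes / 1024**3)
print("fits under node limit:", fits_in_limit(ASAN_SHADOW_BYTES, node_limit_bytes))

# Soft address-space limit of the current process, for comparison.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_AS)
print("fits under current limit:", fits_in_limit(ASAN_SHADOW_BYTES, soft_limit))
```

With a 4 GiB cap the reservation cannot succeed, which is consistent with the "Perhaps you're using ulimit -v" hint in the error.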
I am currently running the same fit on my desktop. Using an almost identical setup to the cluster in terms of conda environment. If it succeeds I will put some more effort into recreating the exact install procedure I used on the cluster and see if I can recreate the error
ok I got access to a node and ran nnfit in debug mode, do I need to run something specific to get asan output?
**** Producing T0 Predictions with Set NNPDF31_nnlo_as_0118
- Final Positivity Test
- Positivity Vetoed
- Writing fitinfo file...
- Computing arclengths...
- Writing sumrules file...
- Writing preproc file...
- Writing params file...
- Printing grid to file: nmcapfel/nnfit/replica_1/nmcapfel.gridvalues
- Writing out LHAPDF grid: nmcapfel
- Solving DGLAP for LHAPDF grid...
In odeintns.f:
too many steps!
Thanks for using LHAPDF 6.2.1. Please make sure to cite the paper:
Eur.Phys.J. C75 (2015) 3, 132 (http://arxiv.org/abs/1412.7420)
=================================================================
==196341==ERROR: LeakSanitizer: detected memory leaks
Indirect leak of 330288 byte(s) in 768 object(s) allocated from:
#0 0x7f3ceca63afc in __interceptor_malloc /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libsanitizer/asan/asan_malloc_linux.cc:62
#1 0x7f3c30e51754 in sqlite3MemMalloc (/exports/csce/eddie/ph/groups/rbm_ml/michael/miniconda/envs/nnpdf-dev/lib/libsqlite3.so.0+0x40754)
Indirect leak of 4968 byte(s) in 15 object(s) allocated from:
#0 0x7f3ceca63df8 in __interceptor_realloc /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libsanitizer/asan/asan_malloc_linux.cc:75
#1 0x7f3c30e516f9 in sqlite3MemRealloc (/exports/csce/eddie/ph/groups/rbm_ml/michael/miniconda/envs/nnpdf-dev/lib/libsqlite3.so.0+0x406f9)
SUMMARY: AddressSanitizer: 335256 byte(s) leaked in 783 allocation(s).
It seems as if the address sanitiser hasn't told us anything, though? The sqlite leaks were already known, and it would have exited earlier if there was a leak in apfel.
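One way to check that reading of the report is to total the leaked bytes per shared library. The sketch below is only an illustration: `leaked_bytes_by_library` is a hypothetical helper, and attributing each leak to the first frame that names a library is a simplifying assumption, not how LeakSanitizer itself classifies leaks.

```python
import os
import re

def leaked_bytes_by_library(report: str) -> dict:
    """Sum leaked bytes per shared library, attributing each leak to the
    first stack frame that names a library file in parentheses."""
    totals = {}
    pending = None  # size of the leak whose frames are currently being read
    for line in report.splitlines():
        header = re.search(r"leak of (\d+) byte\(s\)", line)
        if header:
            pending = int(header.group(1))
            continue
        # Frames inside a library look like "... (/path/to/lib.so.0+0x40754)".
        frame = re.search(r"\((/\S+)\+0x[0-9a-f]+\)", line)
        if frame and pending is not None:
            lib = os.path.basename(frame.group(1))
            totals[lib] = totals.get(lib, 0) + pending
            pending = None
    return totals

ENV = "/exports/csce/eddie/ph/groups/rbm_ml/michael/miniconda/envs/nnpdf-dev"
REPORT = f"""\
Indirect leak of 330288 byte(s) in 768 object(s) allocated from:
#0 0x7f3ceca63afc in __interceptor_malloc asan_malloc_linux.cc:62
#1 0x7f3c30e51754 in sqlite3MemMalloc ({ENV}/lib/libsqlite3.so.0+0x40754)
Indirect leak of 4968 byte(s) in 15 object(s) allocated from:
#0 0x7f3ceca63df8 in __interceptor_realloc asan_malloc_linux.cc:75
#1 0x7f3c30e516f9 in sqlite3MemRealloc ({ENV}/lib/libsqlite3.so.0+0x406f9)
"""

totals = leaked_bytes_by_library(REPORT)
print(totals)
print("any APFEL leak:", any("APFEL" in lib for lib in totals))
```

The per-library total for libsqlite3 equals the 335256 bytes in the SUMMARY line, so nothing in this report points at libAPFEL.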
Is this error related in any way to the compilation options? As said earlier, my bet is that this error is simply saying that the replicas are not very smooth, due to the small number of iterations. Do you get anything different if you compile apfel without the stack protector?
Trying that now. I changed the flags, removing -fstack-protector-strong and replacing it with -fno-stack-protector. Is that correct?
Yes, I'd be curious if that makes a difference.
As far as I can tell, provided that I did it correctly I don't think it made any difference
I ran using the latest conda package, no compilation, I get the
In odeintns.f:
stepsize underflow in rkqsns
error. I guess I should try the tests @scarrazza was mentioning in the phone conference
Can you post the runcard you are using? I may try this as well...
Try this, the description is nonsense. I will also try using the new package on my work desktop
Does anyone know why APFEL prints "Intrinsic charm disabled" on theory 53?
The first splash is called by CheckAPFEL, which uses a default setup. The second splash screen should quote the variables correctly.
I have run the runcard above once with the conda packages and got a segfault after initializing apfel. I have run it a second and third time and it is still running (I may not let it finish, since I am not in the mood for googling how to use screen).
I then compiled both nnpdf and apfel (master versions of both) with all the debug flags (except -fimplicit-none, which causes apfel to not compile) and got this from ASAN:
Initialization of the DIS module completed in 41.170 s
Check ... succeded
ASAN:DEADLYSIGNAL
=================================================================
==7772==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffce2068ba8 (pc 0x7f86015639f9 bp 0x7ffc710340c0 sp 0x7ffc710338c8 T0)
==7772==The signal is caused by a READ memory access.
#0 0x7f86015639f8 in _gfortran_string_len_trim /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libgfortran/intrinsics/string_intrinsics_inc.c:218
#1 0x7f860386e0a9 in setpdfset_ (/home/zaharik/miniconda3/lib/libAPFEL.so.0+0x280a9)
#2 0x7f8603939ad1 in APFEL::SetPDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (/home/zaharik/miniconda3/lib/libAPFEL.so.0+0xf3ad1)
#3 0x7f86be54019c in APFELSingleton::Initialize(NNPDFSettings const&, NNPDF::PDFSet* const&) /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/src/apfelevol.cc:278
#4 0x7f86be582a10 in FitPDFSet::FitPDFSet(NNPDFSettings const&, FitBasis*) /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/src/fitpdfset.cc:52
#5 0x7f86be535498 in FitPDFSet* FitPDFSet::Generate<NNPDF::MultiLayerPerceptron, GAMinimizer>(NNPDFSettings const&, FitBasis*) /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/inc/fitpdfset.h:39
#6 0x7f86be50a614 in main /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/src/nnfit.cc:165
#7 0x7f860287dd1c in __libc_start_main (/lib64/libc.so.6+0x3b7de1ed1c)
#8 0x7f86be522d9b (/home/zaharik/miniconda3/envs/apfel-dbg/bin/nnfit+0x4cd9b)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libgfortran/intrinsics/string_intrinsics_inc.c:218 in _gfortran_string_len_trim
==7772==ABORTING
@scarrazza can you see what is the problem?
Thanks, let me try to reproduce that.
Annoyingly enough, it doesn't happen always to me, even when I rerun the same thing.
I also got:
Checking APFEL v3.0.2 ...
At line 8 of file DIS/SetProjectileDIS.f
Fortran runtime error: Actual string length is shorter than the declared one for dummy argument 'lept' (8/12)
Could you please post here all the gfortran flags you are using?
I have
$ echo $FFLAGS
-fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fcheck=all -fbacktrace -fvar-tracking-assignments -pipe
This is like DEBUG_FFLAGS in a conda environment, but with -fimplicit-none removed.
Great, thanks. I can reproduce your error messages, and it looks like my PR is just 1% of the fix, so I have to extend the fix to all places where the dummy string size is set to a custom number. Moreover, the compilation warnings look very bad.
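For readers unfamiliar with the "(8/12)" in the runtime error above: the caller passed 8 characters to a dummy argument declared with a fixed length of 12. One generic way to avoid such a mismatch is to blank-pad the string at the call site to the declared length. The Python sketch below only illustrates that idea; it is not the fix applied in APFEL, `pad_for_fortran` is a hypothetical helper, and the value "electron" is just an illustrative 8-character input.

```python
def pad_for_fortran(value: str, declared_len: int) -> str:
    """Blank-pad `value` to the length a fixed-size Fortran dummy argument declares."""
    if len(value) > declared_len:
        raise ValueError(f"{value!r} is longer than {declared_len} characters")
    # Fortran character variables are padded on the right with blanks.
    return value.ljust(declared_len)

padded = pad_for_fortran("electron", 12)
print(repr(padded), len(padded))
```

The trailing blanks are insignificant on the Fortran side, so the padded string carries the same value while matching the declared length.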
We managed to isolate the issue and confirm that there is a memory issue, see https://github.com/scarrazza/apfel/pull/11. However the fix, if any, is not trivial.
See discussion in #173.
Hello,
I'm having an issue running fits on the cluster at Edinburgh: the fits appear to finish, but they do not output the LHAPDF grids, so we don't get the results.
@nhartland suggests it is a problem concerning APFEL.
I have pasted below the final bit of output from the fit; I have the full outputs available for a few different configurations of fits: