NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

APFEL error at the end of the fit #151

Closed wilsonmr closed 6 years ago

wilsonmr commented 6 years ago

Hello,

I'm having an issue running fits on the cluster at Edinburgh. The fits appear to finish, but they do not write the LHAPDF grids, so we don't get the results.

@nhartland suggests it is a problem concerning APFEL.

I have pasted the final part of the fit output below. I also have the full outputs available for a few different fit configurations:

 **** Producing T0 Predictions with Set 170206-003

- Final Positivity Test
- Passed all points for POSF2U
- Passed all points for POSF2DW
- Passed all points for POSF2S
- Passed all points for POSFLL
- Passed all points for POSDYU
- Passed all points for POSDYD
- Passed all points for POSDYS

- Writing fitinfo file...
- Computing arclengths...
- Writing sumrules file...
- Writing preproc file...
- Writing params file...
- Writing out LHAPDF grid: 180321-edi-002
- Solving DGLAP for LHAPDF grid...
Thanks for using LHAPDF 6.2.1. Please make sure to cite the paper:
  Eur.Phys.J. C75 (2015) 3, 132  (http://arxiv.org/abs/1412.7420)

 Checking APFEL v3.0.2  ...

 WARNING: FONLL-C is a VFN scheme
          ... setting VFNS PDF evolution
 WARNING: FONLL-C is a NNLO scheme
          ... setting NNLO perturbative order

 Welcome to 
      _/_/_/    _/_/_/_/   _/_/_/_/   _/_/_/_/   _/
    _/    _/   _/    _/   _/         _/         _/
   _/_/_/_/   _/_/_/_/   _/_/_/     _/_/_/     _/
  _/    _/   _/         _/         _/         _/
 _/    _/   _/         _/         _/_/_/_/   _/_/_/_/
 _____v3.0.2 A PDF Evolution Library, arXiv:1310.1394
      Authors: V. Bertone, S. Carrazza, J. Rojo

 Report of the evolution parameters:

 QCD evolution
 Space-like evolution (PDFs)
 Unpolarized evolution
 Evolution scheme: VFNS at N2LO
 Solution of the DGLAP equation: 'exactalpha' with maximum 6 active flavours
 Solution of the coupling equations: 'exact' with maximum 6 active flavours
 Coupling reference value:
 - AlphaQCD(  1.4142 GeV) =  0.350000
 Pole heavy quark masses:
 - Mc =   1.4142 GeV
 - Mb =   4.5000 GeV
 - Mt = 175.0000 GeV
 The matching thresholds coincide with the physical masses
 muR / muF =  1.0000

 Allowed evolution range [   1.0000 :  10000.0000 ] GeV
 The internal subgrids will be locked
 Fast evolution enabled

 Initialization of the evolution completed in   4.256 s

 Report of the electroweak parameters:

 Mass of the Z = 91.188 GeV
 Mass of the W = 80.385 GeV
 Mass of the proton = 0.9383 GeV
 sin^2(thetaW) = 0.2313
 GFermi = 1.16638E-05
       | 0.9743 0.2254 0.0036 |
 CKM = | 0.2252 0.9734 0.0414 |
       | 0.0089 0.0405 0.9991 |
 Z propagator correction = 0.00000

 Report of the DIS parameters:

 Computation in the FONLL-C mass scheme
 Electromagnetic (EM) process
 Scattering electron - proton   
 muR / Q =  1.0000
 muF / Q =  1.0000
 Target Mass corrections disabled
 FONLL damping factor for charm enabled with suppression power = 2
 FONLL damping factor for bottom enabled with suppression power = 2
 FONLL damping factor for top enabled with suppression power = 2
 Intrinsic charm disabled

 Initialization of the DIS module completed in  45.390 s

 Check ... succeded

 WARNING: if there are external grids they cannot be locked
          ... unlocking subgrids

 Welcome to 
      _/_/_/    _/_/_/_/   _/_/_/_/   _/_/_/_/   _/
    _/    _/   _/    _/   _/         _/         _/
   _/_/_/_/   _/_/_/_/   _/_/_/     _/_/_/     _/
  _/    _/   _/         _/         _/         _/
 _/    _/   _/         _/         _/_/_/_/   _/_/_/_/
 _____v3.0.2 A PDF Evolution Library, arXiv:1310.1394
      Authors: V. Bertone, S. Carrazza, J. Rojo

 Report of the evolution parameters:

 QCD evolution
 Space-like evolution (PDFs)
 Unpolarized evolution
 Evolution scheme: VFNS at N2LO
 Solution of the DGLAP equation: 'truncated' with maximum 5 active flavours
 - value of the truncation parameter epsilon = 1.000E-02
 Solution of the coupling equations: 'expanded' with maximum 5 active flavours
 Coupling reference value:
 - AlphaQCD( 91.2000 GeV) =  0.118000
 Pole heavy quark masses:
 - Mc =   1.5100 GeV
 - Mb =   4.9200 GeV
 - Mt = 172.5000 GeV
 The matching thresholds coincide with the physical masses
 muR / muF =  1.0000

 Allowed evolution range [   1.6500 : 100000.0000 ] GeV

 Initialization of the evolution completed in 917.283 s

 In odeintns.f:
 stepsize underflow in rkqsns

wilsonmr commented 6 years ago

Will do

wilsonmr commented 6 years ago

Yeah ok, 8 out of 10 replicas ran successfully and produced all files, so it seems fast evolution is the origin of the problem.

scarrazza commented 6 years ago

Great, thanks. I will have a look.

scarrazza commented 6 years ago

Could you please revert nnpdfcpp to master, apply the diff below to APFEL (fixstringlogics branch), recompile it, and then rerun nnfit?

diff --git a/src/Evolution/odeintnsQCD.f b/src/Evolution/odeintnsQCD.f
index c75df28..7a74651 100644
--- a/src/Evolution/odeintnsQCD.f
+++ b/src/Evolution/odeintnsQCD.f
@@ -73,6 +73,7 @@
 *
       write(6,*) "In odeintns.f:"
       write(6,*) "too many steps!"
+      write(6,*) i,mu21,mu22,ystart,y
       call exit(-10)
 *
       return

wilsonmr commented 6 years ago

OK, so some replicas got the old error:

In odeintns.f:
 stepsize underflow in rkqsns

However, some clearly printed the extra output. It's a bit long to post here, but it contains some NaNs. Is that what you were saying could be an issue? I can send you the full output via email if you want.
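With several replicas failing in different ways, it can help to sort the logs automatically. The following is a minimal sketch, assuming each replica's nnfit output has been captured as text; the function name `classify_replica_log` is hypothetical, and the two messages are copied from the logs quoted in this thread.

```python
import re

# Error messages copied verbatim from the logs quoted in this thread.
KNOWN_ERRORS = (
    "stepsize underflow in rkqsns",
    "too many steps!",
)


def classify_replica_log(text):
    """Return a list of the problems found in one replica's output."""
    problems = [message for message in KNOWN_ERRORS if message in text]
    # Look for NaN as a standalone token so that ordinary words never match.
    if re.search(r"\bnan\b", text, flags=re.IGNORECASE):
        problems.append("NaN in output")
    return problems
```

An empty list means none of the known symptoms were found, which is not by itself proof that the replica wrote its LHAPDF grid.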

scarrazza commented 6 years ago

Yes, please send it to me by email. My guess is that we are dealing with either a variable that is not initialized properly or numerical rounding due to your cluster setup.

scarrazza commented 6 years ago

OK, could you please reset APFEL (to the version from the PR), apply this patch, reset nnfit, recompile everything and rerun? This will check whether the problem is cluster memory. Could you also tell me which gfortran version you are using (gfortran --version)?

diff --git a/src/commons/grid.h b/src/commons/grid.h
index 1dff3cb..32c1f77 100644
--- a/src/commons/grid.h
+++ b/src/commons/grid.h
@@ -1,7 +1,7 @@
 *     -*-fortran-*-

       integer nint_max
-      parameter(nint_max=200)
+      parameter(nint_max=50)
       integer nint_max_DIS
       parameter(nint_max_DIS=120)
       integer ngrid_max

wilsonmr commented 6 years ago

Just to confirm, which branch of APFEL should I recompile: fixstringleak, fixstringlogics or master? I'm using the conda gfortran, but it relies on the $GFORTRAN environment variable, for example:

(apfeltest) [s1758208@login04(eddie) s1758208]$ gfortran --version
GNU Fortran (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
Copyright (C) 2015 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

(apfeltest) [s1758208@login04(eddie) s1758208]$ $GFORTRAN --version
GNU Fortran (crosstool-NG fa8859cb) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

wilsonmr commented 6 years ago

But I'm pretty sure the compilation is using the correct compilers:

checking whether we are using the GNU C++ compiler... yes
checking whether /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/x86_64-conda_cos6-linux-gnu-c++ accepts -g... yes
checking for style of include used by make... GNU
checking dependency style of /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/x86_64-conda_cos6-linux-gnu-c++... gcc3
checking whether we are using the GNU Fortran compiler... yes
checking whether /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/x86_64-conda_cos6-linux-gnu-gfortran accepts -g... yes

scarrazza commented 6 years ago

Always fixstringleak, as in PR #9. So are you sure your compiler is newer than 4.x.x?

Zaharid commented 6 years ago

That seems to be using gfortran 7.2

wilsonmr commented 6 years ago

I'm fairly certain, yes. As Tommaso mentioned in the code call, we are using the method described here: http://pcteserver.mi.infn.it/~nnpdf/validphys-docs/guide.html#development-installs. However, for compiling APFEL we take the additional steps of installing the conda gfortran compilers, removing the installed APFEL and compiling it from source.

I guess it relies on conda properly setting the environment variables when I enter the environment, which I'm confident it does, and on the APFEL compilation using them, which again I think it does.
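One way to double-check which compiler each command resolves to is to parse the first line of its `--version` output. The sketch below assumes the output format shown earlier in this thread; `parse_version` and `compiler_version` are hypothetical helper names, and `GFORTRAN` is the environment variable conda sets in the setup described above.

```python
import os
import re
import subprocess


def parse_version(version_output):
    """Extract (major, minor, patch) from the first line of `--version` output."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"no version number found in: {first_line!r}")
    return tuple(int(part) for part in match.groups())


def compiler_version(command):
    """Run `<command> --version` and parse the version it reports."""
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=True
    )
    return parse_version(result.stdout)


if __name__ == "__main__":
    # Compare the gfortran found on PATH with the one exported as $GFORTRAN.
    candidates = ["gfortran"]
    if os.environ.get("GFORTRAN"):
        candidates.append(os.environ["GFORTRAN"])
    for command in candidates:
        try:
            print(command, compiler_version(command))
        except (OSError, subprocess.CalledProcessError, ValueError) as exc:
            print(command, "could not be queried:", exc)
```

If the two commands report different versions, as they do in the session above, the build only uses the newer compiler when configure picks up the conda variables.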

wilsonmr commented 6 years ago

I should say that, without debug flags, the only warnings I get when compiling APFEL are

libtool: warning: library '/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.2.0/../../../../x86_64-conda_cos6-linux-gnu/lib/../lib/libstdc++.la' was moved.

and

cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++

Zaharid commented 6 years ago

Those warnings are fine.

wilsonmr commented 6 years ago

@scarrazza do I need to edit something else?

In initGrid.f:
 Number of grid points too large:
 found =          83
 Maximum value allowed =          50
 You should reduce it.

scarrazza commented 6 years ago

Ok, please set nint_max=100 instead of 50.

scarrazza commented 6 years ago

Concerning gfortran, could you please post here the output of ldd <anaconda env>/libAPFEL.so? Thanks.

wilsonmr commented 6 years ago

It says it's not a dynamic executable

wilsonmr commented 6 years ago

does this help? I guess not.

readelf -d /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libAPFEL.so

Dynamic section at offset 0x17cd58 contains 28 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libgfortran.so.4]
 0x0000000000000001 (NEEDED)             Shared library: [libquadmath.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x000000000000000e (SONAME)             Library soname: [libAPFEL.so.0]
 0x000000000000000f (RPATH)              Library rpath: [/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib:/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.2.0/../../../../x86_64-conda_cos6-linux-gnu/lib/../lib]
 0x000000000000000c (INIT)               0x17750
 0x000000000000000d (FINI)               0xf69e0
 0x0000000000000019 (INIT_ARRAY)         0x37ccc8
 0x000000000000001b (INIT_ARRAYSZ)       48 (bytes)
 0x0000000000000004 (HASH)               0x190
 0x0000000000000005 (STRTAB)             0xa1a0
 0x0000000000000006 (SYMTAB)             0x25f8
 0x000000000000000a (STRSZ)              26368 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x37cf58
 0x0000000000000007 (RELA)               0x11420
 0x0000000000000008 (RELASZ)             25392 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x0000000000000018 (BIND_NOW)           
 0x000000006ffffffb (FLAGS_1)            Flags: NOW
 0x000000006ffffffe (VERNEED)            0x112f0
 0x000000006fffffff (VERNEEDNUM)         5
 0x000000006ffffff0 (VERSYM)             0x108a0
 0x000000006ffffff9 (RELACOUNT)          9
 0x0000000000000000 (NULL)               0x0

Zaharid commented 6 years ago

Can you do strace ldd libAPFEL.so and paste the whole output? I guess it is running out of RAM somewhere.

wilsonmr commented 6 years ago

yes probably, the master nodes have become really restrictive recently

`strace ldd libAPFEL.so` ``` strace ldd /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libAPFEL.so execve("/usr/bin/ldd", ["ldd", "/exports/csce/eddie/ph/groups/rb"...], [/* 101 vars */]) = 0 brk(0) = 0x1123000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f11d3fd0000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=127162, ...}) = 0 mmap(NULL, 127162, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f11d3fb0000 close(3) = 0 open("/lib64/libtinfo.so.5", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\316\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=174520, ...}) = 0 mmap(NULL, 2268928, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f11d3b86000 mprotect(0x7f11d3bab000, 2097152, PROT_NONE) = 0 mmap(0x7f11d3dab000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f11d3dab000 close(3) = 0 open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\16\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=19520, ...}) = 0 mmap(NULL, 2109744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f11d3982000 mprotect(0x7f11d3985000, 2093056, PROT_NONE) = 0 mmap(0x7f11d3b84000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f11d3b84000 close(3) = 0 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \34\2\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=2112384, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f11d3faf000 mmap(NULL, 3936832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f11d35c0000 mprotect(0x7f11d3777000, 2097152, PROT_NONE) = 0 mmap(0x7f11d3977000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b7000) = 0x7f11d3977000 mmap(0x7f11d397d000, 16960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f11d397d000 close(3) = 0 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f11d3fad000 arch_prctl(ARCH_SET_FS, 0x7f11d3fad740) = 0 mprotect(0x7f11d3977000, 16384, PROT_READ) = 0 mprotect(0x7f11d3b84000, 4096, PROT_READ) = 0 mprotect(0x7f11d3dab000, 16384, PROT_READ) = 0 mprotect(0x6dc000, 4096, PROT_READ) = 0 mprotect(0x7f11d3fd1000, 4096, PROT_READ) = 0 munmap(0x7f11d3fb0000, 127162) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 open("/dev/tty", O_RDWR|O_NONBLOCK) = 3 close(3) = 0 brk(0) = 0x1123000 brk(0x1144000) = 0x1144000 brk(0) = 0x1144000 open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=106065056, ...}) = 0 mmap(NULL, 106065056, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f11cd099000 close(3) = 0 brk(0) = 0x1144000 getuid() = 1824569 getgid() = 1608220 geteuid() = 1824569 getegid() = 1608220 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 open("/proc/meminfo", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f11d3fcf000 read(3, "MemTotal: 65562448 kB\nMemF"..., 1024) = 1024 close(3) = 0 munmap(0x7f11d3fcf000, 4096) = 0 rt_sigaction(SIGCHLD, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGCHLD, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, 8) = 0 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, 8) = 0 rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, {SIG_DFL, 
[], SA_RESTORER, 0x7f11d35f5670}, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7f11d35f5670}, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, 8) = 0 uname({sys="Linux", node="login04.ecdf.ed.ac.uk", ...}) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=26254, ...}) = 0 mmap(NULL, 26254, PROT_READ, MAP_SHARED, 3, 0) = 0x7f11d3fc9000 close(3) = 0 stat("/exports/eddie/scratch/s1758208", {st_mode=S_IFDIR|0700, st_size=512, ...}) = 0 stat(".", {st_mode=S_IFDIR|0700, st_size=512, ...}) = 0 getpid() = 123393 getppid() = 123390 getpgrp() = 123390 rt_sigaction(SIGCHLD, {0x441170, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, 8) = 0 getrlimit(RLIMIT_NPROC, {rlim_cur=200, rlim_max=200}) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 open("/usr/bin/ldd", O_RDONLY) = 3 ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffeb2682210) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 read(3, "#! /usr/bin/bash\n# Copyright (C)"..., 80) = 80 lseek(3, 0, SEEK_SET) = 0 getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=4*1024}) = 0 fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor) dup2(3, 255) = 255 close(3) = 0 fcntl(255, F_SETFD, FD_CLOEXEC) = 0 fcntl(255, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) fstat(255, {st_mode=S_IFREG|0755, st_size=5302, ...}) = 0 lseek(255, 0, SEEK_CUR) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 read(255, "#! 
/usr/bin/bash\n# Copyright (C)"..., 5302) = 5302 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=2502, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f11d3fc8000 read(3, "# Locale name alias data base.\n#"..., 4096) = 2502 read(3, "", 4096) = 0 close(3) = 0 munmap(0x7f11d3fc8000, 4096) = 0 open("/usr/share/locale/en_GB.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en_GB.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No 
such file or directory) open("/usr/share/locale/en_GB/LC_MESSAGES/libc.mo", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=1474, ...}) = 0 mmap(NULL, 1474, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f11d3fc8000 close(3) = 0 open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 fcntl(2, F_GETFD) = 0 fcntl(2, F_DUPFD, 10) = 10 fcntl(2, F_GETFD) = 0 fcntl(10, F_SETFD, FD_CLOEXEC) = 0 dup2(3, 2) = 2 close(3) = 0 dup2(10, 2) = 2 fcntl(10, F_GETFD) = 0x1 (flags FD_CLOEXEC) close(10) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 stat("/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libAPFEL.so", {st_mode=S_IFREG|0755, 
st_size=1667376, ...}) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 stat("/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libAPFEL.so", {st_mode=S_IFREG|0755, st_size=1667376, ...}) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 faccessat(AT_FDCWD, "/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libAPFEL.so", R_OK) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 faccessat(AT_FDCWD, "/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libAPFEL.so", X_OK) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 faccessat(AT_FDCWD, "/lib/ld-linux.so.2", X_OK) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 pipe([3, 4]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0 lseek(255, -52, SEEK_CUR) = 5250 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f11d3fada10) = 123394 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigaction(SIGCHLD, {0x441170, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, {0x441170, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, 8) = 0 close(4) = 0 read(3, "", 128) = 0 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=123394, si_status=1, si_utime=0, si_stime=0} --- wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 123394 wait4(-1, 0x7ffeb2680b50, WNOHANG, NULL) = -1 ECHILD (No child processes) rt_sigreturn() = 0 close(3) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGINT, {0x43e5e0, [], SA_RESTORER, 0x7f11d35f5670}, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, {0x43e5e0, [], SA_RESTORER, 0x7f11d35f5670}, 8) = 0 
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 faccessat(AT_FDCWD, "/lib64/ld-linux-x86-64.so.2", X_OK) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 pipe([3, 4]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f11d3fada10) = 123395 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigaction(SIGCHLD, {0x441170, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, {0x441170, [], SA_RESTORER|SA_RESTART, 0x7f11d35f5670}, 8) = 0 close(4) = 0 read(3, "", 128) = 0 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=123395, si_status=1, si_utime=0, si_stime=0} --- wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 123395 wait4(-1, 0x7ffeb2680b50, WNOHANG, NULL) = -1 ECHILD (No child processes) rt_sigreturn() = 0 close(3) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGINT, {0x43e5e0, [], SA_RESTORER, 0x7f11d35f5670}, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f11d35f5670}, {0x43e5e0, [], SA_RESTORER, 0x7f11d35f5670}, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 faccessat(AT_FDCWD, "/libx32/ld-linux-x32.so.2", X_OK) = -1 ENOENT (No such file or directory) rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 365), ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f11d3fc7000 write(1, "\tnot a dynamic executable\n", 26 not a dynamic executable ) = 26 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 read(255, "\nexit $result\n# Local Variables:"..., 5302) = 52 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 exit_group(1) = ? +++ exited with 1 +++ ```

wilsonmr commented 6 years ago

ok I submitted a job that ran ldd libAPFEL.so

linux-vdso.so.1 =>  (0x00007ffc9cdbf000)
    libgfortran.so.4 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libgfortran.so.4 (0x00002af779f69000)
    libquadmath.so.0 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libquadmath.so.0 (0x00002af77a293000)
    libstdc++.so.6 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libstdc++.so.6 (0x00002af77a4c4000)
    libm.so.6 => /lib64/libm.so.6 (0x00002af77a7fe000)
    libc.so.6 => /lib64/libc.so.6 (0x00002af77ab00000)
    libgcc_s.so.1 => /exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest/lib/libgcc_s.so.1 (0x00002af77aec1000)
    /lib64/ld-linux-x86-64.so.2 (0x00002af6c072d000)
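To read the ldd output above at a glance, the resolved libraries can be split into those that come from the conda environment and those that come from the system. This is a minimal sketch, assuming the usual `name => path (address)` layout of ldd lines; `split_by_prefix` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re

# Matches lines such as "libm.so.6 => /lib64/libm.so.6 (0x00002af77a7fe000)".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def split_by_prefix(ldd_output, env_prefix):
    """Split resolved libraries into those inside env_prefix and those outside."""
    inside, outside = {}, {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match is None:
            continue  # e.g. linux-vdso or the dynamic loader line
        name, path = match.groups()
        (inside if path.startswith(env_prefix) else outside)[name] = path
    return inside, outside


ENV = "/exports/csce/eddie/ph/groups/rbm_ml/michael/myconda/envs/apfeltest"
SAMPLE = f"""linux-vdso.so.1 =>  (0x00007ffc9cdbf000)
    libgfortran.so.4 => {ENV}/lib/libgfortran.so.4 (0x00002af779f69000)
    libm.so.6 => /lib64/libm.so.6 (0x00002af77a7fe000)
    /lib64/ld-linux-x86-64.so.2 (0x00002af6c072d000)"""

from_env, from_system = split_by_prefix(SAMPLE, ENV)
print("from conda env:", sorted(from_env))
print("from system:", sorted(from_system))
```

In the output above, libgfortran.so.4 resolves inside the conda environment, which is consistent with APFEL having been built against the conda gfortran 7.2.0 runtime rather than the system 4.8.5 one.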

wilsonmr commented 6 years ago

sorry I'm trying to run that thing you sent but I'm just getting seg faults. Maybe I did something wrong

scarrazza commented 6 years ago

Good, are you getting the segfault before or after the point where the "too many steps!" stop appears?

wilsonmr commented 6 years ago

before (in fact before anything is being outputted):

PDFBasis:: initialised basis: NN31ICFitBasis
Selecting FitBasis: NN31IC
** New Log File Generated. Log 'GAMinimizer' at nmcapfel/nnfit/replica_1/GAMin.log
Minimiser: Genetic Algorithm w/ nodal mutations
PDF: NNPDF_Fit  ErrorType: No Errors booked
/var/spool/gridscheduler/execd/node2b01/job_scripts/2529128: line 31: 23230 Segmentation fault      (core dumped) nnfit $SGE_TASK_ID nmcapfel

scarrazza commented 6 years ago

My guess is that the cluster node cannot allocate the memory required by APFEL. Thanks for the information, I think I'm getting close to your installation setup (libgfortran.so.4, gcc/gfortran 7.2.0 from conda, etc.). I will let you know if I manage to reproduce your crash.

Meanwhile could you please revert to the original APFEL (fixstringleak) and nnpdf (master), and then compile APFEL with sanitizer enabled (apply the diff below and then run autoreconf -i):

diff --git a/configure.ac b/configure.ac
index 2509e11..488a799 100644
--- a/configure.ac
+++ b/configure.ac 
@@ -13,7 +13,7 @@ AC_CONFIG_HEADERS([config/config.h include/APFEL/FortranWrappers.h])

 ## Set Fortran compiler behaviour
 if test "x$FCFLAGS" == "x"; then
-  FCFLAGS="-O3 -Wunused"
+  FCFLAGS="-g -O3 -Wunused -fstack-protector-strong -fsanitize=address"
 fi
 # Try to respect users' Fortran compiler variables
 if test "x$FC" == "x"; then
@@ -129,7 +129,7 @@ fi

 ## Set final FCFLAGS, CXXFLAGS and CPPFLAGS
-AM_CPPFLAGS="$AM_CPPFLAGS -I\$(top_srcdir)/include -I\$(top_builddir)/include"
+AM_CPPFLAGS="$AM_CPPFLAGS  -fstack-protector-strong -I\$(top_srcdir)/include -I\$(top_builddir)/include"
 AM_CPPFLAGS=["$AM_CPPFLAGS -DDATA_PATH="$datadir" -DAPFEL_VERSION="$PACKAGE_VERSION" "]
 AC_SUBST(AM_CPPFLAGS)

And then configure nnpdf in Debug mode (you can easily change that with ccmake), compile and rerun nnfit?

wilsonmr commented 6 years ago

I did that, and now I can't run filter or nnfit. I get:

==53668==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==53668==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v

scarrazza commented 6 years ago

On my side, I have set up exactly the same environment as yours on my laptop and on a server, and both work well (no leaks, no faults).

I think the ASan error you quote is further evidence of low memory (less than 3.7 GB).

I would like to check your cluster submission instructions, but I do not have permission to view the wiki page you linked. Could you please send it to me as a PDF?

Zaharid commented 6 years ago

I'd like to try ASAN on a system where the sysadmins enforce a hard limit. Is this possible?

No. asan requires 20Tb (+ a bit) of virtual memory to properly function. talk to your sysadmins to relax their limits

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55517#c9

wilsonmr commented 6 years ago

I guess this isn't so relevant, but in case you want to play with ulimit: I am in an interactive session on one of the job nodes, and this is the output of ulimit -a:

[s1758208@node1h19 ~]$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514047
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) 4194304
file locks                      (-x) unlimited

I am currently running the same fit on my desktop, using a conda environment almost identical to the one on the cluster. If it succeeds, I will put more effort into recreating the exact install procedure I used on the cluster and see if I can reproduce the error.
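For scale, the address-space reservation that ASan failed to make (0xdfff0001000 bytes, from the error earlier in this thread) can be compared with the `ulimit -v` value on the job node. The sketch below only does that arithmetic; the helper names are hypothetical, and `resource.RLIMIT_AS` is the limit that `ulimit -v` reports in kB.

```python
import resource

KIB = 1024


def request_vs_limit(requested_bytes, limit_kib):
    """Return how many times larger a request is than a `ulimit -v` cap."""
    return requested_bytes / (limit_kib * KIB)


def current_address_space_limit():
    """Soft RLIMIT_AS of this process in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


# Numbers copied from the thread.
requested = 0xDFFF0001000  # 15392894357504 bytes, just under 14 TiB
limit_kib = 4194304        # virtual memory (kbytes, -v), i.e. 4 GiB
ratio = request_vs_limit(requested, limit_kib)
print(f"ASan asked for roughly {ratio:.0f} times the virtual memory limit")
```

The request is more than three thousand times the cap, so under this limit the sanitizer cannot start regardless of how much memory the fit itself needs.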

wilsonmr commented 6 years ago

OK, I got access to a node and ran nnfit in debug mode. Do I need to run something specific to get ASan output?

 **** Producing T0 Predictions with Set NNPDF31_nnlo_as_0118

- Final Positivity Test
- Positivity Vetoed

- Writing fitinfo file...
- Computing arclengths...
- Writing sumrules file...
- Writing preproc file...
- Writing params file...
- Printing grid to file: nmcapfel/nnfit/replica_1/nmcapfel.gridvalues
- Writing out LHAPDF grid: nmcapfel
- Solving DGLAP for LHAPDF grid...
 In odeintns.f:
 too many steps!
Thanks for using LHAPDF 6.2.1. Please make sure to cite the paper:
  Eur.Phys.J. C75 (2015) 3, 132  (http://arxiv.org/abs/1412.7420)

=================================================================
==196341==ERROR: LeakSanitizer: detected memory leaks

Indirect leak of 330288 byte(s) in 768 object(s) allocated from:
    #0 0x7f3ceca63afc in __interceptor_malloc /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libsanitizer/asan/asan_malloc_linux.cc:62
    #1 0x7f3c30e51754 in sqlite3MemMalloc (/exports/csce/eddie/ph/groups/rbm_ml/michael/miniconda/envs/nnpdf-dev/lib/libsqlite3.so.0+0x40754)

Indirect leak of 4968 byte(s) in 15 object(s) allocated from:
    #0 0x7f3ceca63df8 in __interceptor_realloc /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libsanitizer/asan/asan_malloc_linux.cc:75
    #1 0x7f3c30e516f9 in sqlite3MemRealloc (/exports/csce/eddie/ph/groups/rbm_ml/michael/miniconda/envs/nnpdf-dev/lib/libsqlite3.so.0+0x406f9)

SUMMARY: AddressSanitizer: 335256 byte(s) leaked in 783 allocation(s).

wilsonmr commented 6 years ago

It seems the address sanitiser hasn't told us anything, though: the sqlite leaks were already known, and it would have exited earlier if there were a leak in APFEL.

Zaharid commented 6 years ago

Is this error related in any way to the compilation options? As said earlier, my bet is that this error is simply saying that the replicas are not too smooth, due to the small number of iterations. Do you get anything different if you compile apfel without the stack protector?

wilsonmr commented 6 years ago

Trying that now. I changed the flags, replacing -fstack-protector-strong with -fno-stack-protector. Is that correct?

Zaharid commented 6 years ago

Yes, I'd be curious if that makes a difference.

wilsonmr commented 6 years ago

As far as I can tell, and provided I did it correctly, it made no difference.

wilsonmr commented 6 years ago

I ran using the latest conda package, with no compilation on my side, and I get the

 In odeintns.f:
 stepsize underflow in rkqsns

error. I guess I should try the tests @scarrazza was mentioning in the phone conference

Zaharid commented 6 years ago

Can you post the runcard you are using? I may try this as well...

wilsonmr commented 6 years ago

Try this; ignore the description, which is nonsense. I will also try using the new package on my work desktop.

```
#
# Configuration file for NNPDF++
#

############################################################
description: NNPDF3.1 NNLO global with off peak ATLASWZRAP11

############################################################
# frac: training fraction
# ewk: apply ewk k-factors
# sys: systematics treatment (see systypes)
experiments:
  # Fixed target DIS
  - experiment: NMC
    datasets:
      - { dataset: NMCPD, frac: 0.5 }

############################################################
datacuts:
  t0pdfset : NNPDF31_nnlo_as_0118 # PDF set to generate t0 covmat
  q2min : 3.49 # Q2 minimum
  w2min : 12.5 # W2 minimum
  combocuts : NNPDF31 # NNPDF3.0 final kin. cuts
  jetptcut_tev : 0 # jet pt cut for tevatron
  jetptcut_lhc : 0 # jet pt cut for lhc
  wptcut_lhc : 30.0 # Minimum pT for W pT diff distributions
  jetycut_tev : 1e30 # jet rap. cut for tevatron
  jetycut_lhc : 1e30 # jet rap. cut for lhc
  dymasscut_min: 0 # dy inv.mass. min cut
  dymasscut_max: 1e30 # dy inv.mass. max cut
  jetcfactcut : 1e30 # jet cfact. cut

############################################################
theory:
  theoryid: 53 # database id

############################################################
fitting:
  seed : 14532133528 # set the seed for the random generator
  genrep : on # on = generate MC replicas, off = use real data
  rngalgo : 0 # 0 = ranlux, 1 = cmrg, see randomgenerator.cc
  fitmethod: NGA # Minimization algorithm
  ngen : 10 # Maximum number of generations
  nmutants : 80 # Number of mutants for replica
  paramtype: NN
  nnodes : [2,5,3,1]

  # NN23(QED) = sng=0,g=1,v=2,t3=3,ds=4,sp=5,sm=6,(pht=7)
  # EVOL(QED) = sng=0,g=1,v=2,v3=3,v8=4,t3=5,t8=6,(pht=7)
  # EVOLS(QED)= sng=0,g=1,v=2,v8=4,t3=4,t8=5,ds=6,(pht=7)
  # FLVR(QED) = g=0, u=1, ubar=2, d=3, dbar=4, s=5, sbar=6, (pht=7)
  fitbasis: NN31IC # EVOL (7), EVOLQED (8), etc.
  basis:
    # remeber to change the name of PDF accordingly with fitbasis
    # pos: on for NN squared
    # mutsize: mutation size
    # mutprob: mutation probability
    # smallx, largex: preprocessing ranges
    - { fl: sng, pos: off, mutsize: [15], mutprob: [0.05], smallx: [1.04,1.20], largex: [1.45,2.64] }
    - { fl: g, pos: off, mutsize: [15], mutprob: [0.05], smallx: [0.82,1.31], largex: [0.20,6.17] }
    - { fl: v, pos: off, mutsize: [15], mutprob: [0.05], smallx: [0.51,0.71], largex: [1.24,2.80] }
    - { fl: v3, pos: off, mutsize: [15], mutprob: [0.05], smallx: [0.23,0.63], largex: [1.02,3.14] }
    - { fl: v8, pos: off, mutsize: [15], mutprob: [0.05], smallx: [0.53,0.75], largex: [0.70,3.31] }
    - { fl: t3, pos: off, mutsize: [15], mutprob: [0.05], smallx: [-0.45,1.41], largex: [1.78,3.21] }
    - { fl: t8, pos: off, mutsize: [15], mutprob: [0.05], smallx: [0.49,1.32], largex: [1.42,3.13] }
    - { fl: cp, pos: off, mutsize: [15], mutprob: [0.05], smallx: [-0.07,1.13], largex: [1.73,7.37] }

############################################################
stopping:
  stopmethod: LOOKBACK # Stopping method
  lbdelta : 0 # Delta for look-back stopping
  mingen : 0 # Minimum number of generations
  window : 500 # Window for moving average
  minchi2 : 3.5 # Minimum chi2
  minchi2exp: 6.0 # Minimum chi2 for experiments
  nsmear : 200 # Smear for stopping
  deltasm : 200 # Delta smear for stopping
  rv : 2 # Ratio for validation stopping
  rt : 0.5 # Ratio for training stopping
  epsilon : 1e-6 # Gradient epsilon

############################################################
positivity:
  posdatasets:
    - { dataset: POSF2U, poslambda: 1e6 } # Positivity Lagrange Multiplier
    - { dataset: POSF2DW, poslambda: 1e6 }
    - { dataset: POSF2S, poslambda: 1e6 }
    - { dataset: POSFLL, poslambda: 1e6 }
    - { dataset: POSDYU, poslambda: 1e10 }
    - { dataset: POSDYD, poslambda: 1e10 }
    - { dataset: POSDYS, poslambda: 1e10 }

############################################################
closuretest:
  filterseed : 0 # Random seed to be used in filtering data partitions
  fakedata : off # on = to use FAKEPDF to generate pseudo-data
  fakepdf : MSTW2008nlo68cl # Theory input for pseudo-data
  errorsize : 1.0 # uncertainties rescaling
  fakenoise : off # on = to add random fluctuations to pseudo-data
  rancutprob : 1.0 # Fraction of data to be included in the fit
  rancutmethod: 0 # Method to select rancutprob data fraction
  rancuttrnval: off # 0(1) to output training(valiation) chi2 in report
  printpdf4gen: off # To print info on PDFs during minimization

############################################################
lhagrid:
  nx : 100
  xmin: 1e-9
  xmed: 0.1
  xmax: 1.0
  nq : 50
  qmax: 1e5

############################################################
debug: off
```

Zaharid commented 6 years ago

Does anyone know why APFEL prints "Intrinsic charm disabled" on theory 53?

scarrazza commented 6 years ago

The first splash is called by CheckAPFEL which uses a default setup. The second splash screen should quote the variables correctly.

Zaharid commented 6 years ago

I have run the runcard above once with the conda packages and got a segfault after initializing apfel. I have run it a second and third time and it is still running (may not let it finish since I am not in the mood for googling how to use screen).

I then compiled both nnpdf and apfel (master versions of both) with all the debug flags (except -fimplicit-none, which causes apfel to not compile) and got this from ASAN:

Initialization of the DIS module completed in  41.170 s

 Check ... succeded

ASAN:DEADLYSIGNAL
=================================================================
==7772==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffce2068ba8 (pc 0x7f86015639f9 bp 0x7ffc710340c0 sp 0x7ffc710338c8 T0)
==7772==The signal is caused by a READ memory access.
    #0 0x7f86015639f8 in _gfortran_string_len_trim /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libgfortran/intrinsics/string_intrinsics_inc.c:218
    #1 0x7f860386e0a9 in setpdfset_ (/home/zaharik/miniconda3/lib/libAPFEL.so.0+0x280a9)
    #2 0x7f8603939ad1 in APFEL::SetPDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (/home/zaharik/miniconda3/lib/libAPFEL.so.0+0xf3ad1)
    #3 0x7f86be54019c in APFELSingleton::Initialize(NNPDFSettings const&, NNPDF::PDFSet* const&) /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/src/apfelevol.cc:278
    #4 0x7f86be582a10 in FitPDFSet::FitPDFSet(NNPDFSettings const&, FitBasis*) /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/src/fitpdfset.cc:52
    #5 0x7f86be535498 in FitPDFSet* FitPDFSet::Generate<NNPDF::MultiLayerPerceptron, GAMinimizer>(NNPDFSettings const&, FitBasis*) /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/inc/fitpdfset.h:39
    #6 0x7f86be50a614 in main /home/zaharik/nnpdf/nnpdfcpp/src/nnfit/src/nnfit.cc:165
    #7 0x7f860287dd1c in __libc_start_main (/lib64/libc.so.6+0x3b7de1ed1c)
    #8 0x7f86be522d9b  (/home/zaharik/miniconda3/envs/apfel-dbg/bin/nnfit+0x4cd9b)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libgfortran/intrinsics/string_intrinsics_inc.c:218 in _gfortran_string_len_trim
==7772==ABORTING

@scarrazza can you see what the problem is?

scarrazza commented 6 years ago

Thanks, let me try to reproduce that.

Zaharid commented 6 years ago

Annoyingly enough, it doesn't always happen for me, even when I rerun the same thing.

Zaharid commented 6 years ago

I also got:

 Checking APFEL v3.0.2  ...
At line 8 of file DIS/SetProjectileDIS.f
Fortran runtime error: Actual string length is shorter than the declared one for dummy argument 'lept' (8/12)
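
The "(8/12)" in this message reads as an actual argument of 8 characters passed to a dummy argument declared with 12. A routine that declares a fixed length may then read past the end of what the caller supplied, which would be consistent with the ASan report above. As an illustration only, here is a minimal Python sketch of the blank-padding a fixed-length Fortran string expects. `pad_for_fortran` and the sample value are hypothetical, not part of APFEL or nnfit, and this is not the fix discussed in this thread.

```python
def pad_for_fortran(value, declared_len):
    """Blank-pad value to declared_len characters, as a character*N dummy expects."""
    if len(value) > declared_len:
        # A longer string would be silently truncated by the callee; refuse instead.
        raise ValueError(f"{value!r} is longer than the declared length {declared_len}")
    # Fortran fixed-length strings are padded on the right with spaces.
    return value.ljust(declared_len)


# Hypothetical 8-character argument padded to a declared length of 12.
padded = pad_for_fortran("electron", 12)
```

With the padding applied, the caller hands over exactly as many characters as the dummy argument declares, so the run-time length check has nothing to complain about.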

scarrazza commented 6 years ago

Could you please post here all the gfortran flags you are using?

Zaharid commented 6 years ago

I have

$ echo $FFLAGS
-fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-all -fno-plt -Og -g -Wall -Wextra -fcheck=all -fbacktrace -fvar-tracking-assignments -pipe

this is like DEBUG_FFLAGS in a conda environment but removing -fimplicit-none.
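
The flag edits mentioned in this thread (dropping -fimplicit-none, swapping -fstack-protector-strong for -fno-stack-protector) can be sketched as a small helper. This is a rough Python sketch under the assumption that the flags live in a single space-separated string; `adjust_flags` is a hypothetical name and the sample string is a shortened, made-up example rather than the full FFLAGS above.

```python
import shlex


def adjust_flags(flags, drop=(), replace=None):
    """Return a flag string with some flags removed and others swapped."""
    replace = replace or {}
    kept = []
    for flag in shlex.split(flags):
        if flag in drop:
            continue  # e.g. a flag that stops the build from compiling
        kept.append(replace.get(flag, flag))  # swap the flag if a replacement is given
    return shlex.join(kept)


# Hypothetical, shortened flag string for illustration.
sample = "-Og -g -fimplicit-none -fstack-protector-strong -fcheck=all"
cleaned = adjust_flags(
    sample,
    drop={"-fimplicit-none"},
    replace={"-fstack-protector-strong": "-fno-stack-protector"},
)
```

The result could then be exported as FFLAGS before configuring the build. Flag order is preserved, which matters when later options override earlier ones.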

scarrazza commented 6 years ago

Great, thanks. I can reproduce your error messages and looks like my PR is just 1% of the fix, so I have to extend the fix to all places where the dummy string size is set to a custom number. Moreover the compilation warnings look very bad.
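
Locating the places where a dummy string size is set to a custom number could start with a quick scan of the Fortran sources. The following is a rough Python sketch, assuming the declarations use the old-style `character*N name` form. `find_fixed_length_strings` is a hypothetical helper and the sample source is made up for illustration, not copied from APFEL.

```python
import re

# Matches old-style fixed-length declarations such as "character*12 lept".
# Assumption: other spellings, e.g. "character(len=12)", are not handled here.
# Assumed-length declarations, "character*(*)", are deliberately not matched.
FIXED_LEN_DECL = re.compile(r"^\s*character\*(\d+)\s+(.+)$", re.IGNORECASE)


def find_fixed_length_strings(source):
    """Return (line_number, length, names) for each fixed-length character declaration."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("!", 1)[0]  # ignore trailing "!" comments
        match = FIXED_LEN_DECL.match(code)
        if match:
            names = [name.strip() for name in match.group(2).split(",")]
            hits.append((lineno, int(match.group(1)), names))
    return hits


# Made-up sample source, loosely modelled on the error message in this thread.
sample_source = """\
      subroutine SetProjectileDIS(lept)
      implicit none
      character*12 lept
      character*(*) other
      end
"""
hits = find_fixed_length_strings(sample_source)
```

Each hit is only a candidate: whether a declaration is a dummy argument, and whether callers pass shorter strings, still has to be checked by hand or with -fcheck=all.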

scarrazza commented 6 years ago

We managed to isolate the issue and confirm that there is a memory issue, see https://github.com/scarrazza/apfel/pull/11. However the fix, if any, is not trivial.

scarrazza commented 6 years ago

See discussion in #173.