gflow / GFlow

Software for modeling circuit theory-based connectivity
GNU General Public License v3.0

mpiexec noticed that process rank 0 with PID 1969 on node ubuntu exited on signal 9 (Killed) #11

Closed gioman closed 7 years ago

gioman commented 7 years ago

On Ubuntu 16.10, on a machine with 16 GB of RAM, launching the "execute_example.sh" script gives:

gio@ubuntu:~/GFlow$ sh execute_example.sh
/usr/bin/mpiexec
Tue Dec 13 13:12:40 WET 2016
--------------------------------------------------------------------------
[[1532,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ubuntu

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Tue Dec 13 13:12:40 2016 >> Effective resistance will be written to ./R_eff.csv.
Tue Dec 13 13:12:40 2016 >> Simulation will converge at 0.9
Tue Dec 13 13:12:42 2016 >> (rows,cols) = (6611,10493)
[ubuntu:01967] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ubuntu:01967] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Tue Dec 13 13:13:00 2016 >> Removed 11 islands (80259 cells).
Tue Dec 13 13:13:00 2016 >> 587 points in nodes
Tue Dec 13 13:13:00 2016 >> Max distance: 148148.15 pixels
Tue Dec 13 13:13:00 2016 >> 171991 pairs generated.  0 skipped.
Tue Dec 13 13:13:00 2016 >> Number of unknowns: 34167982
Tue Dec 13 13:13:18 2016 >> Solving pair 0 (1 of 171991): 204[2505,8923] to 510[4405,7968].  574.16 Km apart
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 1969 on node ubuntu exited on signal 9 (Killed).
--------------------------------------------------------------------------

@FrankVal

gioman commented 7 years ago

seen this in logs

Dec 13 13:13:32 ubuntu kernel: [ 86.041443] Out of memory: Kill process 1969 (gflow.x) score 328 or sacrifice child
Dec 13 13:13:32 ubuntu kernel: [ 86.041479] Killed process 1969 (gflow.x) total-vm:6002488kB, anon-rss:792508kB, file-rss:292kB, shmem-rss:0kB
Dec 13 13:13:32 ubuntu kernel: [ 86.106863] oom_reaper: reaped process 1969 (gflow.x), now anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
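
For readers who want to put the kernel message above into more familiar units, here is a minimal sketch that pulls the sizes out of the "Killed process" line and prints them in GiB. The function name `parse_oom_line` is made up for this illustration. As far as the kernel's field names go, total-vm is the virtual address space of the process and anon-rss is its resident anonymous memory. The figures describe only the one process that was killed, so with several MPI ranks running they are probably not the whole picture.

```python
import re

# Kernel OOM-killer line copied from the log above (timestamp prefix dropped).
LINE = ("Killed process 1969 (gflow.x) total-vm:6002488kB, "
        "anon-rss:792508kB, file-rss:292kB, shmem-rss:0kB")


def parse_oom_line(line):
    """Return (pid, name, {field: size_in_kB}) for a 'Killed process' line.

    Hypothetical helper written for this note, not part of GFlow.
    """
    head = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if head is None:
        raise ValueError("not an OOM 'Killed process' line")
    # Every size field has the form name:NNNkB.
    sizes = {k: int(v) for k, v in re.findall(r"([\w-]+):(\d+)kB", line)}
    return int(head.group(1)), head.group(2), sizes


pid, name, sizes = parse_oom_line(LINE)
for field, kb in sizes.items():
    print(f"{name}[{pid}] {field}: {kb / 1024**2:.2f} GiB")
```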

pbleonard commented 7 years ago

Interesting...I just tested on 16Gb yesterday and no memory errors.

Also, yes, I have had problems using earlier versions of PETSc (< 3.7.3), which is the default dev version with Ubuntu 16.10, but I will still dig some. It should work with LTR and PETSc 3.7.3.

Will try some things later this AM


gioman commented 7 years ago

Also, yes, I have had problems using earlier versions of PETSc (< 3.7.3), which is the default dev version with Ubuntu 16.10, but I will still dig some. It should work with LTR and PETSc 3.7.3.

I will try to install/compile petsc >= 3.7.3 on Ubuntu 16.04 and let you know.

gioman commented 7 years ago

@Pbleonard

Interesting...I just tested on 16Gb yesterday and no memory errors.

will test again on another machine asap.

Meanwhile, I'm also testing on a MacBook Pro (8 GB of RAM, 2.6 GHz i5): installing the dependencies and compiling went fine. I then started the execute_example script with your sample data:

Giovannis-MBP:GFlow giovanni$ sh execute_example.sh
/usr/local/bin/mpiexec
Tue Dec 13 16:46:36 WET 2016
Tue Dec 13 16:46:37 2016 >> Effective resistance will be written to ./R_eff.csv.
Tue Dec 13 16:46:37 2016 >> Simulation will converge at 0.9
Tue Dec 13 16:46:39 2016 >> (rows,cols) = (6611,10493)
Tue Dec 13 16:46:45 2016 >> Removed 11 islands (80259 cells).
Tue Dec 13 16:46:45 2016 >> 587 points in nodes
Tue Dec 13 16:46:45 2016 >> Max distance: 148148.15 pixels
Tue Dec 13 16:46:45 2016 >> 171991 pairs generated.  0 skipped.
Tue Dec 13 16:46:45 2016 >> Number of unknowns: 34167982
Tue Dec 13 16:47:27 2016 >> Solving pair 0 (1 of 171991): 422[3926,2380] to 564[5069,8782]. 1755.87 Km apart
Tue Dec 13 17:15:55 2016 >> R_eff = 422,564,79.544963
Tue Dec 13 17:15:55 2016 >> Estimated time remaining: 81588:16:19
Tue Dec 13 17:15:55 2016 >> Solving pair 1 (2 of 171991): 305[3192,6667] to 524[4510,1132]. 1536.23 Km apart
Tue Dec 13 17:18:28 2016 >> Solution to iteration 0 discarded.
Tue Dec 13 17:18:40 2016 >> convergence-factor = 0.000000e+00 (0-N)

It took almost 30 minutes to solve the first pair. Is that expected?

Thanks in advance.
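
The numbers in the log above can be cross-checked with a little arithmetic. This is a rough sketch under two assumptions that the thread does not confirm: that every unordered pair of the 587 nodes is solved, and that the remaining time is simply the remaining pairs multiplied by the time taken by the first pair. The timestamps are whole seconds taken from the log, so the result only approximates the logged estimate of 81588:16:19.

```python
from datetime import datetime

# Values copied from the MacBook Pro log above.
nodes = 587
started = datetime(2016, 12, 13, 16, 47, 27)   # "Solving pair 0"
finished = datetime(2016, 12, 13, 17, 15, 55)  # "R_eff = 422,564,79.544963"

# Assumption: all unordered pairs of nodes are solved, n * (n - 1) / 2.
pairs = nodes * (nodes - 1) // 2

# Assumption: every remaining pair takes as long as the first one.
seconds_per_pair = (finished - started).total_seconds()
remaining_hours = (pairs - 1) * seconds_per_pair / 3600

print(f"pairs: {pairs}")
print(f"first pair: {seconds_per_pair / 60:.1f} min")
print(f"remaining: {remaining_hours:.0f} h "
      f"(~{remaining_hours / 24 / 365:.1f} years)")
```

The pair count matches the "171991 pairs generated" line, and the extrapolation lands within a few hours of the logged estimate. That fits the later comments that the example inputs are too large for this hardware.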

pbleonard commented 7 years ago

Out of office today. How many cores are you using in the example script? I can send you smaller inputs to test tomorrow, but you may also want to check the benchmarks in the manuscript for a rough estimate. The example inputs are rather large for that hardware.


eduffy commented 7 years ago

Hi @gioman - When I use all 4 cores on my MacBook it takes about 20 to 25 minutes per iteration on the sample data, so your runtime is not unexpected. If you run sysctl -n hw.ncpu, what's the output? This is the number of CPUs on your system.
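
The sysctl command above is specific to macOS. As a portable alternative, the following minimal sketch reports the logical CPU count from Python's standard library. It counts logical CPUs, so on a machine with hyper-threading the value may be higher than the number of physical cores.

```python
import os

# Logical CPUs visible to the operating system; None if it cannot be determined.
logical_cpus = os.cpu_count()
print(f"logical CPUs: {logical_cpus}")
```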

xgirouxb commented 7 years ago

Hello all, I have been really excited about this software since the paper came out. I decided to take it for a test drive yesterday on Ubuntu 16.10 after you updated the Linux instructions. I get the same error message when I launch the example on 2 cores (4 virtual cores) with 10 GB of RAM:

"mpiexec noticed that process rank 1 with PID 5374 on node ubuntu exited on signal 9 (Killed)"

Admittedly it's not the ideal rig for such a large resistance grid, but thought I should share in case this is a memory issue.

pbleonard commented 7 years ago

@xgirouxb I think the example inputs are rather large for that hardware. I have committed a new release with smaller inputs. Please use this for testing on desktop.

gioman commented 7 years ago

@Pbleonard Hi. We finally got access to an Ubuntu 16.10 machine with 12 cores and 32 GB of RAM and tried to run GFlow with a dataset of ours: a cost map of (rows,cols) = (2225,2727) at 50-meter resolution, with 210 nodes:

gio@gio:~/GFlow$ sh ./execute_example.sh 
/usr/bin/mpiexec
sex dez 16 12:30:33 WET 2016
Fri Dec 16 12:30:33 2016 >> Effective resistance will be written to ./R_eff.csv.
Fri Dec 16 12:30:33 2016 >> Simulation will converge at 0.9
Fri Dec 16 12:30:33 2016 >> (rows,cols) = (2225,2727)
Fri Dec 16 12:30:35 2016 >> Removed 0 islands (0 cells).
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.3, Jul, 24, 2016 
[0]PETSC ERROR: ./gflow.x on a x86_64-linux-gnu-real named gio by gio Fri Dec 16 12:30:33 2016
[0]PETSC ERROR: Configure options --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --with-silent-rules=0 --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --with-maintainer-mode=0 --with-dependency-tracking=0 --with-debugging=0 --shared-library-extension=_real --with-hypre=1 --with-hypre-dir=/usr --with-clanguage=C++ --with-shared-libraries --with-pic=1 --useThreads=0 --with-fortran-interfaces=1 --with-mpi-dir=/usr/lib/openmpi --with-blas-lib=-lblas --with-lapack-lib=-llapack --with-blacs=1 --with-blacs-lib="-lblacsCinit-openmpi -lblacs-openmpi" --with-scalapack=1 --with-scalapack-lib=-lscalapack-openmpi --with-mumps=1 --with-mumps-include="[]" --with-mumps-lib="-ldmumps -lzmumps -lsmumps -lcmumps -lmumps_common -lpord" --with-suitesparse=1 --with-suitesparse-include=/usr/include/suitesparse --with-suitesparse-lib="-lumfpack -lamd -lcholmod -lklu" --with-spooles=1 --with-spooles-include=/usr/include/spooles --with-spooles-lib=-lspooles --with-ptscotch=1 --with-ptscotch-include=/usr/include/scotch --with-ptscotch-lib="-lptesmumps -lptscotch -lptscotcherr" --with-fftw=1 --with-fftw-include="[]" --with-fftw-lib="-lfftw3 -lfftw3_mpi" --with-superlu=0 --CXX_LINKER_FLAGS=-Wl,--no-as-needed --prefix=/usr/lib/petscdir/3.7.3/x86_64-linux-gnu-real PETSC_DIR=/build/petsc-fA70UI/petsc-3.7.3.dfsg1 --PETSC_ARCH=x86_64-linux-gnu-real CFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC" CXXFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC" FCFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. -fstack-protector-strong -fPIC" FFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. 
-fstack-protector-strong -fPIC" CPPFLAGS="-Wdate-time -D_FORTIFY_SOURCE=2" LDFLAGS="-Wl,-Bsymbolic-functions -Wl,-z,relro -fPIC" MAKEFLAGS=w
[0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Any hint would be appreciated. Regards.

gioman commented 7 years ago

@Pbleonard I got a crash on the same machine with your new, smaller dataset as well. It seems the node coordinates do not overlap the cost map.
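
A mismatch like this can be caught before launching a run. The sketch below is a hedged illustration, not GFlow's own check: the function `out_of_range` is hypothetical, and it assumes 0-based (row, col) cell coordinates, which is what the ">=" comparisons in the run's messages suggest. The grid size and the three sample points are copied from the output of this run.

```python
def out_of_range(points, rows, cols):
    """Return (index, row, col) for every point that falls outside the grid.

    Assumes 0-based (row, col) cell coordinates, so valid rows are
    0..rows-1 and valid columns are 0..cols-1. Hypothetical helper.
    """
    bad = []
    for index, (row, col) in enumerate(points, start=1):
        if not (0 <= row < rows and 0 <= col < cols):
            bad.append((index, row, col))
    return bad


# Grid size and a few points copied from this run; not the full node list.
rows, cols = 2845, 2483
points = [(311, 9200), (1825, 1619), (2845, 2251)]

for index, row, col in out_of_range(points, rows, cols):
    print(f"Point #{index} ({row},{col}) is outside a {rows}x{cols} grid")
```

The run still reports 587 points in nodes, the same count as with the original 6611 x 10493 grid, which suggests the node file was not replaced along with the smaller cost map. The full output of the run follows.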

gio@gio:~/GFlow$ sh ./execute_example.sh 
/usr/bin/mpiexec
sex dez 16 12:45:04 WET 2016
Fri Dec 16 12:45:04 2016 >> Effective resistance will be written to ./R_eff.csv.
Fri Dec 16 12:45:04 2016 >> Simulation will converge at 0.9
Fri Dec 16 12:45:04 2016 >> (rows,cols) = (2845,2483)
Fri Dec 16 12:45:06 2016 >> Removed 1 islands (9829 cells).
Fri Dec 16 12:45:06 2016 >> 587 points in nodes
Point #1 (311,9200) is out of range. (9200 >= 2483)
Point #2 (314,9297) is out of range. (9297 >= 2483)
Point #3 (399,8641) is out of range. (8641 >= 2483)
Point #4 (420,8996) is out of range. (8996 >= 2483)
Point #5 (477,8522) is out of range. (8522 >= 2483)
Point #5 (477,8522) is invalid.
Point #6 (495,8855) is out of range. (8855 >= 2483)
Point #7 (517,8841) is out of range. (8841 >= 2483)
Point #8 (577,8967) is out of range. (8967 >= 2483)
Point #9 (638,9327) is out of range. (9327 >= 2483)
Point #10 (680,8375) is out of range. (8375 >= 2483)
Point #10 (680,8375) is invalid.
Point #11 (722,9083) is out of range. (9083 >= 2483)
Point #12 (731,8878) is out of range. (8878 >= 2483)
Point #13 (746,8600) is out of range. (8600 >= 2483)
Point #14 (769,9290) is out of range. (9290 >= 2483)
Point #15 (791,8291) is out of range. (8291 >= 2483)
Point #15 (791,8291) is invalid.
Point #16 (794,8483) is out of range. (8483 >= 2483)
Point #17 (824,9045) is out of range. (9045 >= 2483)
Point #18 (839,9687) is out of range. (9687 >= 2483)
Point #19 (843,9533) is out of range. (9533 >= 2483)
Point #20 (856,8210) is out of range. (8210 >= 2483)
Point #20 (856,8210) is invalid.
Point #21 (860,9718) is out of range. (9718 >= 2483)
Point #21 (860,9718) is invalid.
Point #22 (865,8124) is out of range. (8124 >= 2483)
Point #22 (865,8124) is invalid.
Point #23 (884,9223) is out of range. (9223 >= 2483)
Point #24 (934,8688) is out of range. (8688 >= 2483)
Point #25 (944,8997) is out of range. (8997 >= 2483)
Point #26 (982,7472) is out of range. (7472 >= 2483)
Point #26 (982,7472) is invalid.
Point #27 (994,8470) is out of range. (8470 >= 2483)
Point #28 (1027,9459) is out of range. (9459 >= 2483)
Point #28 (1027,9459) is invalid.
Point #29 (1034,9779) is out of range. (9779 >= 2483)
Point #29 (1034,9779) is invalid.
Point #30 (1050,7731) is out of range. (7731 >= 2483)
Point #30 (1050,7731) is invalid.
Point #31 (1066,8783) is out of range. (8783 >= 2483)
Point #32 (1086,9282) is out of range. (9282 >= 2483)
Point #33 (1095,9114) is out of range. (9114 >= 2483)
Point #34 (1102,8414) is out of range. (8414 >= 2483)
Point #35 (1117,5063) is out of range. (5063 >= 2483)
Point #35 (1117,5063) is invalid.
Point #36 (1123,8522) is out of range. (8522 >= 2483)
Point #37 (1136,8101) is out of range. (8101 >= 2483)
Point #38 (1139,9107) is out of range. (9107 >= 2483)
Point #39 (1148,8955) is out of range. (8955 >= 2483)
Point #40 (1155,9824) is out of range. (9824 >= 2483)
Point #40 (1155,9824) is invalid.
Point #41 (1170,8355) is out of range. (8355 >= 2483)
Point #42 (1185,8601) is out of range. (8601 >= 2483)
Point #43 (1185,9474) is out of range. (9474 >= 2483)
Point #43 (1185,9474) is invalid.
Point #44 (1203,9311) is out of range. (9311 >= 2483)
Point #45 (1209,7847) is out of range. (7847 >= 2483)
Point #46 (1213,9009) is out of range. (9009 >= 2483)
Point #47 (1242,7815) is out of range. (7815 >= 2483)
Point #48 (1246,9529) is out of range. (9529 >= 2483)
Point #48 (1246,9529) is invalid.
Point #49 (1261,7668) is out of range. (7668 >= 2483)
Point #49 (1261,7668) is invalid.
Point #50 (1267,6806) is out of range. (6806 >= 2483)
Point #50 (1267,6806) is invalid.
Point #51 (1276,8556) is out of range. (8556 >= 2483)
Point #52 (1292,8060) is out of range. (8060 >= 2483)
Point #53 (1305,9427) is out of range. (9427 >= 2483)
Point #53 (1305,9427) is invalid.
Point #54 (1311,8711) is out of range. (8711 >= 2483)
Point #55 (1333,4673) is out of range. (4673 >= 2483)
Point #55 (1333,4673) is invalid.
Point #56 (1338,6975) is out of range. (6975 >= 2483)
Point #56 (1338,6975) is invalid.
Point #57 (1359,5124) is out of range. (5124 >= 2483)
Point #57 (1359,5124) is invalid.
Point #58 (1365,9790) is out of range. (9790 >= 2483)
Point #58 (1365,9790) is invalid.
Point #59 (1405,9450) is out of range. (9450 >= 2483)
Point #59 (1405,9450) is invalid.
Point #60 (1408,9286) is out of range. (9286 >= 2483)
Point #60 (1408,9286) is invalid.
Point #61 (1409,8255) is out of range. (8255 >= 2483)
Point #62 (1430,5004) is out of range. (5004 >= 2483)
Point #62 (1430,5004) is invalid.
Point #63 (1431,4505) is out of range. (4505 >= 2483)
Point #63 (1431,4505) is invalid.
Point #64 (1432,10076) is out of range. (10076 >= 2483)
Point #64 (1432,10076) is invalid.
Point #65 (1454,7998) is out of range. (7998 >= 2483)
Point #66 (1468,9920) is out of range. (9920 >= 2483)
Point #66 (1468,9920) is invalid.
Point #67 (1475,6717) is out of range. (6717 >= 2483)
Point #67 (1475,6717) is invalid.
Point #68 (1478,7732) is out of range. (7732 >= 2483)
Point #69 (1478,8597) is out of range. (8597 >= 2483)
Point #70 (1489,7544) is out of range. (7544 >= 2483)
Point #70 (1489,7544) is invalid.
Point #71 (1497,7259) is out of range. (7259 >= 2483)
Point #71 (1497,7259) is invalid.
Point #72 (1500,9199) is out of range. (9199 >= 2483)
Point #72 (1500,9199) is invalid.
Point #73 (1508,7090) is out of range. (7090 >= 2483)
Point #73 (1508,7090) is invalid.
Point #74 (1517,8707) is out of range. (8707 >= 2483)
Point #75 (1523,8188) is out of range. (8188 >= 2483)
Point #76 (1530,8448) is out of range. (8448 >= 2483)
Point #77 (1539,4714) is out of range. (4714 >= 2483)
Point #77 (1539,4714) is invalid.
Point #78 (1539,9979) is out of range. (9979 >= 2483)
Point #78 (1539,9979) is invalid.
Point #79 (1547,8876) is out of range. (8876 >= 2483)
Point #80 (1550,4433) is out of range. (4433 >= 2483)
Point #80 (1550,4433) is invalid.
Point #81 (1579,8315) is out of range. (8315 >= 2483)
Point #82 (1587,7327) is out of range. (7327 >= 2483)
Point #82 (1587,7327) is invalid.
Point #83 (1598,4964) is out of range. (4964 >= 2483)
Point #83 (1598,4964) is invalid.
Point #84 (1606,9349) is out of range. (9349 >= 2483)
Point #84 (1606,9349) is invalid.
Point #85 (1618,7753) is out of range. (7753 >= 2483)
Point #86 (1627,9972) is out of range. (9972 >= 2483)
Point #86 (1627,9972) is invalid.
Point #87 (1630,6787) is out of range. (6787 >= 2483)
Point #87 (1630,6787) is invalid.
Point #88 (1638,9609) is out of range. (9609 >= 2483)
Point #88 (1638,9609) is invalid.
Point #89 (1639,5302) is out of range. (5302 >= 2483)
Point #90 (1650,8076) is out of range. (8076 >= 2483)
Point #91 (1661,6358) is out of range. (6358 >= 2483)
Point #92 (1662,7911) is out of range. (7911 >= 2483)
Point #93 (1675,9746) is out of range. (9746 >= 2483)
Point #93 (1675,9746) is invalid.
Point #94 (1693,8397) is out of range. (8397 >= 2483)
Point #95 (1697,9912) is out of range. (9912 >= 2483)
Point #95 (1697,9912) is invalid.
Point #96 (1701,8848) is out of range. (8848 >= 2483)
Point #97 (1704,6676) is out of range. (6676 >= 2483)
Point #97 (1704,6676) is invalid.
Point #98 (1711,10332) is out of range. (10332 >= 2483)
Point #99 (1715,9987) is out of range. (9987 >= 2483)
Point #99 (1715,9987) is invalid.
Point #100 (1720,6526) is out of range. (6526 >= 2483)
Point #100 (1720,6526) is invalid.
Point #101 (1730,7803) is out of range. (7803 >= 2483)
Point #102 (1732,8396) is out of range. (8396 >= 2483)
Point #103 (1735,7141) is out of range. (7141 >= 2483)
Point #103 (1735,7141) is invalid.
Point #104 (1737,6432) is out of range. (6432 >= 2483)
Point #104 (1737,6432) is invalid.
Point #105 (1750,10336) is out of range. (10336 >= 2483)
Point #106 (1769,7977) is out of range. (7977 >= 2483)
Point #107 (1770,5588) is out of range. (5588 >= 2483)
Point #108 (1782,8274) is out of range. (8274 >= 2483)
Point #109 (1794,8547) is out of range. (8547 >= 2483)
Point #110 (1806,10002) is out of range. (10002 >= 2483)
Point #110 (1806,10002) is invalid.
Point #111 (1820,8502) is out of range. (8502 >= 2483)
Point #112 (1825,1619) is invalid.
Point #113 (1827,7658) is out of range. (7658 >= 2483)
Point #114 (1829,6289) is out of range. (6289 >= 2483)
Point #115 (1830,7413) is out of range. (7413 >= 2483)
Point #115 (1830,7413) is invalid.
Point #116 (1867,7587) is out of range. (7587 >= 2483)
Point #116 (1867,7587) is invalid.
Point #117 (1890,10280) is out of range. (10280 >= 2483)
Point #118 (1893,8468) is out of range. (8468 >= 2483)
Point #119 (1917,4974) is out of range. (4974 >= 2483)
Point #119 (1917,4974) is invalid.
Point #120 (1919,6409) is out of range. (6409 >= 2483)
Point #120 (1919,6409) is invalid.
Point #121 (1926,7074) is out of range. (7074 >= 2483)
Point #121 (1926,7074) is invalid.
Point #122 (1928,8269) is out of range. (8269 >= 2483)
Point #123 (1934,5271) is out of range. (5271 >= 2483)
Point #124 (1963,3588) is out of range. (3588 >= 2483)
Point #125 (1967,10104) is out of range. (10104 >= 2483)
Point #125 (1967,10104) is invalid.
Point #126 (1975,8185) is out of range. (8185 >= 2483)
Point #127 (1981,7266) is out of range. (7266 >= 2483)
Point #127 (1981,7266) is invalid.
Point #128 (1984,7522) is out of range. (7522 >= 2483)
Point #128 (1984,7522) is invalid.
Point #129 (1988,5306) is out of range. (5306 >= 2483)
Point #130 (1999,8093) is out of range. (8093 >= 2483)
Point #131 (2000,8369) is out of range. (8369 >= 2483)
Point #132 (2023,6129) is out of range. (6129 >= 2483)
Point #133 (2024,4464) is out of range. (4464 >= 2483)
Point #133 (2024,4464) is invalid.
Point #134 (2038,4716) is out of range. (4716 >= 2483)
Point #134 (2038,4716) is invalid.
Point #135 (2045,6372) is out of range. (6372 >= 2483)
Point #135 (2045,6372) is invalid.
Point #136 (2051,4311) is out of range. (4311 >= 2483)
Point #136 (2051,4311) is invalid.
Point #137 (2065,8232) is out of range. (8232 >= 2483)
Point #138 (2075,5075) is out of range. (5075 >= 2483)
Point #138 (2075,5075) is invalid.
Point #139 (2082,9011) is out of range. (9011 >= 2483)
Point #139 (2082,9011) is invalid.
Point #140 (2088,8023) is out of range. (8023 >= 2483)
Point #141 (2089,9160) is out of range. (9160 >= 2483)
Point #141 (2089,9160) is invalid.
Point #142 (2093,6481) is out of range. (6481 >= 2483)
Point #142 (2093,6481) is invalid.
Point #143 (2094,7223) is out of range. (7223 >= 2483)
Point #143 (2094,7223) is invalid.
Point #144 (2095,8239) is out of range. (8239 >= 2483)
Point #145 (2124,7900) is out of range. (7900 >= 2483)
Point #146 (2125,8443) is out of range. (8443 >= 2483)
Point #147 (2128,10004) is out of range. (10004 >= 2483)
Point #147 (2128,10004) is invalid.
Point #148 (2140,9905) is out of range. (9905 >= 2483)
Point #148 (2140,9905) is invalid.
Point #149 (2141,8390) is out of range. (8390 >= 2483)
Point #150 (2146,6246) is out of range. (6246 >= 2483)
Point #151 (2165,7944) is out of range. (7944 >= 2483)
Point #152 (2171,7011) is out of range. (7011 >= 2483)
Point #152 (2171,7011) is invalid.
Point #153 (2171,8269) is out of range. (8269 >= 2483)
Point #154 (2172,4051) is out of range. (4051 >= 2483)
Point #154 (2172,4051) is invalid.
Point #155 (2177,7374) is out of range. (7374 >= 2483)
Point #155 (2177,7374) is invalid.
Point #156 (2180,9282) is out of range. (9282 >= 2483)
Point #156 (2180,9282) is invalid.
Point #157 (2182,6460) is out of range. (6460 >= 2483)
Point #157 (2182,6460) is invalid.
Point #158 (2190,4476) is out of range. (4476 >= 2483)
Point #158 (2190,4476) is invalid.
Point #159 (2203,6396) is out of range. (6396 >= 2483)
Point #159 (2203,6396) is invalid.
Point #160 (2204,5219) is out of range. (5219 >= 2483)
Point #160 (2204,5219) is invalid.
Point #161 (2207,8257) is out of range. (8257 >= 2483)
Point #162 (2228,7801) is out of range. (7801 >= 2483)
Point #163 (2233,6266) is out of range. (6266 >= 2483)
Point #164 (2256,8108) is out of range. (8108 >= 2483)
Point #165 (2256,9219) is out of range. (9219 >= 2483)
Point #165 (2256,9219) is invalid.
Point #166 (2262,9492) is out of range. (9492 >= 2483)
Point #166 (2262,9492) is invalid.
Point #167 (2273,4935) is out of range. (4935 >= 2483)
Point #167 (2273,4935) is invalid.
Point #168 (2280,4099) is out of range. (4099 >= 2483)
Point #168 (2280,4099) is invalid.
Point #169 (2284,8047) is out of range. (8047 >= 2483)
Point #170 (2291,8027) is out of range. (8027 >= 2483)
Point #171 (2308,6367) is out of range. (6367 >= 2483)
Point #171 (2308,6367) is invalid.
Point #172 (2310,6463) is out of range. (6463 >= 2483)
Point #172 (2310,6463) is invalid.
Point #173 (2313,9693) is out of range. (9693 >= 2483)
Point #173 (2313,9693) is invalid.
Point #174 (2322,9826) is out of range. (9826 >= 2483)
Point #174 (2322,9826) is invalid.
Point #175 (2332,7147) is out of range. (7147 >= 2483)
Point #175 (2332,7147) is invalid.
Point #176 (2336,8981) is out of range. (8981 >= 2483)
Point #176 (2336,8981) is invalid.
Point #177 (2337,7290) is out of range. (7290 >= 2483)
Point #177 (2337,7290) is invalid.
Point #178 (2345,1940) is invalid.
Point #179 (2346,7677) is out of range. (7677 >= 2483)
Point #179 (2346,7677) is invalid.
Point #180 (2347,9629) is out of range. (9629 >= 2483)
Point #180 (2347,9629) is invalid.
Point #181 (2348,8170) is out of range. (8170 >= 2483)
Point #182 (2358,7843) is out of range. (7843 >= 2483)
Point #183 (2361,6451) is out of range. (6451 >= 2483)
Point #183 (2361,6451) is invalid.
Point #184 (2371,8548) is out of range. (8548 >= 2483)
Point #185 (2374,6566) is out of range. (6566 >= 2483)
Point #185 (2374,6566) is invalid.
Point #186 (2390,6372) is out of range. (6372 >= 2483)
Point #187 (2391,8110) is out of range. (8110 >= 2483)
Point #188 (2414,4248) is out of range. (4248 >= 2483)
Point #188 (2414,4248) is invalid.
Point #189 (2436,4696) is out of range. (4696 >= 2483)
Point #189 (2436,4696) is invalid.
Point #190 (2444,4096) is out of range. (4096 >= 2483)
Point #190 (2444,4096) is invalid.
Point #191 (2444,8995) is out of range. (8995 >= 2483)
Point #191 (2444,8995) is invalid.
Point #192 (2446,9088) is out of range. (9088 >= 2483)
Point #192 (2446,9088) is invalid.
Point #193 (2449,7873) is out of range. (7873 >= 2483)
Point #193 (2449,7873) is invalid.
Point #194 (2454,7149) is out of range. (7149 >= 2483)
Point #194 (2454,7149) is invalid.
Point #195 (2456,3533) is out of range. (3533 >= 2483)
Point #196 (2464,7976) is out of range. (7976 >= 2483)
Point #197 (2473,7726) is out of range. (7726 >= 2483)
Point #197 (2473,7726) is invalid.
Point #198 (2474,9400) is out of range. (9400 >= 2483)
Point #198 (2474,9400) is invalid.
Point #199 (2475,8588) is out of range. (8588 >= 2483)
Point #200 (2483,6201) is out of range. (6201 >= 2483)
Point #201 (2485,9165) is out of range. (9165 >= 2483)
Point #201 (2485,9165) is invalid.
Point #202 (2488,6104) is out of range. (6104 >= 2483)
Point #203 (2495,6490) is out of range. (6490 >= 2483)
Point #203 (2495,6490) is invalid.
Point #204 (2505,8923) is out of range. (8923 >= 2483)
Point #204 (2505,8923) is invalid.
Point #205 (2518,8818) is out of range. (8818 >= 2483)
Point #206 (2520,1634) is invalid.
Point #207 (2527,8686) is out of range. (8686 >= 2483)
Point #208 (2527,9604) is out of range. (9604 >= 2483)
Point #208 (2527,9604) is invalid.
Point #209 (2529,8497) is out of range. (8497 >= 2483)
Point #210 (2533,7807) is out of range. (7807 >= 2483)
Point #210 (2533,7807) is invalid.
Point #211 (2533,7966) is out of range. (7966 >= 2483)
Point #211 (2533,7966) is invalid.
Point #212 (2534,6518) is out of range. (6518 >= 2483)
Point #212 (2534,6518) is invalid.
Point #213 (2534,7529) is out of range. (7529 >= 2483)
Point #213 (2534,7529) is invalid.
Point #214 (2541,8638) is out of range. (8638 >= 2483)
Point #215 (2557,8514) is out of range. (8514 >= 2483)
Point #215 (2557,8514) is invalid.
Point #216 (2560,7034) is out of range. (7034 >= 2483)
Point #216 (2560,7034) is invalid.
Point #217 (2567,8979) is out of range. (8979 >= 2483)
Point #217 (2567,8979) is invalid.
Point #218 (2581,3753) is out of range. (3753 >= 2483)
Point #219 (2582,9388) is out of range. (9388 >= 2483)
Point #219 (2582,9388) is invalid.
Point #220 (2602,6135) is out of range. (6135 >= 2483)
Point #221 (2604,5204) is out of range. (5204 >= 2483)
Point #221 (2604,5204) is invalid.
Point #222 (2605,6365) is out of range. (6365 >= 2483)
Point #222 (2605,6365) is invalid.
Point #223 (2633,4336) is out of range. (4336 >= 2483)
Point #223 (2633,4336) is invalid.
Point #224 (2634,9284) is out of range. (9284 >= 2483)
Point #224 (2634,9284) is invalid.
Point #225 (2652,2200) is invalid.
Point #226 (2675,3314) is out of range. (3314 >= 2483)
Point #226 (2675,3314) is invalid.
Point #227 (2675,8471) is out of range. (8471 >= 2483)
Point #227 (2675,8471) is invalid.
Point #228 (2676,7241) is out of range. (7241 >= 2483)
Point #228 (2676,7241) is invalid.
Point #229 (2691,2400) is invalid.
Point #230 (2697,8889) is out of range. (8889 >= 2483)
Point #230 (2697,8889) is invalid.
Point #231 (2701,3668) is out of range. (3668 >= 2483)
Point #231 (2701,3668) is invalid.
Point #232 (2703,7356) is out of range. (7356 >= 2483)
Point #232 (2703,7356) is invalid.
Point #233 (2709,1842) is invalid.
Point #234 (2712,3953) is out of range. (3953 >= 2483)
Point #234 (2712,3953) is invalid.
Point #235 (2724,3519) is out of range. (3519 >= 2483)
Point #235 (2724,3519) is invalid.
Point #236 (2745,9381) is out of range. (9381 >= 2483)
Point #236 (2745,9381) is invalid.
Point #237 (2753,6358) is out of range. (6358 >= 2483)
Point #237 (2753,6358) is invalid.
Point #238 (2755,5979) is out of range. (5979 >= 2483)
Point #238 (2755,5979) is invalid.
Point #239 (2760,6399) is out of range. (6399 >= 2483)
Point #239 (2760,6399) is invalid.
Point #240 (2767,7545) is out of range. (7545 >= 2483)
Point #240 (2767,7545) is invalid.
Point #241 (2769,4712) is out of range. (4712 >= 2483)
Point #241 (2769,4712) is invalid.
Point #242 (2786,9131) is out of range. (9131 >= 2483)
Point #242 (2786,9131) is invalid.
Point #243 (2787,3710) is out of range. (3710 >= 2483)
Point #243 (2787,3710) is invalid.
Point #244 (2790,8936) is out of range. (8936 >= 2483)
Point #244 (2790,8936) is invalid.
Point #245 (2795,3976) is out of range. (3976 >= 2483)
Point #245 (2795,3976) is invalid.
Point #246 (2796,9406) is out of range. (9406 >= 2483)
Point #246 (2796,9406) is invalid.
Point #247 (2798,8302) is out of range. (8302 >= 2483)
Point #247 (2798,8302) is invalid.
Point #248 (2812,4105) is out of range. (4105 >= 2483)
Point #248 (2812,4105) is invalid.
Point #249 (2814,8862) is out of range. (8862 >= 2483)
Point #249 (2814,8862) is invalid.
Point #250 (2817,8427) is out of range. (8427 >= 2483)
Point #250 (2817,8427) is invalid.
Point #251 (2819,2802) is out of range. (2802 >= 2483)
Point #251 (2819,2802) is invalid.
Point #252 (2824,8395) is out of range. (8395 >= 2483)
Point #252 (2824,8395) is invalid.
Point #253 (2835,8439) is out of range. (8439 >= 2483)
Point #253 (2835,8439) is invalid.
Point #254 (2836,4200) is out of range. (4200 >= 2483)
Point #254 (2836,4200) is invalid.
Point #255 (2839,1975) is invalid.
Point #256 (2845,2251) is out of range. (2845 >= 2845)
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.3, Jul, 24, 2016 
[0]PETSC ERROR: ./gflow.x on a x86_64-linux-gnu-real named gio by gio Fri Dec 16 12:45:04 2016
[0]PETSC ERROR: Configure options --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --with-silent-rules=0 --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --with-maintainer-mode=0 --with-dependency-tracking=0 --with-debugging=0 --shared-library-extension=_real --with-hypre=1 --with-hypre-dir=/usr --with-clanguage=C++ --with-shared-libraries --with-pic=1 --useThreads=0 --with-fortran-interfaces=1 --with-mpi-dir=/usr/lib/openmpi --with-blas-lib=-lblas --with-lapack-lib=-llapack --with-blacs=1 --with-blacs-lib="-lblacsCinit-openmpi -lblacs-openmpi" --with-scalapack=1 --with-scalapack-lib=-lscalapack-openmpi --with-mumps=1 --with-mumps-include="[]" --with-mumps-lib="-ldmumps -lzmumps -lsmumps -lcmumps -lmumps_common -lpord" --with-suitesparse=1 --with-suitesparse-include=/usr/include/suitesparse --with-suitesparse-lib="-lumfpack -lamd -lcholmod -lklu" --with-spooles=1 --with-spooles-include=/usr/include/spooles --with-spooles-lib=-lspooles --with-ptscotch=1 --with-ptscotch-include=/usr/include/scotch --with-ptscotch-lib="-lptesmumps -lptscotch -lptscotcherr" --with-fftw=1 --with-fftw-include="[]" --with-fftw-lib="-lfftw3 -lfftw3_mpi" --with-superlu=0 --CXX_LINKER_FLAGS=-Wl,--no-as-needed --prefix=/usr/lib/petscdir/3.7.3/x86_64-linux-gnu-real PETSC_DIR=/build/petsc-fA70UI/petsc-3.7.3.dfsg1 --PETSC_ARCH=x86_64-linux-gnu-real CFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC" CXXFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC" FCFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. -fstack-protector-strong -fPIC" FFLAGS="-g -O2 -fdebug-prefix-map=/build/petsc-fA70UI/petsc-3.7.3.dfsg1=. 
-fstack-protector-strong -fPIC" CPPFLAGS="-Wdate-time -D_FORTIFY_SOURCE=2" LDFLAGS="-Wl,-Bsymbolic-functions -Wl,-z,relro -fPIC" MAKEFLAGS=w
[0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
gioman commented 7 years ago

@Pbleonard I also tested your original (big) dataset, and it works as expected. Computing time for each pair is around 45-50 seconds. Any hint about what could be the problem with our dataset? Maybe it is the coordinate reference system? I'm not sure what CRS is used in your test data (it doesn't seem to be either 4326 or 3857).

pbleonard commented 7 years ago

@gioman thanks for bringing the small dataset to my attention. I know what the problem is there and will correct it shortly. As far as your inputs, I would tinker with the number of cores you're using (e.g., try dividing up your 32gb of RAM across 4 cores or some other combination until you get a feel for how much memory each worker core requires, and then scale up). Hope that helps. The test CRS is EPSG: 102003, but as long as your nodes and resistance surface use the same CRS it shouldn't matter.

pbleonard commented 7 years ago

@gioman small inputs in release updated

gioman commented 7 years ago

@Pbleonard

small inputs in release updated

thanks! it works as expected.

On the other hand I cannot get things going with our data. I tried to

  - lower the number of CPUs
  - lower the number of nodes
  - lower the cost map extent
  - lower the cost map resolution

the result is always the same: the script starts and stops at

Fri Dec 16 19:59:18 2016 >> Removed 0 islands (0 cells)

the RAM consumption grows up to ~25GB, then

gio@gio:~/GFlow$ sh ./execute_example.sh 
/usr/bin/mpiexec
sex dez 16 12:30:33 WET 2016
Fri Dec 16 12:30:33 2016 >> Effective resistance will be written to ./R_eff.csv.
Fri Dec 16 12:30:33 2016 >> Simulation will converge at 0.9
Fri Dec 16 12:30:33 2016 >> (rows,cols) = (2225,2727)
Fri Dec 16 12:30:35 2016 >> Removed 0 islands (0 cells).
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.

I'm attaching a small sample of our data, enough to replicate the described behavior. If you could spot anything wrong it would be very helpful. Thanks in advance.

sample.tar.gz

pbleonard commented 7 years ago

@gioman I believe I understand your problem. The input nodes appear to be in lat/long or geographic coordinates. The software requires coordinates relative to the resistance surface. I can share a script with you to do this OR you can input an .asc grid to GFlow (as your nodes - similar to Circuitscape) and it will automatically convert.
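As a rough illustration of what "coordinates relative to the resistance surface" means, the Python sketch below maps a projected (x, y) position to a grid cell using the ESRI ASCII grid header of the resistance map. The function name, the zero-based (row, col) result, and the corner values in the sample header are assumptions made for this example; the exact convention GFlow expects should be checked against execute_example.sh.

```python
import math

def world_to_cell(x, y, header):
    """Map a projected (x, y) coordinate to a zero-based (row, col) grid cell.

    `header` holds the ESRI ASCII grid header values of the resistance
    surface. Row 0 is the top (northernmost) row of the grid.
    Returns None when the point falls outside the grid.
    """
    cell = header["cellsize"]
    top = header["yllcorner"] + header["nrows"] * cell  # y of the upper edge
    col = math.floor((x - header["xllcorner"]) / cell)
    row = math.floor((top - y) / cell)
    if 0 <= row < header["nrows"] and 0 <= col < header["ncols"]:
        return row, col
    return None

# Grid size taken from the log above (2225 rows x 2727 columns, 50 m cells);
# the corner coordinates are hypothetical.
header = {"ncols": 2727, "nrows": 2225,
          "xllcorner": 100000.0, "yllcorner": 200000.0, "cellsize": 50.0}

print(world_to_cell(100025.0, 311225.0, header))  # a point inside the top-left cell
print(world_to_cell(-8.5, 39.2, header))          # lat/long values land off the grid
```

With a projected resistance map, raw lat/long values such as the last call fall far outside the grid, which would fit the symptoms described above.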

gioman commented 7 years ago

@Pbleonard

The software requires coordinates relative to the resistance surface

ahh, that's an important detail :)

I'm getting mixed results using nodes as .asc: with some subdatasets it works, but more frequently I get

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.

or

[0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.

I'd appreciate it if you could provide the script, so I can run further tests with nodes as "coordinates". Regards.

pbleonard commented 7 years ago

@gioman I updated the execute_example.sh to describe this. In the meantime please try this script: asc-to-nodelist.c.zip

Example:

  1. Compile: gcc asc-to-nodelist.c -o asc-to-nodelist.x
  2. Execute: ./asc-to-nodelist.x input/nodes.asc >> output/nodes
  3. Use output/nodes as input to GFlow

Not sure about your errors using input .asc though. Hopefully this will throw some light on that.
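For readers who want to see the idea behind such a conversion, here is a rough Python sketch. It is not the asc-to-nodelist.c script from the attachment: the function names, the zero-based (row, col, value) output, and the sample grid are assumptions for illustration, and the node list layout GFlow actually expects should be taken from execute_example.sh.

```python
import io

def read_asc(stream):
    """Parse an ESRI ASCII grid into (header dict, list of rows of floats)."""
    header = {}
    rows = []
    for line in stream:
        parts = line.split()
        if not parts:
            continue
        if parts[0][0].isalpha():          # header lines start with a keyword
            header[parts[0].lower()] = float(parts[1])
        else:                              # everything else is a data row
            rows.append([float(v) for v in parts])
    return header, rows

def node_cells(header, rows):
    """Return (row, col, value) for every cell that is not NODATA."""
    nodata = header.get("nodata_value", -9999.0)
    return [(r, c, v)
            for r, line in enumerate(rows)
            for c, v in enumerate(line)
            if v != nodata]

# Hypothetical 3 x 4 nodes grid with three node cells.
sample = io.StringIO("""ncols 4
nrows 3
xllcorner 0
yllcorner 0
cellsize 50
NODATA_value -9999
-9999 1 -9999 -9999
-9999 -9999 -9999 2
3 -9999 -9999 -9999
""")
header, rows = read_asc(sample)
for r, c, v in node_cells(header, rows):
    print(r, c, int(v))
```

Only the three cells holding a node id are printed; every -9999 cell is skipped.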

gioman commented 7 years ago

@gioman I updated the execute_example.sh to describe this. In the meantime please try this script: asc-to-nodelist.c.zip

@Pbleonard thanks a lot as usual for the quick feedback. Does the script have any switch/flag to translate only the non-NULL pixels of the input node .asc file to relative coordinates? For me it translates all the pixels to relative coordinates, including NULLs (defined as -9999 here).

gioman commented 7 years ago

@Pbleonard Hi again,

I think I figured out the cause of most of the problems/errors: pixels must be perfectly square, and they must have the same size in the resistance and nodes maps (I also think that at some point I wrote a .asc file with GDAL that was really a GeoTIFF...).
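A quick sanity check along these lines could be scripted before launching a run. The Python sketch below is only an illustration: the function names are made up, and comparing grid dimensions and origin in addition to cell size is an assumption based on the Circuitscape-style input requirements mentioned in this thread. It relies on GDAL's AAIGrid driver writing `dx`/`dy` instead of `cellsize` when pixels are not square.

```python
def read_header(lines):
    """Collect the keyword/value header of an ESRI ASCII grid.

    `lines` is any iterable of text lines, e.g. an open file. Raises
    ValueError when no ncols/nrows are found, as happens when a binary
    raster was saved with an .asc extension.
    """
    header = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0][0].isalpha():
            break                              # first data row reached
        header[parts[0].lower()] = parts[1]
    if "ncols" not in header or "nrows" not in header:
        raise ValueError("input does not look like an ESRI ASCII grid")
    return header

def grid_mismatches(resistance, nodes):
    """List the header problems found when comparing two grids."""
    problems = []
    for name, hdr in (("resistance", resistance), ("nodes", nodes)):
        if "cellsize" not in hdr:              # dx/dy written instead
            problems.append(f"{name}: no cellsize, pixels may not be square")
    for key in ("ncols", "nrows", "xllcorner", "yllcorner", "cellsize"):
        a, b = resistance.get(key), nodes.get(key)
        if a is not None and b is not None and float(a) != float(b):
            problems.append(f"{key}: {a} != {b}")
    return problems

# Hypothetical headers: the nodes grid has non-square pixels.
res = read_header(["ncols 2727", "nrows 2225", "xllcorner 0",
                   "yllcorner 0", "cellsize 50", "NODATA_value -9999"])
nod = read_header(["ncols 2727", "nrows 2225", "xllcorner 0",
                   "yllcorner 0", "dx 50", "dy 49.9", "NODATA_value -9999"])
print(grid_mismatches(res, nod))
```

With real files the headers could be read with something like `read_header(open("resistance.asc", errors="replace"))`, where the file name is a placeholder.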

Using a subset of our dataset as input, with both the resistance and the nodes maps clipped to a rectangle (resistance map 1131x735 at 50m resolution and nodes are 140), GFlow solves (pairwise) 3240 pairs in about 40 minutes(!) on a 12-core machine.

screenshot from 2016-12-18 16-03-57

On the same 12-core machine, if using the full dataset (resistance map 2727x2225 at 50m resolution and nodes are 207), Gflow will try to solve 365940 pairs with an ETA of 10 days, which is still ~1/3 of the time Circuitscape took to process the same dataset on the same machine.

If Gflow computed all those maps at 45mb each (the resistance map is 45mb) it would create 16TB of "temp" files.
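The disk estimate can be reproduced with a couple of lines of Python. This is just the arithmetic behind the figure, assuming one output map per pair, each the same size as the 45mb resistance map, and decimal units (1 TB = 1,000,000 MB); the function name is made up for the example.

```python
def output_size_tb(n_pairs, mb_per_map):
    """Total size in terabytes (decimal units) of one output map per pair."""
    return n_pairs * mb_per_map / 1_000_000

print(round(output_size_tb(365940, 45), 1))   # 16.5
```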

Does any of the above numbers make sense to you? Just asking to understand if the observations here are in line with the expected ones for such datasets.

pbleonard commented 7 years ago

@gioman Good to hear you've got it working. A few things:

You are correct that inputs basically have the same requirements as Circuitscape, unless you input a list of nodes directly (then no grid is required and resolution etc. does not matter); GFlow only needs to know where on the resistance grid a 'node' is located. What I mean to say is you could develop a script or tool that takes points in a GIS or GDAL and gives you the x,y coordinates relative to the resistance surface grid. Along that line, I was unclear what you meant about the script I sent. It should ignore all -9999 cells in the .asc (assuming .asc is formatted similarly for use in Circuitscape) and only give you a list of input locations on the grid.

So your large problem solves ~ 3 million unknowns at each iteration? (You can check the output log). The first example problem I uploaded was significantly larger solving ~34 million unknowns at each iteration but I typically use a minimum of 60 CPUs to solve these problems. The current example is smaller for folks to use with desktop computing.

I am confused about how you're calculating your pairwise solves. In your small problem you mention 140 nodes, so all pairwise should be n*(n-1)/2 = 9,730 total solves, and your large problem of 207 nodes should again be (207*206)/2 = 21,321 solves. Where were you getting the 365940 below?

(resistance map 2727x2225 at 50m resolution and nodes are 207), Gflow will try to solve 365940 pairs
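The pair counts are easy to cross-check in Python. The sketch below assumes GFlow solves every unordered pair of nodes, which matches the first log in this thread (587 points, 171991 pairs); the function names are made up for the example. Inverting the formula shows how many nodes a reported pair count corresponds to.

```python
import math

def pair_count(n_nodes):
    """Number of unordered node pairs: n * (n - 1) / 2."""
    return n_nodes * (n_nodes - 1) // 2

def nodes_for_pairs(n_pairs):
    """Invert pair_count; returns None if n_pairs is not such a count."""
    n = (1 + math.isqrt(1 + 8 * n_pairs)) // 2
    return n if pair_count(n) == n_pairs else None

print(pair_count(140), pair_count(207))   # 9730 21321
print(nodes_for_pairs(3240))              # 81
print(nodes_for_pairs(365940))            # 856
```

Under that assumption, 3240 and 365940 pairs correspond to 81 and 856 nodes, so the number of nodes GFlow read from the input differs from the 140 and 207 quoted.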

You are now using CPU flag = 12 and this is being done with 16gb RAM? That sounds like it should speed up your solves by more than 2/3 from Circuitscape using only 1 core.

gioman commented 7 years ago

@Pbleonard Hi! good morning.

Along that line, I was unclear what you meant about the script I sent. It should ignore all -9999 cells in the .asc (assuming .asc is formatted similarly for use in Circuitscape) and only give you a list of input locations on the grid

there may be something wrong in the .asc I'm using, the script is definitely not ignoring -9999 cells. Could you kindly add the .asc version of the nodes to the test dataset, so I can have a look at it?

So your large problem solves ~ 3 million unknowns at each iteration? (You can check the output log). The first example problem I uploaded was significantly larger solving ~34 million unknowns at each iteration but I typically use a minimum of 60 CPUs to solve these problems. The current example is smaller for folks to use with desktop computing.

I am confused about how you're calculating your pairwise solves. In your small problem you mention 140 nodes, so all pairwise should be n*(n-1)/2 = 9,730 total solves, and your large problem of 207 nodes should again be (207*206)/2 = 21,321 solves. Where were you getting the 365940 below?

I didn't really calculate those numbers, I just reported what the output shows:

this is what the output looks like for our "small" dataset, which is ~1/3 (in area) of our "full" resistance map and has ~1/2 the number of nodes:

...
Mon Dec 19 10:16:36 2016 >> Solving pair 502 (503 of 3240): 30[313,59] to 58[464,86].    7.67 Km apart
Mon Dec 19 10:16:37 2016 >> Result /media/gio/DADOS/test_small/local_000501.asc written.
Mon Dec 19 10:16:37 2016 >> convergence-factor = 9.999869e-01 (4-N)
Mon Dec 19 10:16:37 2016 >> R_eff = 30,58,281.050873
Mon Dec 19 10:16:37 2016 >> Estimated time remaining: 00:36:36
Mon Dec 19 10:16:37 2016 >> Solving pair 503 (504 of 3240): 50[425,405] to 74[617,233].   12.89 Km apart
Mon Dec 19 10:16:37 2016 >> Result /media/gio/DADOS/test_small/local_000502.asc written.
Mon Dec 19 10:16:37 2016 >> convergence-factor = 9.999865e-01 (4-N)
Mon Dec 19 10:16:37 2016 >> R_eff = 50,74,315.556901
Mon Dec 19 10:16:37 2016 >> Estimated time remaining: 00:36:35
Mon Dec 19 10:16:37 2016 >> Solving pair 504 (505 of 3240): 34[326,187] to 45[395,507].   16.37 Km apart
...

this is what the output looks like for our full dataset

...
Mon Dec 19 10:40:15 2016 >> Solving pair 192 (193 of 365940): 717[1286,494] to 833[1602,1130].   35.51 Km apart
Mon Dec 19 10:40:18 2016 >> Result /media/gio/DADOS/test_full/local_000191.asc written.
Mon Dec 19 10:40:18 2016 >> convergence-factor = 9.997431e-01 (3-N)
Mon Dec 19 10:40:18 2016 >> R_eff = 717,833,594.121162
Mon Dec 19 10:40:18 2016 >> Estimated time remaining: 212:20:15
Mon Dec 19 10:40:18 2016 >> Node (570,738) has zero resistance (most likely).
Mon Dec 19 10:40:18 2016 >> Solving pair 194 (195 of 365940): 536[989,1177] to 656[1180,101].   54.64 Km apart
Mon Dec 19 10:40:21 2016 >> Result /media/gio/DADOS/test_full/local_000193.asc written.
Mon Dec 19 10:40:21 2016 >> convergence-factor = 9.997197e-01 (3-N)
Mon Dec 19 10:40:21 2016 >> R_eff = 536,656,921.382478
Mon Dec 19 10:40:21 2016 >> Estimated time remaining: 211:47:00
Mon Dec 19 10:40:21 2016 >> Solving pair 195 (196 of 365940): 128[479,2223] to 371[825,870].   69.83 Km apart
...

When processing the full dataset at the start we are getting also a number of

...
Point #653 (1180,48) is invalid.
Point #654 (1180,49) is invalid.
Point #657 (1181,48) is invalid.
Point #658 (1181,49) is invalid.
...

but all the nodes overlap the resistance map. What could be the problem with them?

You are now using CPU flag = 12 and this is being done with 16gb RAM? That sounds like it should speed up your solves by more than 2/3 from Circuitscape using only 1 core.

For the "small" dataset computation time (~40 minutes) is very fast, while I'm not sure why for the "full" dataset computation time is much higher in proportion (the machine we are using has 32Gb of RAM).

gioman commented 7 years ago

@Pbleonard disregard part of my previous comment, there was something odd in our .asc nodes file. We recreated it and are now running GFlow on our full dataset:

...
Mon Dec 19 11:29:01 2016 >> Solving pair 18 (19 of 23653): 13[513,1157] to 85[1009,559].   38.85 Km apart
Mon Dec 19 11:29:05 2016 >> Result /media/gio/DADOS/test_full/local_000017.asc written.
Mon Dec 19 11:29:05 2016 >> convergence-factor = 9.896743e-01 (1-N)
Mon Dec 19 11:29:05 2016 >> R_eff = 13,85,222.656701
Mon Dec 19 11:29:05 2016 >> Estimated time remaining: 23:05:20
Mon Dec 19 11:29:05 2016 >> Solving pair 19 (20 of 23653): 187[1528,398] to 217[1994,921].   35.02 Km apart
...
pbleonard commented 7 years ago

@gioman That's great news. Looks like less than 24 hours there for your big solve? Just in case it's still relevant, here is a sample .asc for which the script should work: nodes.asc.zip

gioman commented 7 years ago

@Pbleonard

That's great news. Looks like less than 24 hours there for your big solve?

yes! this is really great news! 24 hours against 30 days is really a huge improvement, many thanks for your work.

Just in case it's still relevant, here is a sample .asc for which the script should work: nodes.asc.zip

thanks, it is indeed useful as a reference. It seems that the tools I use (gdal rasterize/translate in this case) do not always return an .asc file formatted as GFlow/Circuitscape expects.

cheers!