gflow / GFlow

Software for modeling circuit theory-based connectivity
GNU General Public License v3.0

Error message: invalid points #15

Open RMarrec opened 7 years ago

RMarrec commented 7 years ago

Hi, I am running GFlow with different resistance and node data, as well as different spatial extents (from 20x20 km windows up to the full Alberta extent). In the example presented below, I want to calculate the cost-weighted distance between all pairs of nodes.

Even though the coordinates given in my node file fall within the map extent, I sometimes get this error message:

Mon Apr 10 11:52:12 2017 >> Effective resistance will be written to ./Zone1/R_eff_Forest_Zone1.csv.
Mon Apr 10 11:52:12 2017 >> (rows,cols) = (293,293)
Mon Apr 10 11:52:12 2017 >> Removed -1 islands (0 cells).
Mon Apr 10 11:52:12 2017 >> 226 points in nodes_Forest_Zone1
Point #1 (289,231) is invalid.
Point #2 (288,258) is invalid.
...
Point #225 (14,210) is invalid.
Point #226 (9,149) is invalid.
Input file nodes_Forest_Zone1 contains invalid points.

In this case, all points seem invalid. Do you know why? Is it due to the point locations in themselves or to the resistance data?

I have attached the resistance and node files corresponding to this example: Example_ForestZone1.zip

Thank you, Ronan

eduffy commented 7 years ago

Hi Ronan - My guess is that the problem is your resistance file using -1.#INF as its NODATA_value. The parser expects all resistance values to be real numbers. Is it possible for you to change that to -9999?

eduffy commented 7 years ago

To quickly change the file, you can run this command:

 sed -e 's/1.#INF/9999/g' -i resistance_Zone1.asc
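
If the parser is still rejecting the grid, a quick way to confirm that only plain numbers remain after the header is to scan for stray tokens (a minimal sketch, assuming the standard six-line ESRI ASCII header; resistance_Zone1.asc is the file from this thread):

 awk 'NR > 6 { for (i = 1; i <= NF; i++)
                 if ($i !~ /^-?[0-9][0-9.eE+-]*$/)
                   print "non-numeric token on line " NR ": " $i }' resistance_Zone1.asc

Any token it reports (for example a leftover -1.#INF) is something the reader will refuse.
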
mairindeith commented 7 years ago

Hi all - I'm having the same issue with my GFlow setup, even when using 9999 as my NODATA_value. Did this solution resolve your problems Ronan?

Here is the error I'm receiving after running GFlow:

Sat Jun 10 10:55:11 2017 >> Simulation will converge at 0.99
Sat Jun 10 10:55:11 2017 >> (rows,cols) = (3915,5839)
Sat Jun 10 10:55:14 2017 >> Removed 215 islands (36459 cells).
Point #1 (1054,3332) is invalid.
Point #2 (1041,3322) is invalid.
Point #3 (1042,3322) is invalid.
Point #4 (1041,3323) is invalid.
Point #5 (1042,3323) is invalid.
...
Point #83 (1053,3338) is invalid.
Input file `../Nodes/3/NodesTXT/nodes_5.0210838_3.txt` contains invalid points.
Sat Jun 10 10:55:16 2017 >> Node (1054,3332) has zero resistance (most likely).
Sat Jun 10 10:55:16 2017 >> Node (1054,3332) has zero resistance (most likely).
Sat Jun 10 10:55:16 2017 >> Node (1054,3332) has zero resistance (most likely).
...

I'm running Ubuntu 17.04, and have attached my .txt, .tsv, and resistance map file below: [link removed.]

Thanks! Mairin

eduffy commented 7 years ago

Hi @mairindeith - It looks like the x,y positions in your nodes.txt file are transposed. Try running this to swap the columns:

 awk -e '{print $2, $1}' nodes.txt >nodes2.txt
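
To double-check the result, you can verify that every swapped pair now falls inside the grid (a rough sketch, assuming one "row col" pair per line with 1-based indices, as noted later in this thread, and the (3915,5839) grid from the log above):

 awk -v rows=3915 -v cols=5839 \
     '($1 < 1 || $1 > rows || $2 < 1 || $2 > cols) { print "out of range on line " NR ": " $0 }' nodes2.txt

If it prints nothing, every node lies within the resistance grid.
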
mairindeith commented 7 years ago

@eduffy - you're absolutely right, problem solved! Thank you for the quick reply and the quicker bash command to fix it!

eduffy commented 7 years ago

@RMarrec - Did changing the NODATA_value value fix the problem?

RMarrec commented 7 years ago

Hi @eduffy - I am very sorry for the long delay before finally answering you. I tested what you proposed today: I changed the "-1.#INF" NoData values to -9999 and it works! Thank you so much! NB: I just replaced the values in the original .asc file directly and did not make this change in a GIS software, as I did not find an easy way to do it. I do not know whether this "dirty" way of proceeding might change anything? Sorry again for the delay, and thank you for your help! Ronan.

eduffy commented 7 years ago

Hi @RMarrec - It looks like your resistance file has a lot of 0's not -9999's. These are short-circuits, not open-circuits. Try changing the zeros to NODATA like this:

 sed -e 's/\b0[ \n]/-9999 /g' -i w_1065_energy.asc
pbleonard commented 7 years ago

@RMarrec - Also, for the sake of others: node x,y locations cannot be zero, as no row or column begins with zero. As far as scaling resistance goes, resistances that you might treat as 0s or 1s should be -9999, unless in a special case 1 has a meaning different from -9999 and you would be using both values. And finally, check the headers of your resistance files to make sure the case is exactly the same as in the example inputs (a quick check is sketched below). This issue has been corrected in @eduffy's latest commit.
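
One quick way to eyeball the header case is to compare it with one of the repository's example grids (a minimal sketch; the example path is hypothetical, so substitute whichever .asc ships with the GFlow examples):

 head -6 resistance_Zone1.asc
 head -6 path/to/GFlow/example/resistance_example.asc   # hypothetical path to a shipped example grid

The six keyword lines (typically ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value) should match the example's spelling and case exactly; as noted below, R may write them in a different case by default.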

RMarrec commented 7 years ago

Thank you all for your answers.

@eduffy - These 0s are "real" 0s and do not belong to pixels outside the studied, mapped area. In my case I use a range from 0 to 1 based on the degree of human modification: if a location is unmodified it has a value of 0; if it is highly modified it has a value of 1.

@eduffy @Pbleonard - As I understand it, the NoData value (-9999) applies to pixels where you do not want the current to flow through. Am I right? In that case, I cannot apply -9999 to all 0 pixels... In addition, in some of the landscape windows I study there are a lot of 0s, as we study a gradient of landscape modification.

@Pbleonard - I agree with you about the 0s in the node file; it was a mistake during the conversion from actual coordinates to pixel coordinates. I have solved this issue. In the same way, I fixed the case in the resistance file header; this comes from R, which writes the header keywords in uppercase by default.

pbleonard commented 7 years ago

@RMarrec - You'll notice that in Dickson et al. 2016 they rescale resistance R as R = (H + 1)^10 + s/4, where H = HMI (human modification index) and s = slope. I would also recommend Belote et al. 2016 for a discussion of scaling resistance values, as they also worked with HMI in both a linear and a nonlinear fashion. In short, it is OK to use 1 as a meaningful value of resistance, but I would not use 0, as that creates short circuits, as @eduffy points out. You would still use -9999 for infinite resistance (no data) or complete barriers.
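
As an illustration of that rescaling on an ESRI ASCII grid (a minimal sketch, not the authors' code; it assumes an input file hmi.asc holding H in [0,1] with a six-line header and -9999 as NODATA, and it ignores the slope term):

 awk 'NR <= 6 { print; next }
      { for (i = 1; i <= NF; i++)
          printf "%s ", (($i == -9999) ? $i : ($i + 1)^10)   # R = (H + 1)^10, NODATA passed through
        printf "\n" }' hmi.asc > resistance_rescaled.asc

With H in [0,1] this maps unmodified cells to 1 and fully modified cells to 2^10 = 1024, so zeros (short circuits) never appear in the resistance grid.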

RMarrec commented 7 years ago

@Pbleonard - I use Dickson's equation to rescale resistance as well, but I did not know it was also meant to avoid 0s; I thought of it as a way to change the distribution of resistance values. In any case, 0s are not allowed as resistance values in Circuitscape, which is probably why resistance in published studies generally ranges from 1 up to 10, 100, or 1000.

HossamGhub commented 6 years ago

Hi Paul and Edward, @pbleonard @eduffy

First, thanks very much for the great software! I am continuing Ronan's (@RMarrec) work on Alberta's resistance maps. I have no problems when I work with 100 m resolution maps. Once I start my GFlow runs on the 30 m and 10 m resolution maps I run into two main issues: first, the invalid node "points" that Ronan faced, and secondly, I think, a computing-performance issue with the 10 m resolution maps. Could you please advise how to fix the invalid-points issue (I have tried all of the above recommendations and am in close contact with Ronan) and how to overcome the computation-capacity issue, if possible? It would also be great if you could explain how to compile GFlow on a cluster that uses the SLURM scheduler.

Thank you so much in advance,

Hossam ———————————————— Hossam Abdel Moniem, Ph.D. Postdoctoral Research Associate Department of Biology University of Toronto – Mississauga (UTM)

eduffy commented 6 years ago

Hi Hossam -

Glad you find this program useful!

  1. Can you send us a link to a resistance map that you're getting the invalid-nodes error on?
  2. Fixing your performance issue is tricky without access to the system you're on. How many unknowns are you trying to solve? How many CPUs?
  3. Every cluster is set up differently. Many use a command called "module" to set up your environment (such as choosing the right compiler and the proper MPI libraries). The admins of your cluster should have these things documented ... or better yet, offer some training to get you started. The execute_example.sh in our repository would be a good place to start; running a job on a cluster requires a script with two parts: (a) a list of hardware resources you require to run your program, and (b) the shell commands required to actually run the program (a minimal skeleton is shown after this list). We can help with the second part, but the first part is specific to the cluster you're running on.
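
A minimal skeleton of such a two-part script (the resource values, module name, and input filenames are placeholders, not recommendations from this thread; adapt them to your cluster):

 #!/bin/bash
 # (a) hardware resources -- syntax and sensible values are cluster-specific
 #SBATCH --nodes=2
 #SBATCH --ntasks=80
 #SBATCH --time=02:00:00
 # (b) the shell commands that actually run the program
 module load mpi                     # placeholder; load whatever provides MPI and PETSc on your cluster
 cd $SLURM_SUBMIT_DIR
 mpiexec ./gflow.x \
     -habitat resistance.asc \
     -nodes nodes.txt \
     -converge_at 3N \
     -effective_resistance ./R_eff.csv
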
HossamGhub commented 6 years ago

Hi Edward,

Thank you for your reply, much appreciated! Please have a look at the screenshot for the message I get when I run GFlow on the 10 m resolution map. I am still trying with the 30 m resolution map and will know tomorrow. Also, I forwarded your message to our cluster admin here at UTM to see what he can do.

Best,

Hossam

HossamGhub commented 6 years ago

Hi Edward,

I am trying to use Gflow on computecanada (Niagara HPC) to compute connectivity using a 10m resolution resistance map for the extent of Alberta, Canada. After I successfully compiled and installed Gflow on Niagara, I started with a computation that I know for sure works (working with the 100m resolution). However, I ran into different memory problems. I needed to modify the bash script to have both the instructions for the cluster to allocate resources and the Gflow execution commands. Still having problems! I am attaching the most recent code I am using. Please let me know if I am doing anything wrong!

#!/bin/bash 
#SBATCH --nodes=4
#SBATCH --ntasks=160
#SBATCH --time=2:00:00
#SBATCH --job-name U_100_Nia7
#SBATCH --mail-type=ALL
#SBATCH --mail-user=hossam.hafez@utoronto.ca
#SBATCH --output=mpi_U_100_Nia7.out

cd $SLURM_SUBMIT_DIR

module load intel/2018.2  intelmpi/2018.2
module load petsc/3.8.4

which mpiexec

export PETSC_DIR=${SCINET_PETSC_ROOT}
export LD_LIBRARY_PATH=${PETSC_DIR}/lib:$LD_LIBRARY_PATH

OUTPUT_DIR=.

SECONDS=0
date

mpiexec ./gflow.x \
    -habitat r_u_100.asc \
    -nodes nodesYX_100m_MODIF \
    -converge_at 3N \
    -shuffle_node_pairs 1 \
    -effective_resistance ./R_eff_U_100.csv \
    -output_sum_density_filename "/CON_U_100m.asc" \

: "walltime: $SECONDS seconds"

In addition, I have this inquiry from Niagara support team:

**"....Above, log file with nn08 means 8 nodes (320 cores) and nn16 means 16 nodes. As you can see GFlow invokes a function from PETSc library which requests 307GB memory independent of the total number of MPI ranks (cores). So this can point to a fundamental problem in the scalability of GFlow.

Do you know anyone used GFlow on hundreds of compute cores successfully? Is the scalability / performance proven on large HPC Clusters? "**

Thank you very much in advance!

Hossam

eduffy commented 6 years ago

Hi Hossam - yes, we've solved over a billion unknowns on at least 400 CPUs. I'm not sure what that 300 GB allocation at startup is; we wrote this to be very memory conscious. What log is your admin referencing? Can you email that to me?

sent from my mobile.

HossamGhub commented 6 years ago

Dear Edward,

Please see the thread below regarding the memory problem I have with Gflow on Niagara-SciNet. Please advise…

Thanks much, Hossam

Did you share your log files with the developers of GFlow? Could you try a smaller simulation on Niagara?

I want to share something which is probably the reason for the error. Please follow along carefully.

This is the log file from one of the simulations:

 fertinaz@nia-login06:/scratch/s/scinet/fertinaz/GFlow$ more log.gflow.nn08.txt
 Thu Jun 7 00:24:12 2018 >> Effective resistance will be written to /scratch/s/scinet/fertinaz/GFlow/R_eff_FUTW_10.csv.
 Thu Jun 7 00:24:12 2018 >> Simulation will converge at 0.999
 Thu Jun 7 00:29:33 2018 >> (rows,cols) = (187884,109968)
 [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
 [0]PETSC ERROR: Out of memory. This could be due to allocating

Right before the code fails, it prints three lines, as you can see:

 Thu Jun 7 00:24:12 2018 >> Effective resistance will be written to /scratch/s/scinet/fertinaz/GFlow/R_eff_FUTW_10.csv.
 Thu Jun 7 00:24:12 2018 >> Simulation will converge at 0.999
 Thu Jun 7 00:29:33 2018 >> (rows,cols) = (187884,109968)

When you check those messages in the source code:

 fertinaz@nia-login06:~/GFlow$ grep -rn "Effective resistance will be written" *
 gflow.c:130:      message("Effective resistance will be written to %s.\n", reff_path);

which leads to:

 if(strlen(reff_path) > 0) {
    message("Effective resistance will be written to %s.\n", reff_path);
    truncate(reff_path, 0);   /* Empty the file now, we'll have to reopen and append to it every iteration */
 }
 if(strlen(convergence) > 0) {
    char *p;
    converge_at = strtod(convergence, &p);
    if(p[0] == 'N')
       converge_at = 1. - pow(10., -converge_at);
    if(converge_at < 0. || converge_at > 1.) {
       message("Error. Convergence factors must be between 0 and 1.\n");
       MPI_Abort(MPI_COMM_WORLD, 1);
    }
    message("Simulation will converge at %lg\n", converge_at);
 }
 read_complete_solution();   /* TODO: Need to remove this feature */
 }

Code is executed successfully up to the last message function above. We know that because it is printed to the screen.

So it comes to the line at the bottom, read_complete_solution();. Notice that there is a comment next to this call which says "TODO: Need to remove this feature".

Also, to check the content of that function:

 fertinaz@nia-login06:~/GFlow$ grep -rn "read_complete_solution()"
 gflow.c:144:   read_complete_solution();   /* TODO: Need to remove this feature */
 output.c:358:void read_complete_solution()
 output.h:52:void read_complete_solution();

That function is implemented in the output.c file. See line 358 in that file:

 // I hope to delete this section ASAP
 void read_complete_solution()
 {
    char solfile[PATH_MAX] = { 0 };
    PetscBool flg;
    gzFile f;
    int count;

    PetscOptionsGetString(PETSC_NULL, NULL, "-complete_solution", solfile, PATH_MAX, &flg);
    if(flg) {
       message("Reading complete solution from %s\n", solfile);
       f = gzopen(solfile, "r");
       gzread(f, &count, sizeof(int));
       final_current = (float *)malloc(sizeof(float) * count);
       gzread(f, final_current, sizeof(float) * count);
       gzclose(f);
    }
 }

As you can see, the function starts with the comment "I hope to delete this section ASAP".

So, I hope the developer of GFlow will delete that section before your deadline.

Please contact them, send your log files, and tell them how you installed it as well. You can copy this email too. Also mention that your input file is a huge one. Hopefully they can give you some suggestions.

Hope this helps

// Fatih

On Jun 10, 2018, at 3:20 AM, Fatih Ertinaz <fertinaz@scinet.utoronto.ca> wrote:

Hossam

The tempnode that I used to run this job on at UTM has 96GB of memory.

That's interesting. As I said earlier, each node on Niagara has 202GB of memory (and 40 cores). Therefore, when you use n nodes for your job, you will have n × 202 GB of memory (and n × 40 cores) of dedicated resources available for your usage.

Are you using exactly the same inputs as your case at UTM?

Also my other questions remain:

Did you share your log files with the developers of GFlow? Could you try a smaller simulation on Niagara?

// Fatih

On Jun 9, 2018, at 10:35 PM, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Hi Fatih, The tempnode that I used to run this job on at UTM has 96GB of memory. How much memory do I have access to on Niagara? Is there a way to use more memory? Thank you very much Hossam

Sent from my iPhone

On Jun 8, 2018, at 5:51 PM, Fatih Ertinaz <fertinaz@scinet.utoronto.ca> wrote:

Hossam

Gflow runs perfectly and computation time is fast, BUT unfortunately the final output is not produced because of the error at the end of the output file! Do I have enough space on Niagara for this output? I am running everything now on $SCRATCH. Again, this is a calculation that I've done before on a temporary node at UTM with much less power.

I think it is not a matter of power or space. It is a matter of memory.

GFlow might be doing the final file write operation in the master processor which then consumes all the memory in a single node no matter how many nodes you have for the computation. However I am just speculating right now, just trying to make guesses.

How much memory do you have in the computer you use at UTM? Did you share your log files with the developers of GFlow? Could you try a smaller simulation on Niagara?

// Fatih

On Jun 8, 2018, at 5:15 PM, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Hi all,

Please check the attachments for the results of the last run after all the fixes! In conclusion, Gflow runs perfectly and computation time is fast, BUT unfortunately the final output is not produced because of the error at the end of the output file! Do I have enough space on Niagara for this output? I am running everything now on $SCRATCH. Again, this is a calculation that I've done before on a temporary node at UTM with much less power. I am redoing it on Niagara to establish the procedure before I start the actual run that I need to get done by the 18th.

Thank you all so much for all the help and support,

Hossam


Hossam M. A. Abdel Moniem, Ph.D.

Postdoctoral Research Associate Department of Biology University of Toronto – Mississauga (UTM) 3359 Mississauga Road Mississauga, ON, Canada L5L 1C6


From: fertinaz@scinet.utoronto.ca Sent: Thursday, June 7, 2018 10:19 AM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca> Cc: mponce@scinet.utoronto.ca; Daniel Gruner <dgruner@scinet.utoronto.ca>; support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

Hossam

I’ve checked your script. There are a couple of different problems.

First one is something you had earlier as well:

Why are you calling gflow.x two times? Is that really needed?

Your job script first executes: mpirun ./gflow.x

Right after that you call: sh execute_Hindex_new.sh

which runs:

 mpiexec -n 20 /gpfs/fs1/home/w/wagnerh1/hosscca/GFlow/gflow.x \
     -habitat /gpfs/fs0/scratch/w/wagnerh1/hosscca/r_fu_new_10.asc \
     -nodes nodesYX_10m_MODIF \
     -converge_at 3N \
     -shuffle_node_pairs 1 \
     -effective_resistance ./R_eff_FUTW_10.csv \
     -output_sum_density_filename "/gpfs/fs0/scratch/w/wagnerh1/hosscca/CON_FUTW_10m.asc"

No need to run the same thing twice I assume.

Second problem:

It looks like "mpirun ./gflow.x" is not the right usage. It needs flags which are written inside the execute_Hindex_new.sh [root@nia-login05 gFlow]# grep -rn "does not exist" * FU_10_4.out:4: does not exist FU_10_5.out:4: does not exist gflow.c:122: message("%s does not exists\n", node_file); gflow.c:126: message("%s does not exists\n", node_pair_file); GView/gview.c:125: fprintf(stderr, "%s does not exist\n", filename); habitat.c:37: fprintf(stderr, "%s does not exist\n", filename); mean-current.c:69: fprintf(stderr, "%s does not exist\n", filename); mpi_U_100_Nia2.out:1: does not exist mpi_U_100_Nia3.out:1: does not exist mpi_U_100_Nia4.out:1: does not exist mpi_U_100_Nia.out:1: does not exist

The error messages above come from "mpirun ./gflow.x”

Since I didn't know how gflow works exactly, I didn’t check its details. I wanted to give you an overview how jobs are submitted and run parallel.

To resolve this problem, you should start with removing “mpirun ./gflow.x”.

Then just copy the following part from execute_Hindex.sh and paste it into your current job script. Everything else in that script looks unnecessary.

 mpiexec -n 20 /gpfs/fs1/home/w/wagnerh1/hosscca/GFlow/gflow.x \
     -habitat /gpfs/fs0/scratch/w/wagnerh1/hosscca/r_fu_new_10.asc \
     -nodes nodesYX_10m_MODIF \
     -converge_at 3N \
     -shuffle_node_pairs 1 \
     -effective_resistance ./R_eff_FUTW_10.csv \
     -output_sum_density_filename "/gpfs/fs0/scratch/w/wagnerh1/hosscca/CON_FUTW_10m.asc"

You can replace mpiexec with mpirun or srun. Also, get rid of "-n 20"; even if you request 4 nodes (160 cores), you are only using 20 cores because of that option.

Then don’t forget to remove "sh execute_Hindex_new.sh” as well.

This script is the one I use now. Change it according to your settings:

#!/bin/bash
#SBATCH --job-name=array_job_test    # Job name
#SBATCH --nodes=8
#SBATCH --ntasks=320
#SBATCH --time=03:00:00              # Time limit hrs:min:sec
#SBATCH --output=mpioutput%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2 intelmpi/2018.2
module load petsc/3.8.4

mpirun $HOME/GFlow/gflow.x \
    -habitat $SCRATCH/GFlow/r_fu_new_10.asc \
    -nodes nodesYX_10m_MODIF \
    -converge_at 3N \
    -shuffle_node_pairs 1 \
    -effective_resistance $SCRATCH/GFlow/R_eff_FUTW_10.csv \
    -output_sum_density_filename "$SCRATCH/GFlow/CON_FUTW_10m.asc"

Note that I am running gflow.x, which is installed in my home directory, but writing the outputs to the GFlow directory located in $SCRATCH. I suggest you do the same: 1) to keep everything better organized; 2) $HOME is a backed-up area, so in case you lose your code we can recover it; 3) as you know, $SCRATCH is a larger disk space.

However, this one also doesn't solve the issue. Third problem:

 fertinaz@nia-login06:/scratch/s/scinet/fertinaz/GFlow$ grep -rn "Memory" log.gflow.*
 log.gflow.nn08.txt:8:[0]PETSC ERROR: Memory allocated 0 Memory used by process 154188738560
 log.gflow.nn08.txt:10:[0]PETSC ERROR: Memory requested 330579643392
 log.gflow.nn16.txt:8:[0]PETSC ERROR: Memory allocated 0 Memory used by process 154189119488
 log.gflow.nn16.txt:10:[0]PETSC ERROR: Memory requested 330579643392

Above, the log file with nn08 means 8 nodes (320 cores) and nn16 means 16 nodes. As you can see, GFlow invokes a function from the PETSc library which requests 307 GB of memory independent of the total number of MPI ranks (cores). So this can point to a fundamental problem in the scalability of GFlow.

Do you know of anyone who has used GFlow on hundreds of compute cores successfully? Is its scalability/performance proven on large HPC clusters?

// Fatih

On Jun 6, 2018, at 8:35 PM, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Hi Fatih, Marcelo and team,

Please have a look at the attached files. I am still getting the memory problem:

 [0]PETSC ERROR: ------------------------------------------------------------------------
 [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
 [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
 [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
 [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
 [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
 [0]PETSC ERROR: to get more information on the crash.
 [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
 [0]PETSC ERROR: Signal received
 [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
 [0]PETSC ERROR: Petsc Release Version 3.8.4, Mar, 24, 2018
 [0]PETSC ERROR: ./gflow.x on a arch-linux2-c-opt named nia0960.scinet.local by hosscca Wed Jun 6 17:48:04 2018
 [0]PETSC ERROR: Configure options --prefix=/scinet/niagara/software/2018a/opt/intel-2018.2-intelmpi-2018.2/petsc/3.8.4 CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc COPTFLAGS="-march=native -O3" CXXOPTFLAGS="-march=native -O3" FOPTFLAGS="-march=native -O3" --download-chaco=1 --download-hypre=1 --download-metis=1 --download-ml=1 --download-mumps=1 --download-parmetis=1 --download-plapack=1 --download-prometheus=1 --download-ptscotch=1 --download-scotch=1 --download-sprng=1 --download-superlu=1 --download-superlu_dist=1 --download-triangle=1 --with-blaslapack-dir=/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl --with-debugging=0 --with-mkl_pardiso-dir=/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl --with-scalapack=1 --with-scalapack-lib="[/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]" --with-x=0
 [0]PETSC ERROR: #1 User provided function() line 0 in unknown file
 application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0

Also, I am not sure what the first line in the output file means:

 does not exist
 /scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mpi/intel64/bin/mpiexec

I changed my bash file "gflow_bash_new.sh" as I discussed with Fatih and: 1) requested more computational power; 2) removed the -n flag from the gflow bash file "excute_Hindex_new.sh".

Please advise…

Thank you so much for all the help and support, Hossam

From: fertinaz@scinet.utoronto.ca Sent: Tuesday, June 5, 2018 10:35 PM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca> Cc: mponce@scinet.utoronto.ca; Daniel Gruner <dgruner@scinet.utoronto.ca>; support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

It highly depends on the number of nodes and the number of jobs in the queue.

It could be normal.

On Jun 5, 2018, at 10:30 PM, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Hi Fatih,

Thanks a lot for your help. I followed your directions and submitted my job around 4:00 pm. I modified the time to 4 hours, but my job hasn't been scheduled yet! Is this normal?

Thank you again Hossam

Sent from my iPhone

On Jun 5, 2018, at 3:33 PM, Fatih Ertinaz <fertinaz@scinet.utoronto.ca> wrote:

Hossam

You are not using 20 cores.

You are requesting 20 nodes and 20 MPI tasks!

#SBATCH --nodes=20

#SBATCH --ntasks=20

But then in the script you execute: ./gflow.x

So it runs serially even though you allocate 20 nodes = 800 compute cores.

Also, you are loading two versions of MPI, which is unnecessary:

 module load openmpi/3.1.0
 module load intelmpi/2018.2

Copy and paste below to your script.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=80
#SBATCH --time=10:00:00
#SBATCH --job-name your_gflow_mpi_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=hossam.hafez@utoronto.ca
#SBATCH --output=mpioutput%j.txt

cd $SLURM_SUBMIT_DIR

module load intel/2018.2 intelmpi/2018.2
module load petsc/3.8.4

mpirun ./gflow.x

Use this one.

Change time according to your estimation. If 10 hours is not accurate reduce it to avoid longer queue time.

On Jun 5, 2018, at 3:18 PM, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Fatih,

Thank you for the prompt reply. Actually, this run is much smaller than the one that required ~330GB. I used 20 nodes for this one and am still getting the memory issue. It was almost done, as I explained in my previous email. Should I increase the number of requested nodes? I am attaching my job submission file.

Thank you again,

Hossam

From: fertinaz@scinet.utoronto.ca Sent: Tuesday, June 5, 2018 3:09 PM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca> Cc: mponce@scinet.utoronto.ca; Daniel Gruner <dgruner@scinet.utoronto.ca>; support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

Hossam

This is probably due to the “run out of memory” you had earlier.

See this line: [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range

Recall that each node has 40 physical cores and 202 GB memory.

Change the job script according to your needs and do not use 1 node, because you will run out of memory; you need at least 307GB as far as I can remember.

Request 2 nodes for instance, so that your application will be able to use 404 GB memory.

The more nodes you allocate, the more resources you will have - cores and memory. On the other hand, your queue time may increase.

// Fatih

On Jun 5, 2018, at 3:03 PM, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Hi Marcelo and team,

Thank you for the valuable advice. I know this information about the SBATCH command. I ran a job that I know works on our UTM temporary node and that I have results for. I ran this same job on Niagara; everything looked fine and was working perfectly, and it was almost done. However, the job stopped because of the following error messages. I don't know how to better overcome the memory issue (I am also attaching the output logfile). Could you please help with that? I apologize for sending many emails, but I am new to Niagara and I am fighting a deadline on the 15th. All your help and support are very much appreciated.

Best, … Hossam

 [0]PETSC ERROR: ------------------------------------------------------------------------
 [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
 [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
 [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
 [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
 [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
 [0]PETSC ERROR: to get more information on the crash.
 [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
 [0]PETSC ERROR: Signal received
 [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
 [0]PETSC ERROR: Petsc Release Version 3.8.4, Mar, 24, 2018
 [0]PETSC ERROR: ./gflow.x on a arch-linux2-c-opt named nia0447.scinet.local by hosscca Tue Jun 5 03:44:51 2018
 [0]PETSC ERROR: Configure options --prefix=/scinet/niagara/software/2018a/opt/intel-2018.2-intelmpi-2018.2/petsc/3.8.4 CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc COPTFLAGS="-march=native -O3" CXXOPTFLAGS="-march=native -O3" FOPTFLAGS="-march=native -O3" --download-chaco=1 --download-hypre=1 --download-metis=1 --download-ml=1 --download-mumps=1 --download-parmetis=1 --download-plapack=1 --download-prometheus=1 --download-ptscotch=1 --download-scotch=1 --download-sprng=1 --download-superlu=1 --download-superlu_dist=1 --download-triangle=1 --with-blaslapack-dir=/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl --with-debugging=0 --with-mkl_pardiso-dir=/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl --with-scalapack=1 --with-scalapack-lib="[/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/gpfs/fs1/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]" --with-x=0
 [0]PETSC ERROR: #1 User provided function() line 0 in unknown file
 application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0

From: mponce@scinet.utoronto.ca Sent: Tuesday, June 5, 2018 12:22 PM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca>; Fatih Ertinaz <fertinaz@scinet.utoronto.ca> Cc: Daniel Gruner <dgruner@scinet.utoronto.ca>; support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

Hello Hossam,

First of all, notice that the SBATCH keyword should always be preceded by # so it can be recognized by the scheduler as an instruction, so all of them should read #SBATCH ...

Perhaps the best way for you to understand this is to take a look at the documentation about Niagara,

https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart

and this brief intro video,

https://support.scinet.utoronto.ca/education/go.php/396/content.php/cid/1429/

as many of these issues are addressed there,

Each node on Niagara has 40 cores, and you can oversubscribe them up to 80 logical cores. On Niagara we schedule resources by node, meaning that the minimum unit you can request for one of your jobs is a node, hence 40 physical cores!

#SBATCH --nodes= specifies how many nodes you want to use for your job.

#SBATCH --ntasks= specifies how many MPI processes you want to run; if you specify this parameter then you can just call your program like

mpirun $HOME/GFlow/gflow.x ....

i.e. without explicitly stating -n ...

Another thing you may want to test is your performance on the node. As you can imagine, having 40 physical cores that can actually be used as 80 is a lot of resources, so we ideally like users to utilize all of these cores within the node. So this is what I'd recommend you do:

  • take a fixed problem that you already know the solution of, just a test case
  • run it on 1 node with 1 core
  • run it on 1 node with 2, 4, 8, 20 and 40 cores

Compare the results and the times it took gflow to finish. If everything goes well you should always obtain the "same" result for your problem, and the solve time should decrease proportionally to the number of cores (assuming the problem is big enough to give that many cores enough work). A rough sketch of such a scaling run is shown after this message.

Regards, Marcelo
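
A rough illustration of that scaling test (an editorial sketch, not from this thread's scripts; the small test inputs small_resistance.asc / small_nodes.txt and an interactive session on a single node are assumptions):

 for NP in 1 2 4 8 20 40; do
   SECONDS=0
   mpirun -n $NP ./gflow.x \
       -habitat small_resistance.asc \
       -nodes small_nodes.txt \
       -converge_at 3N \
       -effective_resistance ./R_eff_np${NP}.csv
   echo "np=$NP walltime: $SECONDS seconds"
 done

If the wall times do not shrink roughly in proportion to NP, the test case is probably too small to keep that many cores busy.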

On 06/05/2018 12:11 AM, Hossam Abdel Hafez wrote:

Thank you so much, Marcelo and Fatih, for your continuous support and prompt replies. Fatih, regarding your question: I actually haven't run Gflow within a job-submission scheduler system before. I ran R jobs before on calculon at UTM. I used to run Gflow on a temporary node using just its executable bash file (in this case excute_Hindex_new.sh). I need to allocate the resources that will allow my calculations to run on Niagara using the bash script gflow_bash.sh that I attached previously. This is the reason why I am trying to run it on Niagara, as the resources on the UTM Tempnode are limited. Could you please help with that? For example, what would be the best setup in my case regarding:

#SBATCH --ntasks=?

#SBATCH --nodes=?

and whether I need to keep the -n flag in mpiexec -n 4 ./gflow.x \

Thank you very much,

Hossam

On Jun 4, 2018, at 10:47 PM, Fatih Ertinaz <fertinaz@scinet.utoronto.ca> wrote:

Hello Hossam

Further to Marcelo’s comments,

Your gflow_bash.sh script runs ./gflow.x on a single processor, and then right after that line you execute the execute_Hindex_new.sh script, which also calls the same ./gflow.x, but this time using 4 cores.

What exactly do you need to run? Maybe we can build your script together if you clarify what you want to achieve.

// Fatih

On Jun 4, 2018, at 10:43 PM, Marcelo Ponce <mponce@scinet.utoronto.ca> wrote:

Hi Hossam,

You need to include all the modules you used for compiling Gflow. In this case, as the error indicates at the very beginning of the error log, petsc is missing; basically you need to add one more line with the module load command, in this case:

module load petsc/3.8.4

Yes, you need to run on, or more concretely save your results to, $SCRATCH when submitting jobs, as $HOME is read-only when running jobs. You can still have Gflow installed in and run from $HOME, but the results, like any other files generated, should be saved on $SCRATCH.

Regards, Marcelo

On Mon, 4 Jun 2018, Hossam Abdel Hafez wrote: Dear Marcelo and support team,

Please have a look at the attached file for my run. I am able to submit jobs now, but they don't run properly. Could you please guide me on how to fix that? I installed Gflow locally in my scratch folder now, because I was getting error messages that the folder I was using in the home directory is read-only! I have overcome this issue by installing Gflow on my scratch. I am attaching the output log file (I think the top part explains the problem), the bash script for the scheduler, and the Gflow script.

Thank you so much for the continuous help and support.

Hossam


Hossam M. A. Abdel Moniem, Ph.D.

Postdoctoral Research Associate Department of Biology University of Toronto – Mississauga (UTM) 3359 Mississauga Road Mississauga, ON, Canada L5L 1C6


-----Original Message----- From: Marcelo Ponce <mponce@scinet.utoronto.ca> Sent: Saturday, June 2, 2018 6:12 PM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca> Cc: 'fertinaz@scinet.utoronto.ca' <fertinaz@scinet.utoronto.ca>; 'Daniel Gruner' <dgruner@scinet.utoronto.ca>; 'support@scinet.utoronto.ca' <support@scinet.utoronto.ca> Subject: RE: [SciNet-support] installing software

Hello Hossam,

1) because you installed GFlow in your homedir, you can just call it like, ~/GFlow/gflow.x

2) yes, that is what you need, as so far the submission script is requesting resources but not executing anything.

A couple of observations regarding your submission script:

  • on Niagara, we schedule whole nodes per job, so the line --mem=40gb does not apply; you will get all the memory of each node in your job, i.e. 192gb
  • the line --cpus-per-task is useful when you are using OpenMP, and it appears to me that gflow only uses MPI; is that correct? If that is the case, you just don't need it.
  • to be honest with you, I'm not sure whether the --requeue flag will work, but you can leave it if you want...

Alternatively, for a first try you can request a debugjob with one node and run your script there to see how it goes, and then submit the script as a proper job.

Regards, Marcelo

On Fri, 1 Jun 2018, Hossam Abdel Hafez wrote:

Hello again,

Could you please check and correct the attached bash script for the job I am trying to run? The things that I am not sure about now are:

1- How to call Gflow when it is not a module? I installed Gflow on my home directory

2- Can I add a line at the end of this bash file that is: sh excute_Hindex.sh, then run the whole thing as:

sbatch gflow_bash.sh

Thank you very much,

Hossam

From: fertinaz@scinet.utoronto.ca Sent: Friday, June 1, 2018 9:05 AM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca> Cc: Daniel Gruner <dgruner@scinet.utoronto.ca>; Marcelo Ponce <mponce@scinet.utoronto.ca>; support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

Hossam

Find the line in the larger screenshot:

Memory requested 330579643392

which is around 307 GB.

I checked your script; it can run only on the login node, on 20 cores, because you are not requesting any compute nodes. Please submit your jobs to the queue and distribute your case among multiple nodes so that it can fit into the 202GB memory of each node.

Please read documentation carefully:

https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Submitting_jobs

  1. Prepare a job script, examples can be found in the link above

  2. Submit your job: sbatch yourjobscript.sh

// Fatih

On 1 Jun 2018, at 04:08, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Dear SciNet team,

First, thank you again for the outstanding support and effort. Second, please have a look at the attached screenshots with the error messages I got after running Gflow, and please advise how to fix them. I successfully uploaded my files to the scratch folder as directed. I am also attaching the execution .sh file that I am running; the important piece is at the end of the file. Please advise whether I am specifying the paths for the input and output files correctly.

Thank you again for your support,

Hossam

From: Hossam Abdel Hafez Sent: Thursday, May 31, 2018 6:29 PM To: Daniel Gruner <dgruner@scinet.utoronto.ca>; Marcelo Ponce <mponce@scinet.utoronto.ca>; fertinaz@scinet.utoronto.ca Cc: support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

Thank you all for the follow-up and the prompt answers.

Hossam

————————————————

Hossam Abdel Moniem, Ph.D.

Postdoctoral Research Associate

Department of Biology

University of Toronto – Mississauga (UTM)

3359 Mississauga Rd.

Mississauga, ON, Canada L5L 1J7

From: <dgruner@scinet.utoronto.ca> on behalf of Daniel Gruner <dgruner@scinet.utoronto.ca> Date: Thursday, May 31, 2018 at 4:27 PM To: Marcelo Ponce <mponce@scinet.utoronto.ca> Cc: Daniel Gruner <dgruner@scinet.utoronto.ca>, Hossam Abdel Hafez <hossam.hafez@utoronto.ca>, support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

Marcelo,

It would be /gpfs/fs0/scratch/w/wagnerh1/hosscca

(not fs1).

Danny

On May 31, 2018, at 4:25 PM, Marcelo Ponce <mponce@scinet.utoronto.ca> wrote:

Hello Hossam,

As Fatih mentioned to you before, the problem is that your quota (i.e. your allowed space in HOME) is less than that! Hence you need to target a different destination: instead of HOME, copy the file into SCRATCH, i.e. in the destination path change "home" to "scratch": "/gpfs/fs1/scratch/w/wagnerh1/hosscca". Also, please read carefully the link Fatih sent you earlier; one key point is that your file is ASCII, so you could in principle compress it before moving it, which has the advantage of significantly reducing the size of the file to transfer, and then uncompress it on the Niagara cluster.

Regards, Marcelo

On 05/31/2018 03:28 PM, Hossam Abdel Hafez wrote:

Dear Scinet team,

I am still unable to move the data I want from my desktop to my SciNet folder on Niagara! I am trying to upload a file that is 200 GB. I used to use WinSCP to upload my data when I was working with the hpctempnode1 node. I can still use WinSCP to navigate between my desktop and my Niagara folder; however, I couldn't locate my SCRATCH folder as mentioned in the response below. Please explain in a stepwise way (please see Marcelo's response below) how to do so. Attached is a screenshot of my WinSCP.

Thank you in advance,

Hossam

From: fertinaz@scinet.utoronto.ca Sent: Thursday, May 31, 2018 12:33 AM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca> Cc: support@scinet.utoronto.ca; Helene Wagner <helene.wagner@utoronto.ca> Subject: Re: [SciNet-support] installing software

Sorry, please use this link:

https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart#Moving_data

// Fatih

On 31 May 2018, at 00:26, Fatih Ertinaz <fertinaz@scinet.utoronto.ca> wrote:

Hossam

Try to upload that to your $SCRATCH folder. Also read the following link for hints about data transfers:

https://docs.computecanada.ca/wiki/Niagara_Quickstart#Moving_data

// Fatih

On 30 May 2018, at 23:52, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Dear Scinet team,

Thanks very much for the instructions for installing Gflow under my home directory! However, I am trying to upload the file that I'll be working with, which is about 200 GB, and I am getting an error message, probably because of the user quota. I really need to get this work done as I am fighting a deadline on the 18th. Please advise.

Thank you again and looking forward to hearing back from you soon,

Regards,

Hossam

From: <mponce@scinet.utoronto.ca> on behalf of Marcelo Ponce <mponce@scinet.utoronto.ca> Date: Monday, May 28, 2018 at 4:02 PM To: Hossam Abdel Hafez <hossam.hafez@utoronto.ca>, support@scinet.utoronto.ca Subject: Re: [SciNet-support] installing software

Hello Hossam,

1) It is pretty straightforward to install GFlow in your local directory. Just follow these steps: i) clone the repo in your home-dir

ii) load the following modules on niagara: intel/2018.2 intelmpi/2018.2 petsc/3.8.4

iii) cd into the GFlow directory and edit the makefile, so that the line

PETSC_DIR=/usr/local/Cellar/petsc/3.7.3/real

now reads

PETSC_DIR=${SCINET_PETSC_ROOT}

iv) then just type make and it will generate the executable in that directory (the steps are consolidated in the sketch below).
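
A consolidated sketch of steps i)-iv) (an editorial illustration, not part of the original email; it assumes the makefile is named Makefile and contains the PETSC_DIR line quoted above):

 git clone https://github.com/gflow/GFlow.git ~/GFlow
 module load intel/2018.2 intelmpi/2018.2 petsc/3.8.4
 cd ~/GFlow
 sed -i 's|^PETSC_DIR=.*|PETSC_DIR=${SCINET_PETSC_ROOT}|' Makefile   # point the makefile at Niagara's PETSc
 make                                                                # builds gflow.x in this directory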

WRT the R packages: we don't install any packages locally, but you can install them in your own library using the usual R install.packages() command. I'd recommend you just try installing them, as you would on your own computer, using a local library.

Please notice that you will need to submit jobs when running your programs for production on Niagara, so if you are not familiar with those, I would recommend you take a quick look at the following documentation:

https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart

https://support.scinet.utoronto.ca/education/go.php/396/content.php/cid/1429/

https://support.scinet.utoronto.ca/education/go.php/396/content.php/cid/1428/

Let us know if you have any questions.

Regards, Marcelo

On 05/28/2018 03:40 PM, Hossam Abdel Hafez wrote:

Dear support team,

Could you please help me install the following software to use under my account (hosscca)?

  1. Gflow: https://github.com/gflow/GFlow (this one is a priority please)
  2. I know R is installed; however, I need to know whether these libraries are available, because I need them (please disregard the versions):

Package Version

unixtools

doParallel 1.0.10

iterators 1.0.8

maptools 0.9-2

MicrosoftR 3.4.1.0077

png 0.1-7

SDMTools

raster 2.5-8

rgdal 1.2-8

rgeos 0.3-23

RUnit 0.4.26

sp 1.2-5

Kindest regards,

Hossam


Hossam M. A. Abdel Moniem, Ph.D.

Postdoctoral Research Associate

Department of Biology

University of Toronto – Mississauga (UTM)

3359 Mississauga Road Mississauga, ON, Canada L5L 1C6







--

Dr. Daniel Gruner <dgruner@scinet.utoronto.ca>, Chief Technical Officer, phone: (416)-978-2775, SciNet High Performance Computing Consortium (www.scinethpc.ca) / Compute/Calcul Canada (www.computecanada.ca)


HossamGhub commented 6 years ago

@eduffy @pbleonard @RMarrec

Hi Edward and Paul code_and_output_error.zip

Please see the thread quoted in my comment above regarding the memory problem I have with Gflow on Niagara-SciNet; the code and output are in the attached zip. Please advise…

Thank you very much, and sorry to bother you, Hossam

Did you share your log files with the developers of GFlow? Could you try a smaller simulation on Niagara?

I want to share this which is probably the reason of the error. Please follow carefully.

This is the log file from one of the simulations: fertinaz@nia-login06:/scratch/s/scinet/fertinaz/GFlow$ more log.gflow.nn08.txt Thu Jun 7 00:24:12 2018 >> Effective resistance will be written to /scratch/s/scinet/fertinaz/GFlow/R_eff_FUTW_10.csv. Thu Jun 7 00:24:12 2018 >> Simulation will converge at 0.999 Thu Jun 7 00:29:33 2018 >> (rows,cols) = (187884,109968) [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: Out of memory. This could be due to allocating

Right before code fails, it prints three lines as you can see Thu Jun 7 00:24:12 2018 >> Effective resistance will be written to /scratch/s/scinet/fertinaz/GFlow/R_eff_FUTW_10.csv. Thu Jun 7 00:24:12 2018 >> Simulation will converge at 0.999 Thu Jun 7 00:29:33 2018 >> (rows,cols) = (187884,109968)

When you check those messages in the source code: fertinaz@nia-login06:~/GFlow$ grep -rn "Effective resistance will be written" * gflow.c:130: message("Effective resistance will be written to %s.\n", reff_path);

which leads to: if(strlen(reff_path) > 0) { message("Effective resistance will be written to %s.\n", reff_path); truncate(reff_path, 0); / Empty the file now, we'll have to repoen and append to it every iteration / } if(strlen(convergence) > 0) { char p; converge_at = strtod(convergence, &p); if(p[0] == 'N') converge_at = 1. - pow(10., -converge_at); if(converge_at < 0. || converge_at > 1.) { message("Error. Convergence factors must be between 0 and 1.\n"); MPI_Abort(MPI_COMM_WORLD, 1); } message("Simulation will converge at %lg\n", converge_at); } read_complete_solution(); / TODO: Need to remove this feature */ }

Code is executed successfully up to the last message function above. We know that because it is printed to the screen.

So it comes to the line at the bottom, read_complete_solution(); Now see that there is a comment next to this function which says TODO: need to remove this feature.

Also to check the content of that function: fertinaz@nia-login06:~/GFlow$ grep -rn "read_complete_solution()" gflow.c:144: read_complete_solution(); / TODO: Need to remove this feature */ output.c:358:void read_complete_solution() output.h:52:void read_complete_solution();

That function is implemented in output.c. See line 358 in that file:

```c
// I hope to delete this section ASAP
void read_complete_solution()
{
   char solfile[PATH_MAX] = { 0 };
   PetscBool flg;
   gzFile f;
   int count;

   PetscOptionsGetString(PETSC_NULL, NULL, "-complete_solution", solfile, PATH_MAX, &flg);
   if(flg) {
      message("Reading complete solution from %s\n", solfile);
      f = gzopen(solfile, "r");
      gzread(f, &count, sizeof(int));
      final_current = (float *)malloc(sizeof(float) * count);
      gzread(f, final_current, sizeof(float) * count);
      gzclose(f);
   }
}
```

As you can see, the function starts with the comment "I hope to delete this section ASAP".
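
Two hedged observations about that function (assumptions, since the failing allocation is not named in the PETSc error): it only runs its body when the -complete_solution option is given, so it may not be what failed here at all; but at this raster size it could not work anyway, because count is read into a 32-bit int and one float per cell is already enormous. A rough sketch of the arithmetic:

```c
/* Hedged sketch: assumes a "complete solution" would hold one float per raster cell.
 * Whether this path is what actually hit the PETSc out-of-memory error is unknown. */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    long long rows = 187884, cols = 109968;   /* (rows,cols) from the log above */
    long long cells = rows * cols;            /* ~2.07e10 */
    double gib = (double)cells * sizeof(float) / (1024.0 * 1024.0 * 1024.0);

    printf("one float per cell    ~ %.0f GiB per rank\n", gib);   /* ~77 GiB */
    printf("fits in a 32-bit int? %s\n",
           cells <= INT_MAX ? "yes" : "no, so `int count` would overflow");
    return 0;
}
```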

So, I hope the developer of GFlow will delete that section before your deadline.

Please contact them, send your log files, and tell them how you installed GFlow as well. You can copy this email too. Also mention that your input file is a huge one. Hopefully they can offer some suggestions.

Hope this helps

// Fatih

eduffy commented 6 years ago

Hossam - The only thing that looks strange in your bash script is that you're sending your output to the "/" directory. While that will probably cause problems, that doesn't account for the "out of memory" error. Does your cluster allow for guest accounts? It'll be quicker and easier if I can log in and try to run this myself, instead of going through this.


HossamGhub commented 6 years ago

Hi Fatih and Edward,

Thank you so much, both! I really appreciate all your help and support in this process. Edward, please see the message from Fatih below. Please CC me as well, as I would like to learn what went wrong and how to handle this independently in the future.

Best, Hossam

On Jun 12, 2018, at 10:17 PM, Fatih Ertinaz <fertinaz@scinet.utoronto.ca> wrote:

Hossam

Sorry, I don't think we can create guest accounts; however, I will contact the developer myself tomorrow.

// Fatih

On Jun 12, 2018, at 8:53 PM, Hossam Abdel Hafez <hossam.hafez@utoronto.ca> wrote:

Hi Fatih and team,

Please see the message below from the GFlow developers. Can we do that?

Thanks, Hossam


Begin forwarded message:

From: Edward Duffy <notifications@github.com>
Date: June 12, 2018 at 8:45:45 PM EDT
To: gflow/GFlow <GFlow@noreply.github.com>
Cc: HossamGhub <hossamesapres@gmail.com>, Comment <comment@noreply.github.com>
Subject: Re: [gflow/GFlow] Error message: invalid points (#15)
Reply-To: gflow/GFlow <reply@reply.github.com>

Hossam - The only thing that looks strange in your bash script is that you're sending your output to the "/" directory. While that will probably cause problems, that doesn't account for the "out of memory" error. Does your cluster allow for guest accounts? It'll be quicker and easier if I can log in and try to run this myself, instead of going through this.
