SPARC-X / SPARC

Simulation Package for Ab-initio Real-space Calculations
GNU General Public License v3.0

Segmentation fault after running the compiled sparc-x on anvil and rockfish cluster #182

Closed Jiang-eat-sugar closed 1 year ago

Jiang-eat-sugar commented 1 year ago

Hi,

I compiled SPARC-X on the Anvil and Rockfish clusters without any error messages, but the segmentation fault shown below appears when I run the compiled binary. Does anyone have a clue about this error? I know that a segmentation fault means the program accessed a memory location it was not allowed to access, which might point to a bug in the code or to a problem in my setup, so I have attached my job file here.

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e8)

Below is my job file:

#!/bin/bash
#SBATCH -A dmr160007 # Allocation name
#SBATCH --nodes=1 # Total # of nodes
#SBATCH --ntasks=128 # Total # of MPI tasks
#SBATCH --time=00:30:00 # Total run time limit (hh:mm:ss)
#SBATCH -J sparc # Job name
#SBATCH -p wholenode # Queue (partition) name

module --force purge

cd $SLURM_SUBMIT_DIR
module reset
module load intel-mkl/2019.5.281
echo $PWD
mpirun -np 128 ./sparc -name BTS

Thanks, Jiang

YaphetS-jx commented 1 year ago

Hi Jiang,

Could you please share the input files BTS.inpt and BTS.ion with us so that we could look into it? Btw, please be sure to use the same modules for compilation and running.

Best, Xin

Jiang-eat-sugar commented 1 year ago

Hi Xin,

Below are the attached .inpt and .ion files from the two clusters. I have double-checked the modules, and the problem persists whether or not I load them.
BTS_anvil.zip BTS_rockfish.zip

Thanks, Jiang

Jiang-eat-sugar commented 1 year ago

Actually, the sparc-x job ran fine until it hit the segmentation fault. Below is more of the error output.

fchrg = 6.00000000 > 0.0 (icmod != 0)
This pseudopotential contains non-linear core correction.
fchrg = 6.000000, READING MODEL CORE CHARGE!
fchrg = 5.00000000 > 0.0 (icmod != 0)
This pseudopotential contains non-linear core correction.
fchrg = 5.000000, READING MODEL CORE CHARGE!
fchrg = 2.50000000 > 0.0 (icmod != 0)
This pseudopotential contains non-linear core correction.
fchrg = 2.500000, READING MODEL CORE CHARGE!
[c035:40165] Process received signal
[c035:40165] Signal: Segmentation fault (11)

phanish-suryanarayana commented 1 year ago

Can you also provide the output file?

On an unrelated note, unless done intentionally, I think your SCF tolerance is too strict. Note that the SCF tolerance in SPARC is defined in terms of the convergence of the electron density/potential. Typical values would be 1e-4 (standard accuracy) to 1e-6 (high accuracy).

Jiang-eat-sugar commented 1 year ago

Below are the output files from two clusters: Rockfish slurm-17406227.zip Anvil slurm-2165936.zip

I was using the same tolerance setting as in VASP so that I could compare the results. We need a tight setting for this structure.

YaphetS-jx commented 1 year ago

I ran your tests on our Hive cluster and they worked perfectly; no bug was encountered. Does SPARC generate a .out file for you, such as BTS_rockfish.out? If so, could you share it with us? One more thing to check is the memory requested for the job: I didn't see any memory request option in your job file. Make sure it is more than 20 GB for this large system.
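As a rough illustration of the memory check suggested above, a job script could request memory explicitly and echo what Slurm granted. This is only a sketch: the `32G` value is an assumed placeholder, the helper name `report_mem_request` is hypothetical, and whether a given partition accepts `--mem` depends on the cluster's configuration.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --mem=32G   # assumed value, chosen only to exceed the ~20 GB mentioned above

# Hypothetical helper: print the per-node memory value that Slurm exports
# when --mem is given, or "unset" if the variable is not defined.
report_mem_request() {
    echo "SLURM_MEM_PER_NODE=${SLURM_MEM_PER_NODE:-unset}"
}

report_mem_request
```

If the line prints `unset`, the job is running with the partition's default memory policy rather than an explicit request.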

Jiang-eat-sugar commented 1 year ago

I also compiled SPARC-X on Stampede (another supercomputer), where it works perfectly. No, the Slurm files are all the output that I have. It seems odd that no output files were written, and I suspect the error comes from an invalid memory access. The default memory is 256 GB. I will check whether this is what causes the error, but the SPARC-X job ran fine on Stampede, which also has 256 GB of memory.

Jiang-eat-sugar commented 1 year ago

I tried another supercomputer cluster, Expanse, which gives the same segmentation fault. I am attaching everything that was produced when I ran the SPARC job, except for a bunch of core.* files.
sparc-test.tar.gz

YaphetS-jx commented 1 year ago

Sorry, your log file contains no useful information for debugging. Since SPARC generated no .out file at all, I assume that it did not start successfully. I would suggest seeking help from the technical support staff for the supercomputer cluster.
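The "did SPARC start at all?" check can be scripted after a short run. The sketch below assumes the main output file is named `<name>.out` in the working directory, as with the BTS_rockfish.out example mentioned earlier in the thread; the function name `sparc_wrote_output` is hypothetical.

```shell
# Hypothetical helper: succeed only if a non-empty <name>.out exists
# in the current directory (assumed naming, see the note above).
sparc_wrote_output() {
    name="$1"
    [ -s "${name}.out" ]
}

# Example use after a small run (left commented so the sketch has no side effects):
# mpirun -np 1 ./sparc -name BTS
# if sparc_wrote_output BTS; then echo "SPARC wrote BTS.out"; else echo "no BTS.out found"; fi
```

A missing .out file after even a single-rank run would suggest looking at the build or the MPI environment rather than the input files.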

phanish-suryanarayana commented 1 year ago

Were you able to successfully run any of the tests provided with SPARC on these computer clusters? That would always be a good first step to check the installation.

Jiang-eat-sugar commented 1 year ago

Sorry, your log file contains no useful information for debugging. Since SPARC generated no .out file at all, I assume that it did not start successfully. I would suggest seeking help from the technical support staff for the supercomputer cluster.

Thanks! I am waiting for their help.

Do you know why the output file shows "This pseudopotential contains non-linear core correction."? I thought it indicated that sparc-x was reading the pseudopotential file.

Jiang-eat-sugar commented 1 year ago

Were you able to successfully run any of the tests provided with SPARC on these computer clusters? That would always be a good first step to check the installation.

I tried, and those test jobs also end with the segmentation fault.

phanish-suryanarayana commented 1 year ago

Then it is likely that the code has not been compiled successfully.

Jiang-eat-sugar commented 1 year ago

Take the Rockfish cluster as an example. I loaded the modules below:
module load intel-compilers/2022.0
module load intel-mpi/2021.5
module load intel-mkl/2022.0

Then I went to the src folder and ran "make clean; make". (Because I used the default MKL, I did not change anything in the makefile.) The compilation did not show any error messages.

Another example is Stampede, where I am able to run sparc-x jobs:
module load gcc/9.1.0
module load mkl/19.1.1

Then I did the same thing as on the Rockfish cluster. The difference between Rockfish and Stampede is the modules; intel-mkl/2022.0 is what Rockfish offers when I use the module spider command.
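Because the clusters differ mainly in their modules, it can help to confirm which MPI implementation the `mpirun` on the PATH actually belongs to, both at compile time and inside the job. The sketch below classifies the text printed by `mpirun --version`; the function name `classify_mpi` is hypothetical, and the string patterns are heuristics based on typical version banners, not a documented interface.

```shell
# Hypothetical helper: read `mpirun --version` output on stdin and print
# a best-guess MPI flavor. The patterns are assumptions about banner text.
classify_mpi() {
    text=$(cat)
    case "$text" in
        *"Open MPI"*)     echo openmpi ;;
        *"Intel(R) MPI"*) echo intelmpi ;;
        *HYDRA*|*MPICH*)  echo mpich ;;
        *)                echo unknown ;;
    esac
}

# Example use (commented out; requires an MPI installation):
# mpirun --version 2>&1 | classify_mpi
```

Running the same check in the build shell and at the top of the job script would show whether both steps see the same MPI.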

Jiang-eat-sugar commented 1 year ago

Thank you all for your help! I found that I was compiling with OpenMPI but linking against the Intel MPI variant of the MKL BLACS library in this line of the makefile:

LDLIBS += -Wl,-rpath=${MKLROOT}/lib/intel64,-no-as-needed -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -ldl

So I changed it to:

LDLIBS += -Wl,-rpath=-no-as-needed -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_openmpi_lp64 -lpthread -lm -ldl
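A mismatch of this kind can often be spotted from the shared libraries the binary resolves. The sketch below reads `ldd ./sparc` output on stdin and compares the MKL BLACS variant with the MPI library it finds. It assumes MKL is linked dynamically, and it uses a heuristic: the presence of `libopen-pal` or `libopen-rte` is taken to mean Open MPI, while a bare `libmpi.so` is taken to mean Intel MPI (or an ABI-compatible MPICH). The function name `check_blacs_mpi` is hypothetical.

```shell
# Hypothetical helper: compare the BLACS flavor with the MPI flavor seen
# in `ldd` output supplied on stdin. Library-name patterns are heuristics.
check_blacs_mpi() {
    libs=$(cat)
    blacs=unknown
    mpi=unknown
    case "$libs" in
        *libmkl_blacs_openmpi*)  blacs=openmpi ;;
        *libmkl_blacs_intelmpi*) blacs=intelmpi ;;
    esac
    case "$libs" in
        *libopen-pal*|*libopen-rte*) mpi=openmpi ;;   # Open MPI support libraries
        *libmpi.so*)                 mpi=intelmpi ;;  # assumed Intel MPI or MPICH ABI
    esac
    if [ "$blacs" = unknown ] || [ "$mpi" = unknown ]; then
        echo "unknown (blacs=$blacs mpi=$mpi)"
    elif [ "$blacs" = "$mpi" ]; then
        echo "match (blacs=$blacs mpi=$mpi)"
    else
        echo "MISMATCH (blacs=$blacs mpi=$mpi)"
    fi
}

# Example use (commented out; requires the compiled binary):
# ldd ./sparc | check_blacs_mpi
```

A `MISMATCH` result corresponds to the situation described above, where the BLACS layer was built for one MPI and the program runs under another.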