deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

error in hse calculation #4029

Closed floatingCatty closed 6 months ago

floatingCatty commented 6 months ago

Describe the bug

When running an HSE SCF calculation on a GaSbN system, ABACUS repeatedly raises this error: image

Expected behavior

The calculation should run normally.

To Reproduce

GaSbN96_hse.zip

Environment

No response

Additional Context

No response

PeizeLin commented 6 months ago

The system works well in my environment. Could you add -DDEBUG_INFO=ON when running cmake, rebuild and rerun ABACUS, and then check running_scf.log to find out which function raises the error?
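A minimal sketch of what that request amounts to, assuming the source tree is at /root/abacus-develop (the path mentioned later in this thread), a build directory named build (hypothetical), and the default output directory OUT.ABACUS shown in the log. Any other cmake options your existing build relies on would need to be repeated here.

```shell
# Reconfigure with the debug-info switch requested above, then rebuild.
cd /root/abacus-develop
cmake -B build -DDEBUG_INFO=ON
cmake --build build -j"$(nproc)"

# After rerunning the failing job, the last lines of the log should show
# which function was entered before the crash.
tail -n 50 OUT.ABACUS/running_scf.log
```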

floatingCatty commented 6 months ago

Hello, I tried to build it according to your advice (image), but the build reports that PkgConfig is missing. I am not sure whether this is expected. I am running on a Bohrium machine with the image registry.dp.tech/dptech/abacus:3.6.1, on an Ali machine of type c64_m256_cpu. I hope this helps you reproduce the error.

floatingCatty commented 6 months ago

This is the running_scf.log generated during the failed task: running_scf.log

WHUweiqingzhou commented 6 months ago

@floatingCatty and @PeizeLin,

I added -DDEBUG_INFO=ON and compiled successfully. @floatingCatty, you can refer to /root/abacus-develop/Dockerfile.intel to do this.

@PeizeLin I have reproduced the error on an Ali machine with c32_m64_cpu:

OMP_NUM_THREADS=32 mpirun -np 1 abacus |tee log

                              ABACUS v3.6.1

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: fcc381479 (Fri Apr 19 15:00:06 2024 +0800)

 Thu Apr 25 13:35:52 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : CPU / Intel(R) Xeon(R) Platinum
 dft_functional readin is: hse
 dft_functional in pseudopot file is: PBE
 Please make sure this is what you need
 dft_functional readin is: hse
 dft_functional in pseudopot file is: PBE
 Please make sure this is what you need
 dft_functional readin is: hse
 dft_functional in pseudopot file is: PBE
 Please make sure this is what you need

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Warning: the number of valence electrons in pseudopotential > 3 for Ga: [Ar] 3d10 4s2 4p1
 Warning: the number of valence electrons in pseudopotential > 5 for Sb: [Kr] 4d10 5s2 5p3
 Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
 If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 UNIFORM GRID DIM        : 180 * 192 * 180
 UNIFORM GRID DIM(BIG)   : 45 * 48 * 45
 DONE(1.24341    SEC) : SETUP UNITCELL
 DONE(1.24544    SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  NBASE       
 1       42              1           1836        
 ---------------------------------------------------------
 Use Systematically Improvable Atomic bases
 ---------------------------------------------------------
 ELEMENT ORBITALS        NBASE       NATOM       XC          
 Ga      2s2p2d1f-8au    25          48          
 N       2s2p1d-7au      13          47          
 Sb      2s2p2d1f-7au    25          1           
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(2.5721     SEC) : INIT PLANEWAVE
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
terminate called after throwing an instance of 'cereal::Exception'
  what():  Failed to read 8 bytes from input stream! Read 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 25778 RUNNING AT bohrium-13341-1125055
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

You can check the result in the attachment:

GaSbN96_hse_debug.zip

floatingCatty commented 6 months ago

Thanks for the help @WHUweiqingzhou .

floatingCatty commented 6 months ago

@PeizeLin Hello, is there any progress on this?

I have tried computing the same system on the Shuguang platform, and almost the same error was raised: image

floatingCatty commented 6 months ago

Hello, I have tried using machines with 512 GB of memory, but a similar error was raised. image image

As extra information, I also tried the conventional cell with the same settings, and it ran successfully. So could this come from a memory leak?

maki49 commented 6 months ago

@PeizeLin FYI I reproduced this error on my workstation (with 1 process and 32 threads) and located it here in LibComm: https://github.com/abacusmodeling/LibComm/blob/ec984514b44480e98bd1578bcacca7a19c849724/include/Comm/Comm_Trans/Comm_Trans.hpp#L215

                iar(key, value);

Then I set KPT to 1 1 1 without kspacing, and another error occurred here in isend_data (before recv_data): https://github.com/abacusmodeling/LibComm/blob/ec984514b44480e98bd1578bcacca7a19c849724/include/Comm/Comm_Trans/Comm_Trans.hpp#L139

MPI_CHECK (MPI_Isend (str_isend.c_str(), str_isend.size(), MPI_CHAR, rank_isend, Comm_Trans::tag_data, this->mpi_comm, &request_isend));
Abort(604617474) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(160): MPI_Isend(buf=0x7f4e10e18010, count=-1625393416, MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD, request=0x205eaf9e0) failed
PMPI_Isend(91).: Negative count, value is -1625393416

which seems to be the same problem as #2934.

Could this error in recv_data result from the earlier out-of-bounds in isend_data? I updated LibComm to the newest version of the master branch (where #2934 has been fixed), and the cereal::Exception error changed to a segmentation fault here: https://github.com/abacusmodeling/LibComm/blob/9a127aadbd2575ff2c9789d0dda3677759df2a48/include/Comm/Comm_Trans/Comm_Trans.hpp#L87
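A quick sanity check on the negative count in the error above, sketched in bash arithmetic. MPI_Isend takes its count as a C int, so a buffer length above 2147483647 wraps when narrowed from size_t. Assuming a single 32-bit wrap (an assumption; the log alone cannot confirm it), the reported count corresponds to a buffer of 2669573880 bytes, roughly 2.5 GiB. The helper name to_int32 is hypothetical and only illustrates the narrowing.

```shell
#!/usr/bin/env bash
# Reinterpret an unsigned byte count as a signed 32-bit integer,
# mimicking how a size_t length narrows to MPI's int `count` argument.
to_int32() {
  local n=$(( $1 & 0xFFFFFFFF ))          # keep the low 32 bits
  if (( n >= 0x80000000 )); then          # sign bit set: value is negative
    n=$(( n - 0x100000000 ))
  fi
  echo "$n"
}

# Smallest buffer size consistent with the reported count (single wrap assumed).
size=$(( 4294967296 - 1625393416 ))
echo "$size"        # prints 2669573880
to_int32 "$size"    # prints -1625393416
```

This is consistent with the observation that the smaller conventional cell runs fine: its serialized buffer presumably stays below the 2147483647-byte limit.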

MPI_CHECK (MPI_Improbe(MPI_ANY_SOURCE, this->tag_data, this->mpi_comm, &flag_iprobe, &message_recv, &status_recv));

@floatingCatty Maybe you can try the 512 GB memory machines again after updating LibComm?
(Compile with -DLIBRI_DIR=.../LibRI -DLIBCOMM_DIR=.../LibComm -DGIT_SUBMODULE=OFF to avoid the automatic checkout.)

PeizeLin commented 6 months ago

@floatingCatty Given the results @maki49 has shown, which versions of ABACUS, LibComm and Cereal are you using? That may help us find the bug.

floatingCatty commented 6 months ago

@PeizeLin I am using ABACUS 3.6.1, but I am not sure how to check the LibComm and Cereal versions.

When I try to update LibComm as suggested by @maki49 (or just rebuild ABACUS without updating LibComm), the build fails consistently. I am working on a Bohrium machine, and @WHUweiqingzhou found that the compilation failure might come from the ABACUS image having been built with the wrong Dockerfile. The new image hasn't been produced yet; when I contacted the person in charge, they said the compilation ran into some package dependency problems.

So for now, I cannot build ABACUS with an updated LibComm on Bohrium. I hope the new image will be available soon, and I will update LibComm then.

Thanks for the help.

floatingCatty commented 6 months ago

I tried the new image compiled with Dockerfile.intel, and it showed a similar error when I built ABACUS: image

I am not sure whether this build of ABACUS was successful, but afterwards I ran the same task on a c64m512 machine, and it raised this error: image

maki49 commented 6 months ago

To check the LibComm version:

cd abacus-develop/deps/LibComm
git log -1

and check the first line, which shows the hash of the commit currently checked out.

WHUweiqingzhou commented 6 months ago

@floatingCatty, thanks to @maki49, I tried the latest LibComm and found that this error is indeed fixed; the calculation finished a full iteration:

---------------------------------------------------------
 Use Systematically Improvable Atomic bases
 ---------------------------------------------------------
 ELEMENT ORBITALS        NBASE       NATOM       XC          
 Ga      2s2p2d1f-8au    25          48          
 N       2s2p1d-7au      13          47          
 Sb      2s2p2d1f-7au    25          1           
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(2.70033    SEC) : INIT PLANEWAVE
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(745.509    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 GE1    -1.142658e+05  0.000000e+00   1.343e-01  6.963e+01  

Please add -DLIBRI_DIR=/root/abacus-develop/deps/LibRI -DLIBCOMM_DIR=/root/abacus-develop/deps/LibComm -DGIT_SUBMODULE=OFF to the cmake options; otherwise LibComm will automatically be checked out to the old version.
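Put together, the update could look like the sketch below. It assumes the LibComm remote is named origin, that its fixed code is on the master branch (as stated earlier in this thread), and that the build directory is called build (hypothetical). Any other options your existing build uses, for example those in Dockerfile.intel, still need to be passed alongside these three.

```shell
cd /root/abacus-develop

# Move the bundled LibComm to the newest master commit and confirm which
# commit is checked out.
git -C deps/LibComm fetch origin
git -C deps/LibComm checkout origin/master
git -C deps/LibComm log -1 --oneline

# Point the build at the local LibRI/LibComm copies and disable the
# automatic submodule checkout, so the update is not reverted.
cmake -B build \
  -DLIBRI_DIR=/root/abacus-develop/deps/LibRI \
  -DLIBCOMM_DIR=/root/abacus-develop/deps/LibComm \
  -DGIT_SUBMODULE=OFF
cmake --build build -j"$(nproc)"
```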

floatingCatty commented 6 months ago

Hello all, @WHUweiqingzhou @maki49 @PeizeLin: I have successfully updated LibComm with the careful assistance of @WHUweiqingzhou and computed the system again on a machine with 512 GB of memory. It works well. Thank you again for the help.