Closed floatingCatty closed 6 months ago
The system works well in my environment.
Could you add -DDEBUG_INFO=ON when running cmake and rerun abacus, to find out from running_scf.log which function raises the error?
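For reference, a minimal sketch of that debug rebuild. The function name debug_rebuild and the build directory name build-debug are assumptions made here for illustration; any other options your normal build needs still have to be passed, and they are not shown because the thread does not list them.

```shell
# Sketch only: "debug_rebuild" and the "build-debug" directory are illustrative names.
debug_rebuild() {
    src="$1"   # path to the abacus-develop source tree

    # Configure with the extra flag requested above, then compile.
    cmake -S "$src" -B "$src/build-debug" -DDEBUG_INFO=ON || return 1
    cmake --build "$src/build-debug" --parallel || return 1
}

# Usage (not executed here):
#   debug_rebuild /path/to/abacus-develop
#   OMP_NUM_THREADS=32 mpirun -np 1 abacus | tee log   # rerun the failing job with the new binary
#   tail -n 50 OUT.ABACUS/running_scf.log              # inspect the end of the log
```

The function stops at the first failing step, so a configure error such as a missing package is not hidden by a later build attempt.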
Hello, I tried to build it according to your advice, but the build reports that PkgConfig is missing. I am not sure if this is correct. I am running this on a Bohrium machine with the image registry.dp.tech/dptech/abacus:3.6.1, on an ali machine of type c64_m256_cpu. I hope this helps you reproduce the error.
This is the running_scf.log generated during the failed task: running_scf.log
@floatingCatty and @PeizeLin,
I added -DDEBUG_INFO=ON and it compiled successfully. @floatingCatty, you can refer to /root/abacus-develop/Dockerfile.intel to do this.
@PeizeLin I have reproduced the result on an ali machine with c32_m64_cpu:
OMP_NUM_THREADS=32 mpirun -np 1 abacus |tee log
ABACUS v3.6.1
Atomic-orbital Based Ab-initio Computation at UStc
Website: http://abacus.ustc.edu.cn/
Documentation: https://abacus.deepmodeling.com/
Repository: https://github.com/abacusmodeling/abacus-develop
https://github.com/deepmodeling/abacus-develop
Commit: fcc381479 (Fri Apr 19 15:00:06 2024 +0800)
Thu Apr 25 13:35:52 2024
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : CPU / Intel(R) Xeon(R) Platinum
dft_functional readin is: hse
dft_functional in pseudopot file is: PBE
Please make sure this is what you need
dft_functional readin is: hse
dft_functional in pseudopot file is: PBE
Please make sure this is what you need
dft_functional readin is: hse
dft_functional in pseudopot file is: PBE
Please make sure this is what you need
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Warning: the number of valence electrons in pseudopotential > 3 for Ga: [Ar] 3d10 4s2 4p1
Warning: the number of valence electrons in pseudopotential > 5 for Sb: [Kr] 4d10 5s2 5p3
Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
UNIFORM GRID DIM : 180 * 192 * 180
UNIFORM GRID DIM(BIG) : 45 * 48 * 45
DONE(1.24341 SEC) : SETUP UNITCELL
DONE(1.24544 SEC) : INIT K-POINTS
---------------------------------------------------------
Self-consistent calculations for electrons
---------------------------------------------------------
SPIN KPOINTS PROCESSORS NBASE
1 42 1 1836
---------------------------------------------------------
Use Systematically Improvable Atomic bases
---------------------------------------------------------
ELEMENT ORBITALS NBASE NATOM XC
Ga 2s2p2d1f-8au 25 48
N 2s2p1d-7au 13 47
Sb 2s2p2d1f-7au 25 1
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
DONE(2.5721 SEC) : INIT PLANEWAVE
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
terminate called after throwing an instance of 'cereal::Exception'
what(): Failed to read 8 bytes from input stream! Read 0
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 25778 RUNNING AT bohrium-13341-1125055
= KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
You can check the result in the attachment:
Thanks for the help, @WHUweiqingzhou.
@PeizeLin Hello, do you know if there is any progress on this?
I have tried computing the same system on the Shuguang platform, and almost the same error has been raised:
Hello, I have tried using 512GB memory devices, but a similar error was raised.
For extra information, I tried computing with just the conventional cell and the same settings, and it ran successfully. So could this come from a memory leak?
@PeizeLin FYI, I reproduced this error on my workstation (with 1 process and 32 threads) and located the error here in LibComm: https://github.com/abacusmodeling/LibComm/blob/ec984514b44480e98bd1578bcacca7a19c849724/include/Comm/Comm_Trans/Comm_Trans.hpp#L215
iar(key, value);
Then I set KPT to 1 1 1 without kspacing, and another error occurred here in isend_data (before recv_data):
https://github.com/abacusmodeling/LibComm/blob/ec984514b44480e98bd1578bcacca7a19c849724/include/Comm/Comm_Trans/Comm_Trans.hpp#L139
MPI_CHECK (MPI_Isend (str_isend.c_str(), str_isend.size(), MPI_CHAR, rank_isend, Comm_Trans::tag_data, this->mpi_comm, &request_isend));
Abort(604617474) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(160): MPI_Isend(buf=0x7f4e10e18010, count=-1625393416, MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD, request=0x205eaf9e0) failed
PMPI_Isend(91).: Negative count, value is -1625393416
which seems to be the same problem as #2934.
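The negative count is consistent with a send buffer larger than INT_MAX bytes: str_isend.size() is a size_t, while the count parameter of MPI_Isend is an int, so on common platforms the value wraps when it is narrowed. A minimal sketch of that arithmetic follows. The buffer size used is an assumption (the smallest size that wraps to the logged count), not a value reported in this thread, and to_int32 is a helper named here for illustration.

```shell
# Reinterpret a non-negative byte count as a signed 32-bit int, mimicking what
# typically happens when a size_t is passed to a parameter of type int.
to_int32() {
    echo $(( ( ($1 + 2147483648) % 4294967296 ) - 2147483648 ))
}

INT_MAX=2147483647

# ASSUMPTION: smallest buffer size that wraps to the count in the error stack.
assumed_size=2669573880

to_int32 "$assumed_size"    # prints -1625393416, the count shown in the log
[ "$assumed_size" -gt "$INT_MAX" ] && echo "buffer exceeds INT_MAX"
```

Under that assumption the serialized message was roughly 2.5 GiB, which would be consistent with the failure appearing only for the larger cell.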
Might this error in recv_data result from the earlier out-of-bounds in isend_data? I updated LibComm to the newest version of the master branch (where #2934 has been fixed), and the cereal::Exception error changed to a segmentation fault here:
https://github.com/abacusmodeling/LibComm/blob/9a127aadbd2575ff2c9789d0dda3677759df2a48/include/Comm/Comm_Trans/Comm_Trans.hpp#L87
MPI_CHECK (MPI_Improbe(MPI_ANY_SOURCE, this->tag_data, this->mpi_comm, &flag_iprobe, &message_recv, &status_recv));
@floatingCatty Maybe you can try 512GB memory devices after updating LibComm?
(Compile with -DLIBRI_DIR=.../LibRI -DLIBCOMM_DIR=.../LibComm -DGIT_SUBMODULE=OFF to avoid auto-checkout.)
@floatingCatty Given the results @maki49 has shown, which versions of ABACUS, LibComm and Cereal do you use? It may help us find the bug.
@PeizeLin I am using ABACUS 3.6.1, and I am not sure how to check the LibComm and Cereal versions.
When I try to update LibComm as suggested by @maki49 (or just rebuild the abacus package without updating LibComm), the build fails consistently. I am working on the Bohrium machine, and @WHUweiqingzhou found that this compilation failure might come from the abacus image having been built with the wrong Dockerfile. The new image hasn't been produced yet, and when I asked the person in charge, they said the compilation had run into a package dependency problem.
So for now, building abacus with an updated LibComm on Bohrium is not possible for me. I hope the new image can be produced soon, and I will update LibComm once it is.
Thanks for the help.
I tried the new image compiled with Dockerfile.intel; it showed a similar error when I built abacus:
I am not sure whether this build of abacus was successful, but afterwards I tried computing the same task on a c64m512 machine, and it raised this error:
@PeizeLin I am using ABACUS 3.6.1, and I am not very clear on how to check the LibComm and Cereal versions.
cd abacus-develop/deps/LibComm && git log
and check the first line.
@floatingCatty,
thanks to @maki49, I tried the latest LibComm and found that this error is indeed fixed; the run finished a full iteration:
---------------------------------------------------------
Use Systematically Improvable Atomic bases
---------------------------------------------------------
ELEMENT ORBITALS NBASE NATOM XC
Ga 2s2p2d1f-8au 25 48
N 2s2p1d-7au 13 47
Sb 2s2p2d1f-7au 25 1
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
DONE(2.70033 SEC) : INIT PLANEWAVE
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
START CHARGE : atomic
DONE(745.509 SEC) : INIT SCF
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
GE1 -1.142658e+05 0.000000e+00 1.343e-01 6.963e+01
Please add -DLIBRI_DIR=/root/abacus-develop/deps/LibRI -DLIBCOMM_DIR=/root/abacus-develop/deps/LibComm -DGIT_SUBMODULE=OFF, otherwise LibComm will automatically be checked out to the old version.
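A minimal sketch of the update and reconfigure steps, assuming the source tree layout quoted above and that the fix lives on LibComm's master branch as mentioned earlier in the thread. The function name and the build directory are illustrative, and any other options your normal configuration uses still need to be added to the cmake line.

```shell
# Sketch only: "update_libcomm_and_configure" and the "build" directory are illustrative.
update_libcomm_and_configure() {
    src="${1:-/root/abacus-develop}"   # default path taken from the comment above

    # Move the LibComm checkout to the tip of master.
    git -C "$src/deps/LibComm" checkout master || return 1
    git -C "$src/deps/LibComm" pull || return 1

    # Show the commit that will actually be compiled in.
    git -C "$src/deps/LibComm" log -1 --format='%h %cd %s' || return 1

    # Point CMake at the local checkouts and switch off the submodule step,
    # so the updated LibComm is not reset to the older commit.
    cmake -S "$src" -B "$src/build" \
        -DLIBRI_DIR="$src/deps/LibRI" \
        -DLIBCOMM_DIR="$src/deps/LibComm" \
        -DGIT_SUBMODULE=OFF
}

# Usage (not executed here):
#   update_libcomm_and_configure /root/abacus-develop
#   cmake --build /root/abacus-develop/build --parallel
```

Printing the LibComm commit before configuring makes it easy to confirm afterwards that the binary was built against the updated library rather than the old checkout.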
Hello all, @WHUweiqingzhou @maki49 @PeizeLin, I have successfully updated LibComm with the careful assistance of @WHUweiqingzhou and tried to compute the system again on 512GB memory devices. It works well. Thank you again for the help.
Describe the bug
When running an HSE SCF task on the GaSbN system, it keeps raising this error:
Expected behavior
The calculation should run normally.
To Reproduce
GaSbN96_hse.zip
Environment
No response
Additional Context
No response