deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0
174 stars, 136 forks

LibComm v0.1.1 fails to run EXX cases in parallel on my machine #5436

Closed: maki49 closed this issue 2 weeks ago

maki49 commented 2 weeks ago

Describe the bug

After #5428, I compiled the newest ABACUS exactly as before, except that I cloned LibRI and LibComm and set their directories manually. However, every EXX case I have tried to run with multiple processes fails:

 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.0504186  SEC) : INIT PLANEWAVE
Abort(470409475) on node 0 (rank 0 in comm 0): Fatal error in internal_Isend: Unknown error class, error stack:
internal_Isend(30152): MPI_Isend(buf=0x7f3840001a80, count=11144, INVALID DATATYPE, 1, 0, MPI_COMM_WORLD, request=0x3e07474) failed
internal_Isend(30098): Invalid datatype

The error does not occur if I reset LibComm from commit 55ea39ed2916f31335c3ed469619505aa716c92a to 6975a5f50ebed7ab9b5a3eee35a8324a97833c44.
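For anyone who wants to try the same workaround, here is a minimal sketch of pinning a local LibComm clone to the older commit. The helper name `pin_commit` and the clone path in the usage comment are hypothetical and not part of ABACUS or LibComm. Only the commit hash comes from the report above.

```shell
# Hypothetical helper (not part of ABACUS or LibComm): check out a specific
# commit in an existing local clone and print the commit HEAD now points to.
pin_commit() {
    repo_dir="$1"   # path to the local clone (assumed to exist already)
    commit="$2"     # full or abbreviated commit hash to pin to

    # --detach makes it explicit that we leave the branch and sit on a fixed commit.
    git -C "$repo_dir" checkout --quiet --detach "$commit" || return 1

    # Print the resolved hash so the caller can confirm the pin took effect.
    git -C "$repo_dir" rev-parse HEAD
}

# Assumed usage; replace the path with wherever LibComm was cloned:
# pin_commit /path/to/LibComm 6975a5f50ebed7ab9b5a3eee35a8324a97833c44
```

After pinning, ABACUS would presumably need to be rebuilt against the same LibComm directory for the change to take effect.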

maki49 commented 2 weeks ago

My PR #5435 changes nothing EXX-related, yet some of the EXX cases failed on CI: https://github.com/deepmodeling/abacus-develop/actions/runs/11729854060/job/32676380872

maki49 commented 2 weeks ago

@PeizeLin does not encounter this problem in the same environment:

The problem may be specific to my machine, so I will close this issue. Anyone who runs into the same problem is welcome to reopen it.