ValeevGroup / mpqc

The Massively Parallel Quantum Chemistry program, MPQC, computes properties of atoms and molecules from first principles using the time independent Schrödinger equation.

lock issue with Ref #24

Closed pchong90 closed 10 years ago

pchong90 commented 10 years ago

When I use an uncontracted ABS for a CABS_Single calculation, the pt2r12 program sometimes deadlocks and hangs while computing integrals. The problem occurs only with specific input files and methods. Here is one example. The code is on my repository. The input file is available here. Running the command

pt2r12 -prefix ethene.cc-pVTZ.1Ag -cabs cc-pVTZ-F12-CABS -dfbs cc-pVQZ-RI -r12 false -singles true -partitionH dyall_1 -cabs_contraction false

leaves the program stuck waiting on a lock. Here is part of the backtrace:

#0  0x00007fff88a65746 in __psynch_mutexwait ()
#1  0x00007fff8dc80779 in _pthread_mutex_lock ()
#2  0x0000000100fba68f in sc::RefCount::lock_ptr (this=0x10511a890) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/util/ref/ref.cc:128
#3  0x000000010001083a in sc::RefCount::reference (this=0x10511a890) at ref.h:275
#4  0x0000000100fbac67 in sc::RefBase::reference (this=0x7fff5fbedad8, p=0x10511a890) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/util/ref/ref.cc:262
#5  0x0000000100749954 in Ref (this=0x7fff5fbedad8, a=0x10511a000) at ref.h:374
#6  0x0000000100747b7d in Ref (this=0x7fff5fbedad8, a=0x10511a000) at ref.h:376
#7  0x00000001007449d4 in sc::CoreIntsEngine<libint2::FmEval_Chebyshev3>::instance<unsigned int> (mmax=16) at core_ints_engine.h:61
#8  0x000000010079ad6f in OSAR_CoreInts (this=0x106fbe640, mmax=16, params=@0x7fff5fbef148) at tbosar.h:64
#9  0x0000000100797193 in OSAR_CoreInts (this=0x106fbe640, mmax=16, params=@0x7fff5fbef148) at tbosar.h:65
#10 0x00000001007955ad in TwoBodyOSARLibint2 (this=0x106fbe440, integral=0x104f9ff10, b1=@0x104f9ff18, b2=@0x104f9ff28, b3=@0x104f9ff18, b4=@0x104f9ff28, storage=0, oper_params=@0x7fff5fbef148) at tbosar.h:331
#11 0x00000001007952ce in TwoBodyOSARLibint2 (this=0x106fbe440, integral=0x104f9ff10, b1=@0x104f9ff18, b2=@0x104f9ff28, b3=@0x104f9ff18, b4=@0x104f9ff28, storage=0, oper_params=@0x7fff5fbef148) at tbosar.h:404
#12 0x000000010076d2fe in sc::libint2::Int2eCreator<sc::TwoBodyOSARLibint2<(sc::TwoBodyOper::type)0> >::operator() (this=0x7fff5fbeea38, integral=0x104f9ff10, b1=@0x104f9ff18, b2=@0x104f9ff28, b3=@0x104f9ff18, b4=@0x104f9ff28, storage=0, params=@0x7fff5fbef148) at tbintlibint2.h:198
#13 0x000000010076c2e2 in BoundsLibint2 (this=0x106fbe3e0, integral=0x104f9ff10, b1=@0x104f9ff18, b2=@0x104f9ff28, b3=@0x104f9ff38, b4=@0x7fff5fbef000, storage=0, params=@0x7fff5fbef148) at bounds.timpl.h:73
#14 0x00000001007666c8 in TwoBodyThreeCenterIntLibint2 (this=0x106fbe310, integral=0x104f9ff10, b1=@0x104f9ff18, b2=@0x104f9ff28, b3=@0x104f9ff38, storage=0, int2etype=sc::TwoBodyOperSet::ERI, params=@0x7fff5fbef148) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/libint2/tbintlibint2.cc:300
#15 0x000000010076650e in TwoBodyThreeCenterIntLibint2 (this=0x106fbe310, integral=0x104f9ff10, b1=@0x104f9ff18, b2=@0x104f9ff28, b3=@0x104f9ff38, storage=0, int2etype=sc::TwoBodyOperSet::ERI, params=@0x7fff5fbef148) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/libint2/tbintlibint2.cc:385
#16 0x00000001007576bc in sc::IntegralLibint2::electron_repulsion3 (this=0x104f9ff10) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/libint2/libint2.cc:399
#17 0x000000010030496f in sc::detail::ERIEvalCreator<3>::eval (factory=0x104f9ff10, params=@0x10dc26d50) at integral.h:682
#18 0x0000000100304927 in sc::TwoBodyIntTraits<3, (sc::TwoBodyOperSet::type)0>::eval (factory=@0x10dc26d40, params=@0x10dc26d50) at inttraits.h:151
#19 0x0000000100304718 in sc::TwoBodyNCenterIntDescr<3, (sc::TwoBodyOperSet::type)0>::inteval (this=0x10dc26d30) at intdescr.h:117
#20 0x000000010032a013 in sc::TwoBodyThreeCenterMOIntsTransform_ijR::compute_pjR (this=0x106fbe130) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/lcao/transform_ijR.cc:563
#21 0x0000000100328ebc in sc::TwoBodyThreeCenterMOIntsTransform_ijR::compute (this=0x106fbe130) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/lcao/transform_ijR.cc:235
#22 0x00000001002b76e9 in sc::detail::coulomb_df (df_info=@0x104f2d9d8, P=@0x104f2da68, brabs=@0x104ed45d8, ketbs=@0x104ed45d8, obs=@0x104f2da38) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/lcao/fockbuilder.cc:1091
#23 0x00000001002a9f9c in TwoBodyFockMatrixDFBuilder (this=0x10dc245f0, compute_F=false, compute_J=true, compute_K=true, brabasis=@0x104ed45d8, ketbasis=@0x104ed45d8, densitybasis=@0x104f2da38, density=@0x104f2da68, openshelldensity=@0x104f2da78, df_info=@0x104f2d9d8, psqrtregistry=@0x104f2da98) at fockbuilder.h:585
#24 0x0000000100297b8f in TwoBodyFockMatrixDFBuilder (this=0x10dc245f0, compute_F=false, compute_J=true, compute_K=true, brabasis=@0x104ed45d8, ketbasis=@0x104ed45d8, densitybasis=@0x104f2da38, density=@0x104f2da68, openshelldensity=@0x104f2da78, df_info=@0x104f2d9d8, psqrtregistry=@0x104f2da98) at fockbuilder.h:610
#25 0x000000010028e326 in sc::FockBuildRuntime::get (this=0x104f2d9d0, key=@0x7fff5fbf40a0) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/lcao/fockbuild_runtime.cc:335
#26 0x0000000100c4b3fd in sc::SingleReference_R12Intermediates<double>::xy (this=0x106da53f8, key=@0x7fff5fbf7910) at sr_r12intermediates_util.h:452
#27 0x0000000100c4a21c in sc::SingleReference_R12Intermediates<double>::_2 (this=0x106da53f8, key=@0x7fff5fbf7910) at sr_r12intermediates_util.h:582
#28 0x0000000100952136 in sc::PT2R12::_2 (this=0x1058c2e00, key=@0x7fff5fbf7910) at pt2r12.h:416
#29 0x000000010090c22f in sc::PT2R12::cabs_singles_Dyall (this=0x1058c2e00) at /Users/ChongPeng/Workspace/Development/source/mpqc_dev/src/lib/chemistry/qc/mbptr12/pt2r12.cc:1749

This example runs fine when I use contracted CABS (-cabs_contraction true). It also runs if I change the NLOCKS macro in src/lib/util/ref/ref.cc from 251 to 241.

I haven't tested this example on other computers. However, the same problem occurred with a different input on Blueridge.

evaleev commented 10 years ago

also: does the test pass if the shell environment variable MAD_NUM_THREADS is set to 1?

pchong90 commented 10 years ago

Yes, I am running the program in one thread and the problem still exists.

MADNESS runtime initialized with 0 threads in the pool and affinity -1 -1 -1

evaleev commented 10 years ago

This is related to the changeset 67fe92423dcbefe7562df629c4a496d27135a7ce ... basically this is related to the finite pool of locks available to lock smart pointers in mpqc.
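To illustrate why a finite lock pool can behave differently for NLOCKS = 251 and NLOCKS = 241, here is a minimal sketch. It assumes the pool mutex is picked by reducing the object's address modulo the pool size; `lock_slot` and `shares_lock` are hypothetical names, and the mapping is an assumption, not the actual code in ref.cc. Under that assumption, two distinct objects can end up guarded by the same mutex, and a thread that holds the mutex for one of them while referencing the other would block on itself if the mutex is not recursive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a fixed-size lock pool. The mapping below is an
// assumption for illustration, not MPQC's actual hashing.
inline std::size_t lock_slot(const void* p, std::size_t nlocks) {
    assert(nlocks > 0);
    // Drop the low 4 bits (assumed 16-byte allocation alignment), then
    // reduce the remaining address bits modulo the pool size.
    return (reinterpret_cast<std::uintptr_t>(p) >> 4) % nlocks;
}

// True when two distinct objects would be guarded by the same pool mutex.
inline bool shares_lock(const void* a, const void* b, std::size_t nlocks) {
    return a != b && lock_slot(a, nlocks) == lock_slot(b, nlocks);
}
```

With this mapping, two addresses that differ by 16 * 251 bytes collide when the pool has 251 locks but not when it has 241, which would be consistent with a hang that appears or disappears when NLOCKS changes.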

The solution is to introduce atomic accesses to the smart pointers, a la std::shared_ptr. This is a loaded issue, still thinking about how to best go about this.
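As a rough sketch of what atomic accesses could look like, the class below keeps an intrusive reference count in a `std::atomic` so that no mutex from a shared pool is needed. `AtomicRefCount` and its member names are hypothetical and only loosely modeled on a RefCount-style interface; this is not the change that was made in mpqc.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical base class for intrusively counted objects; not sc::RefCount.
class AtomicRefCount {
public:
    AtomicRefCount() noexcept : count_(0) {}
    AtomicRefCount(const AtomicRefCount&) = delete;
    AtomicRefCount& operator=(const AtomicRefCount&) = delete;

    // Increment the count and return the new value. Relaxed ordering is
    // enough here because no data is published through the increment.
    std::size_t reference() noexcept {
        return count_.fetch_add(1, std::memory_order_relaxed) + 1;
    }

    // Decrement the count and return the new value. acq_rel ordering lets
    // the thread that observes zero see all earlier writes to the object.
    std::size_t dereference() noexcept {
        const std::size_t old = count_.fetch_sub(1, std::memory_order_acq_rel);
        assert(old > 0);  // dereferencing an unreferenced object is a bug
        return old - 1;
    }

    std::size_t nreference() const noexcept {
        return count_.load(std::memory_order_acquire);
    }

private:
    std::atomic<std::size_t> count_;
};
```

A smart pointer built on such a base would delete the object when `dereference()` returns 0. Because each object carries its own counter, no two objects ever contend for the same lock.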

calewis commented 10 years ago

Switch to shared_ptr?

justusc commented 10 years ago

There is an implementation of shared pointer in MADNESS that you could use as a basis. It is a port of Boost’s shared pointer. Then you can add the Ref-specific functionality on top. Or, you could inherit from std::shared_ptr directly and likewise add mpqc functionality (though I suspect this is discouraged).

Justus
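One way to layer Ref-style functionality on top of a shared pointer without inheriting from it is composition. The sketch below wraps `std::shared_ptr` in a hypothetical `SharedRef` class; the member names (`null`, `nonnull`, `pointer`, `nreference`) are assumptions chosen to resemble a Ref-like interface and are not taken from the mpqc or MADNESS sources.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical wrapper: holds a std::shared_ptr by composition and exposes
// a small Ref-like interface on top of it.
template <typename T>
class SharedRef {
public:
    SharedRef() = default;
    explicit SharedRef(T* raw) : ptr_(raw) {}  // takes ownership of raw
    explicit SharedRef(std::shared_ptr<T> p) : ptr_(std::move(p)) {}

    bool null() const noexcept { return ptr_ == nullptr; }
    bool nonnull() const noexcept { return ptr_ != nullptr; }
    T* pointer() const noexcept { return ptr_.get(); }

    // Number of SharedRef/shared_ptr instances sharing ownership.
    long nreference() const noexcept { return ptr_.use_count(); }

    T& operator*() const {
        assert(ptr_ != nullptr);
        return *ptr_;
    }
    T* operator->() const {
        assert(ptr_ != nullptr);
        return ptr_.get();
    }

private:
    std::shared_ptr<T> ptr_;  // reference counting is handled atomically here
};
```

Composition keeps the standard pointer's thread-safe counting while leaving room for extra members, and it avoids deriving from a class that was not designed as a base.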


evaleev commented 10 years ago

fixed on taexp branch, please cherry-pick onto master.

this is just a workaround, with locking done explicitly outside the pointer.

switch to shared_ptr is a big jump.