Exa-sCI / QuantumEnvelope


Running with mpirun hangs #46

Open anbenali opened 2 years ago

anbenali commented 2 years ago

Running python3 ../../main.py nh3.1det.fcidump nh3.1det.wf 100 produces the following within 10 seconds:

Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775

However, with mpirun -n 8 python3 ../../main.py nh3.1det.fcidump nh3.1det.wf 100 the run appears to hang (for at least 5 minutes), even though all eight processes are busy at about 100% CPU:


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                              
  10633 abenali   20   0 1750856   1.3g  36500 R 100.0   4.1   7:12.84 python3                                                                                                                                                              
  10635 abenali   20   0 1751112   1.3g  36224 R 100.0   4.3   7:12.86 python3                                                                                                                                                              
  10636 abenali   20   0 1751112   1.3g  36252 R 100.0   4.2   7:12.38 python3                                                                                                                                                              
  10637 abenali   20   0 1750856   1.3g  36308 R 100.0   4.2   7:12.77 python3                                                                                                                                                              
  10638 abenali   20   0 1605704   1.3g  36244 R 100.0   4.1   7:12.83 python3                                                                                                                                                              
  10639 abenali   20   0 1750856   1.3g  36508 R 100.0   4.2   7:12.82 python3                                                                                                                                                              
  10640 abenali   20   0 1605448   1.3g  36432 R 100.0   4.1   7:12.49 python3                                                                                                                                                              
  10634 abenali   20   0 1604936   1.2g  36432 R  99.3   3.7   7:12.84 python3                                                                                                                                                              

Then, once the calculation is done, we get the following:

abenali@abenali:~/Work/src/QuantumEnvelope/data/test$ mpirun -n 8 python3 ../../main.py nh3.1det.fcidump nh3.1det.wf 100
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865

As you can see, the output is held back until the end, and it is printed by every rank at the same time rather than by the master alone at each iteration.

This is obviously from here (main.py):


    while len(psi_det) < N_det_target:
        E, psi_coef, psi_det = selection_step(comm, lewis, n_ord, psi_coef, psi_det, len(psi_det))
        # Update Hamiltonian engine
        lewis = Hamiltonian_generator(
            comm, E0, d_one_e_integral, d_two_e_integral, psi_det, driven_by=driven_by
        )
        print(f"N_det: {len(psi_det)}, E {E}")
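
One possible way to get a single line per iteration is to guard the print by rank and flush it right away. The sketch below is only an illustration, not the project's code: print_on_root and FakeComm are hypothetical names, and the only assumption about comm is that it exposes Get_rank(), as mpi4py communicators do. Flushing may also help with the delayed output, since Python typically buffers stdout when it is not attached to a terminal.

```python
def print_on_root(comm, message, root=0, stream=None):
    """Print `message` only on rank `root`; return True if it was printed.

    `comm` is anything with a Get_rank() method, e.g. mpi4py's MPI.COMM_WORLD.
    flush=True pushes the line out immediately instead of leaving it buffered.
    """
    if comm.Get_rank() != root:
        return False
    print(message, file=stream, flush=True)
    return True


class FakeComm:
    """Hypothetical stand-in for an MPI communicator, so this runs without MPI."""

    def __init__(self, rank):
        self._rank = rank

    def Get_rank(self):
        return self._rank


# Simulate 8 ranks reaching the same print statement: only rank 0 prints.
printed = [
    print_on_root(FakeComm(rank), "N_det: 2, E -56.170139747833076")
    for rank in range(8)
]
```

In main.py the call would presumably replace the bare print inside the while loop, with the real comm passed in place of FakeComm.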

seanr7 commented 2 years ago

Is there a reason why the "Davidson Failed, fallback to numpy eigh" error is being raised here? I don't get this on my machine for the same test case.