aiidalab / aiidalab-qe

AiiDAlab App for Quantum ESPRESSO
https://aiidalab-qe.readthedocs.io/
MIT License

XPS spectra not calculated #630

Closed bio15 closed 8 months ago

bio15 commented 9 months ago

Dear QE-team,

I would like to use the QE app to calculate XPS spectra, but the workflow does not finish.

I follow the tutorial at https://aiidalab-qe.readthedocs.io/howto/xps.html, but I never reach step 4 because my simulation stops before that. I am trying to run on Daint with GPUs. This is the result I get: (screenshot)

Best

Simon

superstar54 commented 9 months ago

Hi @bio15 , could you inspect the process in the terminal, by

# list all processes
verdi process list -a
# find the XpsWorkChain and report its pk (number)
verdi process report <pk>

and share all the output here?

You can also report other pks, like the failed PwCalculation process.
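If the report gets long, one option to spot the first process that died is to grep the saved report for failure lines (a sketch; the pk 1140 and the file name are just examples, the grep pattern matches the wording of the report messages):

```shell
# Dump the top-level workchain report to a file (pk is illustrative)
verdi process report 1140 > qe_report.txt
# Keep only the lines that announce a failure, to find the first
# sub-process that failed and its exit status
grep 'failed with exit status' qe_report.txt
```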

bio15 commented 9 months ago

Hi @superstar54, here is the list of processes for one run (screenshot). There is no XpsWorkChain. But the other is:

(base) jovyan@fc3faa2ad794:~$ verdi process report 1140
2024-03-01 13:23:24 [338 | REPORT]: [1140|QeAppWorkChain|run_relax]: launching PwRelaxWorkChain<1142>
2024-03-01 13:23:25 [339 | REPORT]: [1142|PwRelaxWorkChain|run_relax]: launching PwBaseWorkChain<1145>
2024-03-01 13:23:26 [340 | REPORT]: [1145|PwBaseWorkChain|run_process]: launching PwCalculation<1150> iteration #1
2024-03-01 13:26:46 [345 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: PwCalculation<1150> failed with exit status 305: Both the stdout and XML output files could not be read or parsed.
2024-03-01 13:26:46 [346 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2024-03-01 13:26:46 [347 | REPORT]: [1145|PwBaseWorkChain|inspect_process]: PwCalculation<1150> failed but a handler detected an unrecoverable problem, aborting
2024-03-01 13:26:46 [348 | REPORT]: [1145|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2024-03-01 13:26:46 [349 | REPORT]: [1142|PwRelaxWorkChain|inspect_relax]: relax PwBaseWorkChain failed with exit status 300
2024-03-01 13:26:46 [350 | REPORT]: [1142|PwRelaxWorkChain|on_terminated]: remote folders will not be cleaned
2024-03-01 13:26:47 [351 | REPORT]: [1140|QeAppWorkChain|inspect_relax]: PwRelaxWorkChain failed with exit status 401
2024-03-01 13:26:47 [352 | REPORT]: [1140|QeAppWorkChain|on_terminated]: remote folders will not be cleaned

verdi process report 1142
2024-03-01 13:23:25 [339 | REPORT]: [1142|PwRelaxWorkChain|run_relax]: launching PwBaseWorkChain<1145>
2024-03-01 13:23:26 [340 | REPORT]: [1145|PwBaseWorkChain|run_process]: launching PwCalculation<1150> iteration #1
2024-03-01 13:26:46 [345 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: PwCalculation<1150> failed with exit status 305: Both the stdout and XML output files could not be read or parsed.
2024-03-01 13:26:46 [346 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2024-03-01 13:26:46 [347 | REPORT]: [1145|PwBaseWorkChain|inspect_process]: PwCalculation<1150> failed but a handler detected an unrecoverable problem, aborting
2024-03-01 13:26:46 [348 | REPORT]: [1145|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2024-03-01 13:26:46 [349 | REPORT]: [1142|PwRelaxWorkChain|inspect_relax]: relax PwBaseWorkChain failed with exit status 300
2024-03-01 13:26:46 [350 | REPORT]: [1142|PwRelaxWorkChain|on_terminated]: remote folders will not be cleaned

verdi process report 1145
2024-03-01 13:23:26 [340 | REPORT]: [1145|PwBaseWorkChain|run_process]: launching PwCalculation<1150> iteration #1
2024-03-01 13:26:46 [345 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: PwCalculation<1150> failed with exit status 305: Both the stdout and XML output files could not be read or parsed.
2024-03-01 13:26:46 [346 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2024-03-01 13:26:46 [347 | REPORT]: [1145|PwBaseWorkChain|inspect_process]: PwCalculation<1150> failed but a handler detected an unrecoverable problem, aborting
2024-03-01 13:26:46 [348 | REPORT]: [1145|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned

1150: None

Scheduler output:

Batch Job Summary Report (version 21.01.1) for Job "aiida-1150" (52051020) on daint

Job information (1/3)

         Submit            Eligible               Start                 End    Elapsed Time limit

2024-03-01T14:24:49 2024-03-01T14:24:49 2024-03-01T14:24:59 2024-03-01T14:25:24 00:00:25 12:00:00

Job information (2/3)

Username      Account    Partition   NNodes        Energy

sgramatt      emshare       normal        1      3.476 kJ

Job information (3/3) - GPU utilization data

Node name Usage Max mem Execution time


nid04173        15 %    16255 MiB       00:00:16

Scheduler errors:

Switching to atp/3.14.5.
Switching to cray-mpich/7.7.18.
Switching to craype/2.7.10.
Switching to modules/3.2.11.4.
Switching to nvidia/21.3.
Switching to perftools-base/21.09.0.
Switching to pmi/5.0.17.
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "". (repeated 12 times)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 32158080 bytes requested; status = 2(out of memory)
0: ALLOCATE: 32158080 bytes requested; status = 2(out of memory)
srun: error: nid04173: tasks 0-6,8-11: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=52051020.0
slurmstepd: error: *** STEP 52051020.0 ON nid04173 CANCELLED AT 2024-03-01T14:25:23 ***
srun: error: nid04173: task 7: Terminated
srun: Force Terminated StepId=52051020.0

*** 4 LOG MESSAGES:
+-> WARNING at 2024-03-01 13:26:46.119489+00:00 | key 'symmetries' is not present in raw output dictionary
+-> ERROR at 2024-03-01 13:26:46.235993+00:00 | ERROR_OUTPUT_STDOUT_INCOMPLETE
+-> ERROR at 2024-03-01 13:26:46.244600+00:00 | Both the stdout and XML output files could not be read or parsed.
+-> WARNING at 2024-03-01 13:26:46.249707+00:00 | output parser returned exit code<305>: Both the stdout and XML output files could not be read or parsed.
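As an aside on the repeated OMP warning above: it means OMP_NUM_THREADS is set but empty. A minimal sketch of a fix, assuming the job environment can be adjusted (for example via the computer's prepend text in the AiiDA setup):

```shell
# The warning appears because OMP_NUM_THREADS is exported as an empty
# string; giving it an explicit value (1 thread per MPI rank here)
# silences it.
export OMP_NUM_THREADS=1
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```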

Thanks for your help

Simon

superstar54 commented 9 months ago

Thanks for the output. Could you go to Daint by

verdi calcjob gotocomputer 1150

There will be aiida.in and aiida.out files there; could you show the content of these files?
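A quick check to run in the calculation's working directory, in case the scheduler stderr confirms an out-of-memory kill (`_scheduler-stderr.txt` is AiiDA's standard file name for the scheduler's stderr; a sketch):

```shell
# In the calculation's working directory on the cluster:
# count OOM messages in the scheduler stderr; a non-zero count means
# the job was killed for memory, matching the ALLOCATE failures above.
grep -c 'out of memory' _scheduler-stderr.txt
```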

bio15 commented 9 months ago

Here is the content of aiida.in:

&CONTROL
  calculation = 'vc-relax'
  etot_conv_thr = 1.4000000000d-03
  forc_conv_thr = 1.0000000000d-03
  max_seconds = 4.1040000000d+04
  outdir = './out/'
  prefix = 'aiida'
  pseudo_dir = './pseudo/'
  tprnfor = .true.
  tstress = .true.
  verbosity = 'high'
/
&SYSTEM
  ecutrho = 4.8000000000d+02
  ecutwfc = 6.0000000000d+01
  ibrav = 0
  nat = 14
  nosym = .false.
  ntyp = 2
  occupations = 'fixed'
  tot_charge = 0.0000000000d+00
  vdw_corr = 'none'
/
&ELECTRONS
  conv_thr = 5.6000000000d-09
  electron_maxstep = 80
  mixing_beta = 4.0000000000d-01
/
&IONS
/
&CELL
  cell_dofree = 'all'
  press_conv_thr = 5.0000000000d-01
/
ATOMIC_SPECIES
C 12.011 C.pbe-n-kjpaw_psl.1.0.0.UPF
H 1.008 H.pbe-rrkjus_psl.1.0.0.UPF
ATOMIC_POSITIONS angstrom
C 8.8758000000 7.1581000000 5.0001000000
C 8.1783000000 8.3661000000 5.0001000000
C 8.1783000000 5.9500000000 5.0001000000
C 6.7835000000 8.3663000000 5.0001000000
C 6.7834000000 5.9502000000 5.0001000000
C 6.0861000000 7.1583000000 5.0001000000
C 10.3048000000 7.1581000000 5.0001000000
C 11.5075000000 7.1584000000 5.0001000000
H 8.7075000000 9.3161000000 5.0000000000
H 8.7075000000 5.0000000000 5.0001000000
H 6.2403000000 9.3068000000 5.0001000000
H 6.2401000000 5.0098000000 5.0001000000
H 5.0000000000 7.1583000000 5.0002000000
H 12.5724000000 7.1585000000 5.0002000000
K_POINTS automatic
1 1 2 0 0 0
CELL_PARAMETERS angstrom
17.5724000000 0.0000000000 0.0000000000
0.0000000000 14.3161000000 0.0000000000
0.0000000000 0.0000000000 10.0002000000

And aiida.out:

 Program PWSCF v.7.2 starts on 1Mar2024 at 14:25: 8

 This program is part of the open-source Quantum ESPRESSO suite
 for quantum simulation of materials; please cite
     "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
     "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
     "P. Giannozzi et al., J. Chem. Phys. 152 154105 (2020);
      URL http://www.quantum-espresso.org", 
 in publications or presentations arising from this work. More details at
 http://www.quantum-espresso.org/quote

 Parallel version (MPI & OpenMP), running on      12 processor cores
 Number of MPI processes:                12
 Threads/MPI process:                     1

 MPI processes distributed on     1 nodes
 58588 MiB available memory on the printing compute node when the environment starts

 Reading input from aiida.in

 Current dimensions of program PWSCF are:
 Max number of different atomic species (ntypx) = 10
 Max number of k-points (npk) =  40000
 Max angular momentum in pseudopotentials (lmaxx) =  4
 file C.pbe-n-kjpaw_psl.1.0.0.UPF: wavefunction(s)  2S 2P renormalized

 R & G space division:  proc/nbgrp/npool/nimage =      12
 Subspace diagonalization in iterative solution of the eigenvalue problem:
 a serial algorithm will be used

 Parallelization info
 --------------------
 sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
 Min        2859    1429    371               251235    88837   11829
 Max        2861    1430    373               251236    88840   11830
 Sum       34319   17155   4463              3014825  1066063  141957

 Using Slab Decomposition

 bravais-lattice index     =            0
 lattice parameter (alat)  =      33.2070  a.u.
 unit-cell volume          =   16977.0056 (a.u.)^3
 number of atoms/cell      =           14
 number of atomic types    =            2
 number of electrons       =        38.00
 number of Kohn-Sham states=           19
 kinetic-energy cutoff     =      60.0000  Ry
 charge density cutoff     =     480.0000  Ry
 scf convergence threshold =      5.6E-09
 mixing beta               =       0.4000
 number of iterations used =            8  plain     mixing
 energy convergence thresh.=      1.4E-03
 force convergence thresh. =      1.0E-03
 press convergence thresh. =      5.0E-01
 Exchange-correlation= PBE
                       (   1   4   3   4   0   0   0)
 nstep                     =           50

 GPU acceleration is ACTIVE.

 Message from routine print_cuda_info:
 High GPU oversubscription detected. Are you sure this is what you want?

 GPU used by master process:

    Device Number: 0
    Device name: Tesla P100-PCIE-16GB
    Compute capability : 60
    Ratio of single to double precision performance  : 2
    Memory Clock Rate (KHz): 715000
    Memory Bus Width (bits): 4096
    Peak Memory Bandwidth (GB/s): 732.16

 celldm(1)=  33.207023  celldm(2)=   0.000000  celldm(3)=   0.000000
 celldm(4)=   0.000000  celldm(5)=   0.000000  celldm(6)=   0.000000

 crystal axes: (cart. coord. in units of alat)
           a(1) = (   1.000000   0.000000   0.000000 )  
           a(2) = (   0.000000   0.814692   0.000000 )  
           a(3) = (   0.000000   0.000000   0.569086 )  

 reciprocal axes: (cart. coord. in units 2 pi/alat)
           b(1) = (  1.000000  0.000000  0.000000 )  
           b(2) = (  0.000000  1.227457  0.000000 )  
           b(3) = (  0.000000  0.000000  1.757205 )  

 PseudoPot. # 1 for C  read from file:
 ./pseudo/C.pbe-n-kjpaw_psl.1.0.0.UPF
 MD5 check sum: 5d2aebdfa2cae82b50a7e79e9516da0f
 Pseudo is Projector augmented-wave + core cor, Zval =  4.0
 Generated using "atomic" code by A. Dal Corso  v.5.1
 Shape of augmentation charge: PSQ
 Using radial grid of 1073 points,  4 beta functions with: 
            l(1) =   0
            l(2) =   0
            l(3) =   1
            l(4) =   1
 Q(r) pseudized with 0 coefficients 

 PseudoPot. # 2 for H  read from file:
 ./pseudo/H.pbe-rrkjus_psl.1.0.0.UPF
 MD5 check sum: f52b6d4d1c606e5624b1dc7b2218f220
 Pseudo is Ultrasoft, Zval =  1.0
 Generated using "atomic" code by A. Dal Corso  v.5.1
 Using radial grid of  929 points,  2 beta functions with: 
            l(1) =   0
            l(2) =   0
 Q(r) pseudized with 0 coefficients 

 atomic species   valence    mass     pseudopotential
    C              4.00    12.01100     C ( 1.00)
    H              1.00     1.00800     H ( 1.00)

 No symmetry found

                                s                        frac. trans.

  isym =  1     identity                                     

cryst. s( 1) = ( 1 0 0 ) ( 0 1 0 ) ( 0 0 1 )

cart. s( 1) = ( 1.0000000 0.0000000 0.0000000 ) ( 0.0000000 1.0000000 0.0000000 ) ( 0.0000000 0.0000000 1.0000000 )

Cartesian axes

 site n.     atom                  positions (alat units)
     1           C   tau(   1) = (   0.5050989   0.4073490   0.2845428  )
     2           C   tau(   2) = (   0.4654060   0.4760932   0.2845428  )
     3           C   tau(   3) = (   0.4654060   0.3385992   0.2845428  )
     4           C   tau(   4) = (   0.3860315   0.4761046   0.2845428  )
     5           C   tau(   5) = (   0.3860258   0.3386105   0.2845428  )
     6           C   tau(   6) = (   0.3463443   0.4073604   0.2845428  )
     7           C   tau(   7) = (   0.5864196   0.4073490   0.2845428  )
     8           C   tau(   8) = (   0.6548622   0.4073661   0.2845428  )
     9           H   tau(   9) = (   0.4955214   0.5301552   0.2845371  )
    10           H   tau(  10) = (   0.4955214   0.2845371   0.2845428  )
    11           H   tau(  11) = (   0.3551194   0.5296260   0.2845428  )
    12           H   tau(  12) = (   0.3551080   0.2850948   0.2845428  )
    13           H   tau(  13) = (   0.2845371   0.4073604   0.2845485  )
    14           H   tau(  14) = (   0.7154629   0.4073718   0.2845485  )

Crystallographic axes

 site n.     atom                  positions (cryst. coord.)
     1           C   tau(   1) = (  0.5050989  0.5000035  0.5000000  )
     2           C   tau(   2) = (  0.4654060  0.5843840  0.5000000  )
     3           C   tau(   3) = (  0.4654060  0.4156160  0.5000000  )
     4           C   tau(   4) = (  0.3860315  0.5843980  0.5000000  )
     5           C   tau(   5) = (  0.3860258  0.4156300  0.5000000  )
     6           C   tau(   6) = (  0.3463443  0.5000175  0.5000000  )
     7           C   tau(   7) = (  0.5864196  0.5000035  0.5000000  )
     8           C   tau(   8) = (  0.6548622  0.5000244  0.5000000  )
     9           H   tau(   9) = (  0.4955214  0.6507429  0.4999900  )
    10           H   tau(  10) = (  0.4955214  0.3492571  0.5000000  )
    11           H   tau(  11) = (  0.3551194  0.6500933  0.5000000  )
    12           H   tau(  12) = (  0.3551080  0.3499417  0.5000000  )
    13           H   tau(  13) = (  0.2845371  0.5000175  0.5000100  )
    14           H   tau(  14) = (  0.7154629  0.5000314  0.5000100  )

 number of k points=     2
                   cart. coord. in units 2pi/alat
    k(    1) = (   0.0000000   0.0000000   0.0000000), wk =   1.0000000
    k(    2) = (   0.0000000   0.0000000  -0.8786024), wk =   1.0000000

                   cryst. coord.
    k(    1) = (   0.0000000   0.0000000   0.0000000), wk =   1.0000000
    k(    2) = (   0.0000000   0.0000000  -0.5000000), wk =   1.0000000

 Dense  grid:  3014825 G-vectors     FFT dimensions: ( 240, 192, 135)

 Smooth grid:  1066063 G-vectors     FFT dimensions: ( 180, 135,  96)

 Dynamical RAM for                 wfc:       3.22 MB

 Dynamical RAM for     wfc (w. buffer):       9.66 MB

 Dynamical RAM for           str. fact:       7.67 MB

 Dynamical RAM for           local pot:       0.00 MB

 Dynamical RAM for          nlocal pot:      12.88 MB

 Dynamical RAM for                qrad:       2.01 MB

 Dynamical RAM for          rho,v,vnew:      24.16 MB

 Dynamical RAM for               rhoin:       8.05 MB

 Dynamical RAM for            rho*nmix:      61.34 MB

 Dynamical RAM for           G-vectors:      15.05 MB

 Dynamical RAM for          h,s,v(r/c):       0.07 MB

 Dynamical RAM for          <psi|beta>:       0.02 MB

 Dynamical RAM for                 psi:       6.44 MB

 Dynamical RAM for                hpsi:       6.44 MB

 Dynamical RAM for                spsi:       6.44 MB

 Dynamical RAM for      wfcinit/wfcrot:      12.91 MB

 Dynamical RAM for           addusdens:     195.51 MB

 Dynamical RAM for          addusforce:     251.10 MB

 Dynamical RAM for         addusstress:     191.68 MB

 Estimated static dynamical RAM per process >     112.07 MB

 Estimated max dynamical RAM per process >     424.51 MB

 Estimated total dynamical RAM >       4.97 GB

 Initial potential from superposition of free atoms

 starting charge      37.9996, renormalised to      38.0000

 negative rho (up, down):  6.895E-04 0.000E+00
 Starting wfcs are   38 randomized atomic wfcs
 Checking if some PAW data can be deallocated... 
   PAW data deallocated on    6 nodes for type:  1

 total cpu time spent up to now is       13.2 secs

 per-process dynamical memory:   710.1 Mb

 Self-consistent Calculation

[tb_dev] Currently allocated 2.54E+00 Mbytes, locked: 0 / 10
[tb_pin] Currently allocated 0.00E+00 Mbytes, locked: 0 / 0

 iteration #  1     ecut=    60.00 Ry     beta= 0.40
 Davidson diagonalization with overlap

---- Real-time Memory Report at c_bands before calling an iterative solver
921 MiB given to the printing process from OS
699 MiB allocation reported by mallinfo(arena+hblkhd)
51110 MiB available memory on the node where the printing process lives
GPU memory used/free/total (MiB): 13337 / 2943 / 16280

 ethr =  1.00E-02,  avg # of iterations =  2.0

 negative rho (up, down):  9.346E-04 0.000E+00

superstar54 commented 9 months ago

It seems to be an out-of-memory error. Could you try using 2 or 4 nodes in Step 3?
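For context, the real-time memory report in the output above shows the single P100 almost full before the first SCF step, with all 12 MPI ranks oversubscribing one GPU. A back-of-the-envelope check, with the numbers copied from that report:

```shell
# GPU memory from the c_bands report: used/free/total = 13337/2943/16280 MiB
gpu_total_mib=16280
gpu_used_mib=13337
gpu_free_mib=$((gpu_total_mib - gpu_used_mib))
echo "free GPU memory before SCF: ${gpu_free_mib} MiB"
# With under 3 GiB left before the first iteration, the ALLOCATE failures
# follow; spreading the ranks over more nodes reduces memory per GPU.
```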

bio15 commented 9 months ago

I used 4 nodes, but I still get the same errors... Maybe we can try it together at our next meeting.

superstar54 commented 8 months ago

After meeting with @bio15 in person, we tried this example in a fresh Docker container and there was no issue anymore. I cannot reproduce the error at the moment, so I am closing this. I will reopen it if we encounter the error again.