Closed bio15 closed 8 months ago
Hi @bio15, could you inspect the process in the terminal by running:
# list all processes
verdi process list -a
# find the XpsWorkChain and its pk (number), then report on it
verdi process report <pk>
and share all the output here?
You can also report on other pks, such as the failed PwCalculation process.
Hi @superstar54, here is the list of processes for one run: There is no XpsWorkChain. But the other is:
(base) jovyan@fc3faa2ad794:~$ verdi process report 1140
2024-03-01 13:23:24 [338 | REPORT]: [1140|QeAppWorkChain|run_relax]: launching PwRelaxWorkChain<1142>
2024-03-01 13:23:25 [339 | REPORT]: [1142|PwRelaxWorkChain|run_relax]: launching PwBaseWorkChain<1145>
2024-03-01 13:23:26 [340 | REPORT]: [1145|PwBaseWorkChain|run_process]: launching PwCalculation<1150> iteration #1
2024-03-01 13:26:46 [345 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: PwCalculation<1150> failed with exit status 305: Both the stdout and XML output files could not be read or parsed.
2024-03-01 13:26:46 [346 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2024-03-01 13:26:46 [347 | REPORT]: [1145|PwBaseWorkChain|inspect_process]: PwCalculation<1150> failed but a handler detected an unrecoverable problem, aborting
2024-03-01 13:26:46 [348 | REPORT]: [1145|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2024-03-01 13:26:46 [349 | REPORT]: [1142|PwRelaxWorkChain|inspect_relax]: relax PwBaseWorkChain failed with exit status 300
2024-03-01 13:26:46 [350 | REPORT]: [1142|PwRelaxWorkChain|on_terminated]: remote folders will not be cleaned
2024-03-01 13:26:47 [351 | REPORT]: [1140|QeAppWorkChain|inspect_relax]: PwRelaxWorkChain failed with exit status 401
2024-03-01 13:26:47 [352 | REPORT]: [1140|QeAppWorkChain|on_terminated]: remote folders will not be cleaned
verdi process report 1142
2024-03-01 13:23:25 [339 | REPORT]: [1142|PwRelaxWorkChain|run_relax]: launching PwBaseWorkChain<1145>
2024-03-01 13:23:26 [340 | REPORT]: [1145|PwBaseWorkChain|run_process]: launching PwCalculation<1150> iteration #1
2024-03-01 13:26:46 [345 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: PwCalculation<1150> failed with exit status 305: Both the stdout and XML output files could not be read or parsed.
2024-03-01 13:26:46 [346 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2024-03-01 13:26:46 [347 | REPORT]: [1145|PwBaseWorkChain|inspect_process]: PwCalculation<1150> failed but a handler detected an unrecoverable problem, aborting
2024-03-01 13:26:46 [348 | REPORT]: [1145|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
2024-03-01 13:26:46 [349 | REPORT]: [1142|PwRelaxWorkChain|inspect_relax]: relax PwBaseWorkChain failed with exit status 300
2024-03-01 13:26:46 [350 | REPORT]: [1142|PwRelaxWorkChain|on_terminated]: remote folders will not be cleaned
verdi process report 1145
2024-03-01 13:23:26 [340 | REPORT]: [1145|PwBaseWorkChain|run_process]: launching PwCalculation<1150> iteration #1
2024-03-01 13:26:46 [345 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: PwCalculation<1150> failed with exit status 305: Both the stdout and XML output files could not be read or parsed.
2024-03-01 13:26:46 [346 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: Action taken: unrecoverable error, aborting...
2024-03-01 13:26:46 [347 | REPORT]: [1145|PwBaseWorkChain|inspect_process]: PwCalculation<1150> failed but a handler detected an unrecoverable problem, aborting
2024-03-01 13:26:46 [348 | REPORT]: [1145|PwBaseWorkChain|on_terminated]: remote folders will not be cleaned
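The three reports above repeat the same failure as it propagates from PwCalculation<1150> up to QeAppWorkChain<1140>. As a rough illustration only, the chain of exit statuses can be pulled out of pasted report text with a few lines of Python. The function name `failed_steps` and the regular expressions are my own and are not part of AiiDA; the sketch assumes the line layout shown in the reports above.

```python
import re

# Start of a report entry, e.g. "2024-03-01 13:26:46 [345 | REPORT]: ..."
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[)")

# "[<reporter pk>|<reporter>|<step>]: ... <child>[<child pk>] failed with exit status <n>"
FAILED = re.compile(
    r"\[(\d+)\|(\w+)\|\w+\]: .*?(\w+)(?:<(\d+)>)? failed with exit status\s*(\d+)"
)


def failed_steps(report: str):
    """Return (reporter_pk, reporter, child, child_pk, exit_status) tuples.

    Hypothetical helper: it only understands the report layout pasted above.
    """
    steps = []
    # Split on timestamps so it also works if the paste lost its line breaks.
    for entry in ENTRY_START.split(report):
        match = FAILED.search(entry)
        if match:
            reporter_pk, reporter, child, child_pk, status = match.groups()
            steps.append(
                (
                    int(reporter_pk),
                    reporter,
                    child,
                    int(child_pk) if child_pk else None,
                    int(status),
                )
            )
    return steps


# Excerpt of the report for pk 1140 shown above.
REPORT = """\
2024-03-01 13:26:46 [345 | REPORT]: [1145|PwBaseWorkChain|report_error_handled]: PwCalculation<1150> failed with exit status 305: Both the stdout and XML output files could not be read or parsed.
2024-03-01 13:26:46 [347 | REPORT]: [1145|PwBaseWorkChain|inspect_process]: PwCalculation<1150> failed but a handler detected an unrecoverable problem, aborting
2024-03-01 13:26:46 [349 | REPORT]: [1142|PwRelaxWorkChain|inspect_relax]: relax PwBaseWorkChain failed with exit status 300
2024-03-01 13:26:47 [351 | REPORT]: [1140|QeAppWorkChain|inspect_relax]: PwRelaxWorkChain failed with exit status 401
"""

for step in failed_steps(REPORT):
    print(step)
```

The first tuple points at the process worth inspecting next, here PwCalculation with pk 1150.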
1150: None
Scheduler output:
Batch Job Summary Report (version 21.01.1) for Job "aiida-1150" (52051020) on daint
Submit Eligible Start End Elapsed Time limit
Username Account Partition NNodes Energy
sgramatt emshare normal 1 3.476 kJ
Node name Usage Max mem Execution time
nid04173 15 % 16255 MiB 00:00:16
Scheduler errors:
Switching to atp/3.14.5.
Switching to cray-mpich/7.7.18.
Switching to craype/2.7.10.
Switching to modules/3.2.11.4.
Switching to nvidia/21.3.
Switching to perftools-base/21.09.0.
Switching to pmi/5.0.17.
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 32158080 bytes requested; status = 2(out of memory)
0: ALLOCATE: 32158080 bytes requested; status = 2(out of memory)
srun: error: nid04173: tasks 0-6,8-11: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=52051020.0
slurmstepd: error: STEP 52051020.0 ON nid04173 CANCELLED AT 2024-03-01T14:25:23 ***
srun: error: nid04173: task 7: Terminated
srun: Force Terminated StepId=52051020.0
*** 4 LOG MESSAGES:
+-> WARNING at 2024-03-01 13:26:46.119489+00:00 | key 'symmetries' is not present in raw output dictionary
+-> ERROR at 2024-03-01 13:26:46.235993+00:00 | ERROR_OUTPUT_STDOUT_INCOMPLETE
+-> ERROR at 2024-03-01 13:26:46.244600+00:00 | Both the stdout and XML output files could not be read or parsed.
+-> WARNING at 2024-03-01 13:26:46.249707+00:00 | output parser returned exit code<305>: Both the stdout and XML output files could not be read or parsed.
Thanks for your help
Simon
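The scheduler errors in the report for 1150 mix harmless module messages with the lines that matter. As a minimal sketch, the relevant messages can be counted with a small Python script. The name `summarise_stderr` and the patterns are my own choices for illustration; they assume the message wording shown above and the sample string is only a short excerpt.

```python
import re

OMP_WARNING = re.compile(r"OMP: Warning #\d+: OMP_NUM_THREADS")
ALLOC_FAILED = re.compile(
    r"ALLOCATE: (\d+) bytes requested; status = \d+\(out of memory\)"
)
EXITED = re.compile(
    r"srun: error: \S+ tasks? ([\d,\-]+): Exited with exit code (\d+)"
)


def summarise_stderr(stderr: str) -> dict:
    """Count the warnings and failed allocations in scheduler stderr text.

    Hypothetical helper, written for the message formats pasted above.
    """
    failed = [int(nbytes) for nbytes in ALLOC_FAILED.findall(stderr)]
    exited = EXITED.search(stderr)
    return {
        "omp_warnings": len(OMP_WARNING.findall(stderr)),
        "failed_allocations": len(failed),
        # Largest single request that could not be satisfied, in MiB.
        "largest_request_mib": max(failed) / 2**20 if failed else 0.0,
        "exit_code": int(exited.group(2)) if exited else None,
    }


# Short excerpt of the scheduler errors shown above.
STDERR = '''\
OMP: Warning #234: OMP_NUM_THREADS: Invalid symbols found. Check the value "".
0: ALLOCATE: 144711360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 144711936 bytes requested; status = 2(out of memory)
0: ALLOCATE: 32158080 bytes requested; status = 2(out of memory)
srun: error: nid04173: tasks 0-6,8-11: Exited with exit code 127
'''

summary = summarise_stderr(STDERR)
print(summary)
```

On the full output above this would report 11 failed allocations, which matches the 11 tasks (0-6 and 8-11) that exited with code 127.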
Thanks for the output. Could you go to the calculation folder on daint with
verdi calcjob gotocomputer 1150
There will be aiida.in and aiida.out files there. Could you show the content of both files?
Here is the content of the files:
&CONTROL
  calculation = 'vc-relax'
  etot_conv_thr = 1.4000000000d-03
  forc_conv_thr = 1.0000000000d-03
  max_seconds = 4.1040000000d+04
  outdir = './out/'
  prefix = 'aiida'
  pseudo_dir = './pseudo/'
  tprnfor = .true.
  tstress = .true.
  verbosity = 'high'
/
&SYSTEM
  ecutrho = 4.8000000000d+02
  ecutwfc = 6.0000000000d+01
  ibrav = 0
  nat = 14
  nosym = .false.
  ntyp = 2
  occupations = 'fixed'
  tot_charge = 0.0000000000d+00
  vdw_corr = 'none'
/
&ELECTRONS
  conv_thr = 5.6000000000d-09
  electron_maxstep = 80
  mixing_beta = 4.0000000000d-01
/
&IONS
/
&CELL
  cell_dofree = 'all'
  press_conv_thr = 5.0000000000d-01
/
ATOMIC_SPECIES
C 12.011 C.pbe-n-kjpaw_psl.1.0.0.UPF
H 1.008 H.pbe-rrkjus_psl.1.0.0.UPF
ATOMIC_POSITIONS angstrom
C 8.8758000000 7.1581000000 5.0001000000
C 8.1783000000 8.3661000000 5.0001000000
C 8.1783000000 5.9500000000 5.0001000000
C 6.7835000000 8.3663000000 5.0001000000
C 6.7834000000 5.9502000000 5.0001000000
C 6.0861000000 7.1583000000 5.0001000000
C 10.3048000000 7.1581000000 5.0001000000
C 11.5075000000 7.1584000000 5.0001000000
H 8.7075000000 9.3161000000 5.0000000000
H 8.7075000000 5.0000000000 5.0001000000
H 6.2403000000 9.3068000000 5.0001000000
H 6.2401000000 5.0098000000 5.0001000000
H 5.0000000000 7.1583000000 5.0002000000
H 12.5724000000 7.1585000000 5.0002000000
K_POINTS automatic
1 1 2 0 0 0
CELL_PARAMETERS angstrom
17.5724000000 0.0000000000 0.0000000000
0.0000000000 14.3161000000 0.0000000000
0.0000000000 0.0000000000 10.0002000000
And: Program PWSCF v.7.2 starts on 1Mar2024 at 14:25: 8
This program is part of the open-source Quantum ESPRESSO suite
for quantum simulation of materials; please cite
"P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
"P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
"P. Giannozzi et al., J. Chem. Phys. 152 154105 (2020);
URL http://www.quantum-espresso.org",
in publications or presentations arising from this work. More details at
http://www.quantum-espresso.org/quote
Parallel version (MPI & OpenMP), running on 12 processor cores
Number of MPI processes: 12
Threads/MPI process: 1
MPI processes distributed on 1 nodes
58588 MiB available memory on the printing compute node when the environment starts
Reading input from aiida.in
Current dimensions of program PWSCF are:
Max number of different atomic species (ntypx) = 10
Max number of k-points (npk) = 40000
Max angular momentum in pseudopotentials (lmaxx) = 4
file C.pbe-n-kjpaw_psl.1.0.0.UPF: wavefunction(s) 2S 2P renormalized
R & G space division: proc/nbgrp/npool/nimage = 12
Subspace diagonalization in iterative solution of the eigenvalue problem:
a serial algorithm will be used
Parallelization info
--------------------
sticks: dense smooth PW G-vecs: dense smooth PW
Min 2859 1429 371 251235 88837 11829
Max 2861 1430 373 251236 88840 11830
Sum 34319 17155 4463 3014825 1066063 141957
Using Slab Decomposition
bravais-lattice index = 0
lattice parameter (alat) = 33.2070 a.u.
unit-cell volume = 16977.0056 (a.u.)^3
number of atoms/cell = 14
number of atomic types = 2
number of electrons = 38.00
number of Kohn-Sham states= 19
kinetic-energy cutoff = 60.0000 Ry
charge density cutoff = 480.0000 Ry
scf convergence threshold = 5.6E-09
mixing beta = 0.4000
number of iterations used = 8 plain mixing
energy convergence thresh.= 1.4E-03
force convergence thresh. = 1.0E-03
press convergence thresh. = 5.0E-01
Exchange-correlation= PBE
( 1 4 3 4 0 0 0)
nstep = 50
GPU acceleration is ACTIVE.
Message from routine print_cuda_info:
High GPU oversubscription detected. Are you sure this is what you want?
GPU used by master process:
Device Number: 0
Device name: Tesla P100-PCIE-16GB
Compute capability : 60
Ratio of single to double precision performance : 2
Memory Clock Rate (KHz): 715000
Memory Bus Width (bits): 4096
Peak Memory Bandwidth (GB/s): 732.16
celldm(1)= 33.207023 celldm(2)= 0.000000 celldm(3)= 0.000000
celldm(4)= 0.000000 celldm(5)= 0.000000 celldm(6)= 0.000000
crystal axes: (cart. coord. in units of alat)
a(1) = ( 1.000000 0.000000 0.000000 )
a(2) = ( 0.000000 0.814692 0.000000 )
a(3) = ( 0.000000 0.000000 0.569086 )
reciprocal axes: (cart. coord. in units 2 pi/alat)
b(1) = ( 1.000000 0.000000 0.000000 )
b(2) = ( 0.000000 1.227457 0.000000 )
b(3) = ( 0.000000 0.000000 1.757205 )
PseudoPot. # 1 for C read from file:
./pseudo/C.pbe-n-kjpaw_psl.1.0.0.UPF
MD5 check sum: 5d2aebdfa2cae82b50a7e79e9516da0f
Pseudo is Projector augmented-wave + core cor, Zval = 4.0
Generated using "atomic" code by A. Dal Corso v.5.1
Shape of augmentation charge: PSQ
Using radial grid of 1073 points, 4 beta functions with:
l(1) = 0
l(2) = 0
l(3) = 1
l(4) = 1
Q(r) pseudized with 0 coefficients
PseudoPot. # 2 for H read from file:
./pseudo/H.pbe-rrkjus_psl.1.0.0.UPF
MD5 check sum: f52b6d4d1c606e5624b1dc7b2218f220
Pseudo is Ultrasoft, Zval = 1.0
Generated using "atomic" code by A. Dal Corso v.5.1
Using radial grid of 929 points, 2 beta functions with:
l(1) = 0
l(2) = 0
Q(r) pseudized with 0 coefficients
atomic species valence mass pseudopotential
C 4.00 12.01100 C ( 1.00)
H 1.00 1.00800 H ( 1.00)
No symmetry found
s frac. trans.
isym = 1 identity
cryst. s( 1) = ( 1 0 0 ) ( 0 1 0 ) ( 0 0 1 )
cart. s( 1) = ( 1.0000000 0.0000000 0.0000000 ) ( 0.0000000 1.0000000 0.0000000 ) ( 0.0000000 0.0000000 1.0000000 )
Cartesian axes
site n. atom positions (alat units)
1 C tau( 1) = ( 0.5050989 0.4073490 0.2845428 )
2 C tau( 2) = ( 0.4654060 0.4760932 0.2845428 )
3 C tau( 3) = ( 0.4654060 0.3385992 0.2845428 )
4 C tau( 4) = ( 0.3860315 0.4761046 0.2845428 )
5 C tau( 5) = ( 0.3860258 0.3386105 0.2845428 )
6 C tau( 6) = ( 0.3463443 0.4073604 0.2845428 )
7 C tau( 7) = ( 0.5864196 0.4073490 0.2845428 )
8 C tau( 8) = ( 0.6548622 0.4073661 0.2845428 )
9 H tau( 9) = ( 0.4955214 0.5301552 0.2845371 )
10 H tau( 10) = ( 0.4955214 0.2845371 0.2845428 )
11 H tau( 11) = ( 0.3551194 0.5296260 0.2845428 )
12 H tau( 12) = ( 0.3551080 0.2850948 0.2845428 )
13 H tau( 13) = ( 0.2845371 0.4073604 0.2845485 )
14 H tau( 14) = ( 0.7154629 0.4073718 0.2845485 )
Crystallographic axes
site n. atom positions (cryst. coord.)
1 C tau( 1) = ( 0.5050989 0.5000035 0.5000000 )
2 C tau( 2) = ( 0.4654060 0.5843840 0.5000000 )
3 C tau( 3) = ( 0.4654060 0.4156160 0.5000000 )
4 C tau( 4) = ( 0.3860315 0.5843980 0.5000000 )
5 C tau( 5) = ( 0.3860258 0.4156300 0.5000000 )
6 C tau( 6) = ( 0.3463443 0.5000175 0.5000000 )
7 C tau( 7) = ( 0.5864196 0.5000035 0.5000000 )
8 C tau( 8) = ( 0.6548622 0.5000244 0.5000000 )
9 H tau( 9) = ( 0.4955214 0.6507429 0.4999900 )
10 H tau( 10) = ( 0.4955214 0.3492571 0.5000000 )
11 H tau( 11) = ( 0.3551194 0.6500933 0.5000000 )
12 H tau( 12) = ( 0.3551080 0.3499417 0.5000000 )
13 H tau( 13) = ( 0.2845371 0.5000175 0.5000100 )
14 H tau( 14) = ( 0.7154629 0.5000314 0.5000100 )
number of k points= 2
cart. coord. in units 2pi/alat
k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 1.0000000
k( 2) = ( 0.0000000 0.0000000 -0.8786024), wk = 1.0000000
cryst. coord.
k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 1.0000000
k( 2) = ( 0.0000000 0.0000000 -0.5000000), wk = 1.0000000
Dense grid: 3014825 G-vectors FFT dimensions: ( 240, 192, 135)
Smooth grid: 1066063 G-vectors FFT dimensions: ( 180, 135, 96)
Dynamical RAM for wfc: 3.22 MB
Dynamical RAM for wfc (w. buffer): 9.66 MB
Dynamical RAM for str. fact: 7.67 MB
Dynamical RAM for local pot: 0.00 MB
Dynamical RAM for nlocal pot: 12.88 MB
Dynamical RAM for qrad: 2.01 MB
Dynamical RAM for rho,v,vnew: 24.16 MB
Dynamical RAM for rhoin: 8.05 MB
Dynamical RAM for rho*nmix: 61.34 MB
Dynamical RAM for G-vectors: 15.05 MB
Dynamical RAM for h,s,v(r/c): 0.07 MB
Dynamical RAM for <psi|beta>: 0.02 MB
Dynamical RAM for psi: 6.44 MB
Dynamical RAM for hpsi: 6.44 MB
Dynamical RAM for spsi: 6.44 MB
Dynamical RAM for wfcinit/wfcrot: 12.91 MB
Dynamical RAM for addusdens: 195.51 MB
Dynamical RAM for addusforce: 251.10 MB
Dynamical RAM for addusstress: 191.68 MB
Estimated static dynamical RAM per process > 112.07 MB
Estimated max dynamical RAM per process > 424.51 MB
Estimated total dynamical RAM > 4.97 GB
Initial potential from superposition of free atoms
starting charge 37.9996, renormalised to 38.0000
negative rho (up, down): 6.895E-04 0.000E+00
Starting wfcs are 38 randomized atomic wfcs
Checking if some PAW data can be deallocated...
PAW data deallocated on 6 nodes for type: 1
total cpu time spent up to now is 13.2 secs
per-process dynamical memory: 710.1 Mb
Self-consistent Calculation
[tb_dev] Currently allocated 2.54E+00 Mbytes, locked: 0 / 10 [tb_pin] Currently allocated 0.00E+00 Mbytes, locked: 0 / 0
iteration # 1 ecut= 60.00 Ry beta= 0.40
Davidson diagonalization with overlap
ethr = 1.00E-02, avg # of iterations = 2.0
negative rho (up, down): 9.346E-04 0.000E+00
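One quick sanity check on the two files is that pw.x read the cell as intended. The sketch below converts the CELL_PARAMETERS from aiida.in to atomic units and compares the result with the "unit-cell volume" line in aiida.out. The Bohr radius is the CODATA 2018 value, which may differ in the last digits from the constant compiled into Quantum ESPRESSO, so only approximate agreement should be expected. The helper name `cell_volume` is my own.

```python
BOHR_IN_ANGSTROM = 0.529177210903  # CODATA 2018 Bohr radius, in angstrom


def cell_volume(cell):
    """Volume of a cell given as three row vectors (scalar triple product)."""
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = cell
    return abs(
        ax * (by * cz - bz * cy)
        - ay * (bx * cz - bz * cx)
        + az * (bx * cy - by * cx)
    )


# CELL_PARAMETERS from aiida.in, in angstrom.
CELL_ANGSTROM = [
    (17.5724, 0.0, 0.0),
    (0.0, 14.3161, 0.0),
    (0.0, 0.0, 10.0002),
]

# "unit-cell volume" as printed in aiida.out, in (a.u.)^3.
REPORTED_VOLUME_BOHR3 = 16977.0056

volume_a3 = cell_volume(CELL_ANGSTROM)
volume_bohr3 = volume_a3 / BOHR_IN_ANGSTROM**3

print(f"input cell: {volume_a3:.2f} A^3 = {volume_bohr3:.2f} bohr^3")
print(f"difference from aiida.out: {volume_bohr3 - REPORTED_VOLUME_BOHR3:+.4f} bohr^3")
```

The two volumes agree closely, so the input was parsed as written. The cell is large for a 14-atom molecule, which goes together with the roughly 3 million dense G-vectors reported in the output.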
It seems the calculation ran out of memory. Could you try using 2 or 4 nodes in Step 3?
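For a rough feel of what changing the node count does, here is a back-of-the-envelope sketch using numbers from aiida.out: 12 MPI processes on one node, a Tesla P100 with a nominal 16 GB, and 3014825 dense G-vectors. It rests on assumptions that the thread does not confirm: one GPU per node, the same 12 ranks on every node, device memory shared evenly between the ranks on a GPU, and G-vectors distributed evenly over all ranks. The function name `nominal_share_mib` is made up for this example.

```python
def nominal_share_mib(gpu_mem_mib: float, ranks_per_node: int, gpus_per_node: int = 1) -> float:
    """Nominal device memory per MPI rank if ranks share the GPUs evenly.

    Assumption: even sharing; real usage also includes per-process overhead.
    """
    return gpu_mem_mib * gpus_per_node / ranks_per_node


GPU_MEM_MIB = 16 * 1024   # nominal size of the Tesla P100-PCIE-16GB
RANKS_PER_NODE = 12       # "Number of MPI processes: 12" on 1 node
DENSE_GVECS = 3014825     # "Dense grid: 3014825 G-vectors"

print("nodes  ranks  MiB/rank  G-vectors/rank")
for nodes in (1, 2, 4):
    total_ranks = nodes * RANKS_PER_NODE
    share = nominal_share_mib(GPU_MEM_MIB, RANKS_PER_NODE)
    print(f"{nodes:5d}  {total_ranks:5d}  {share:8.1f}  {DENSE_GVECS // total_ranks:14d}")
```

Under these assumptions the nominal share of device memory per rank is the same for 1, 2 and 4 nodes, because the number of ranks per GPU does not change. Only the amount of data each rank holds goes down.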
I used 4 nodes, still the same errors... Maybe we can try it together in our next meeting.
After meeting with @bio15 in person, we tried this example in a fresh docker container, and the issue no longer occurred. I cannot reproduce the error for the moment, so I closed the issue. I will reopen it if we encounter the error again.
Dear QE-team,
I would like to use the QE app to calculate XPS spectra, but the workflow is not finishing.
I followed the tutorial at https://aiidalab-qe.readthedocs.io/howto/xps.html but I never reach step 4, as my simulation stops before that. I am trying to run on daint with GPUs. This is the result I get.
Best
Simon