Automatic parallelization of PW is broken

borellim commented 5 years ago

[x] run_init() crashes at PwCalculation.process() -- Addressed in #326
[x] XML schema validation fails with init-only calculations, because the XML file is incomplete (this error is only logged) -- Moved to #327
[x] The actual parser also fails for the same reason -- Moved to #327
[x] Finally: is the automatic parallelization algorithm still good? -- Moved to #328

Example of how to reproduce: aiida-quantumespresso workflow launch pw-base -X 9756 -s 9752 -p SSSP_eff_PBE_v1.1_b -a

Example errors (items 2 and 3 above; this is after patching item 1): failed_parse_init_pw_calc.txt
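
For illustration only, here is a minimal sketch of how a parser could detect an incomplete XML file up front, before schema validation or full parsing is attempted. It is not the plugin's actual code: the function name `check_xml_complete` and the required section name `output` are assumptions made for the example, and the real file may use different or namespaced element names.

```python
import xml.etree.ElementTree as ET


def check_xml_complete(xml_text, required=("output",)):
    """Return (ok, reason) for an XML string.

    `required` lists direct children of the root that must be present.
    The default tag name is a hypothetical placeholder, not taken from
    the actual schema.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # A truncated file (e.g. after a crash) is usually not well-formed.
        return False, f"not well-formed: {exc}"

    # Well-formed, but sections may still be absent (e.g. init-only run).
    missing = [tag for tag in required if root.find(tag) is None]
    if missing:
        return False, "missing sections: " + ", ".join(missing)
    return True, ""
```

With a check like this, the parser could log one clear "XML incomplete" message and skip validation, instead of failing twice for the same reason.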

sphuber commented 5 years ago

I was just looking into this, since I am rewriting the parser and wanted to include support for initialization calculations. However, I cannot even run an initialization calculation with pw.x 6.3 and the new XML output without it crashing. A normal calculation works fine with the version I compiled on my workstation, but in initialization mode I get the following segmentation fault:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7FE5CE331E08
#1  0x7FE5CE330F90
#2  0x7FE5CDA624AF
#3  0x7FE5CDB7AFD5
#4  0x4DDB71 in __pw_restart_new_MOD_pw_write_schema at pw_restart_new.f90:158 (discriminator 22)
#5  0x4D43B0 in punch_ at punch.f90:73
#6  0x4F5219 in run_pwscf_ at run_pwscf.f90:120
#7  0x407AFA in MAIN__ at pwscf.f90:77
_aiidasubmit.sh: line 6: 21705 Segmentation fault      (core dumped) '/home/sphuber/code/qe/qe-6.3/bin/pw_new_xml.x' '-in' 'aiida.in' > 'aiida.out'

Unfortunately, there is no more information than that. The standard output is cut off: stdout_pw6.3.txt

Compare that with the output of a successful initialization run with v6.1: stdout_pw6.1.txt

@borellim Have you been able to do an initialization run with pw.x >= 6.3 and the new XML?

sphuber commented 5 years ago

@borellim I have opened PR #326, which fixes the functionality for codes with the old XML. Since the failure with the new XML is a problem of the parser and not the workchain, I propose we close this with the referenced PR and open a new issue describing the problem of initialization runs with the new XML output. Since I am reworking the parsing in the 3dd branch anyway, I will take care of it there, because otherwise the merge conflicts would be pretty big.

borellim commented 5 years ago

Sure, ok. I'm only concerned about the last point: "is the automatic parallelization algorithm still good?" Maybe the answer is yes for QE and no for QE+SIRIUS...

sphuber commented 5 years ago

Honestly, that completely depends on what "good" means. I just ported the original solution from Mounet's implementation in the legacy workflows. The heuristics are based on some benchmarks they ran on Dora, which I think is an old CSCS machine. I think this should also go in a separate issue, since it is a whole study in and of itself.

Edit: opened the separate issue #328

borellim commented 5 years ago

Agreed. Thanks a lot. OK to close this for me.

aiidateam / aiida-quantumespresso

Automatic parallelization of PW is broken #306