aiidateam / aiida-quantumespresso

The official AiiDA plugin for Quantum ESPRESSO
https://aiida-quantumespresso.readthedocs.io

Check if using symlinks is enough for a QE restart #565

Closed giovannipizzi closed 3 years ago

giovannipizzi commented 3 years ago

Currently, we use symlinks for cp.x (since it uses two different files for input and output), while for pw.x we do a full copy, since e.g. an NSCF run will overwrite some files (wave functions etc.).

However, depending on how QE does this, it might be enough to do a symlink (e.g. if it just renames/deletes the files, rather than overwriting or appending).

A similar investigation might be useful for the post-processing tools (pp.x, pw2wannier90.x, ...). This will help with:

  1. speed in restarts from big folders (this now sometimes takes minutes in the upload phase, e.g. for restarts from an NSCF with a lot of kpoints)
  2. possibly much better use of disk space on the scratch of supercomputers

This is a follow-up of the discussion on aiidateam/aiida-core#4417

Pinging @ramirezfranciscof - maybe at some point you could do a couple of tests of this with some of your simulations (both NSCF from SCF, and PP restarts).

giovannipizzi commented 3 years ago

Note: to test this we can use the cp -Rs command, which recreates the folder structure but creates symlinks rather than copies of the files: https://stackoverflow.com/questions/44059472/recursively-symlink-directory-tree. This works on Linux but not on Mac, which should be OK at least for testing, since most HPC centres run Linux. There are some caveats, e.g. one needs to provide an absolute path as the source.
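
For systems where cp -Rs is not available, the same idea can be sketched in Python. This is only an illustration, not something that exists in AiiDA: the function name symlink_tree is hypothetical, and the sketch ignores edge cases such as symlinked directories inside the source.

```python
import os


def symlink_tree(src, dst):
    """Recreate the directory structure of src under dst, symlinking each file.

    Hypothetical helper, roughly mimicking `cp -Rs /abs/path/src dst`.
    Directories become real directories; only files become symlinks.
    """
    # Use an absolute source path so the symlinks do not dangle.
    src = os.path.abspath(src)
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            os.symlink(os.path.join(dirpath, name), os.path.join(target_dir, name))
```

Calling symlink_tree("scf", "nscf") would leave nscf/out/ as a real directory whose files are symlinks into scf/out/.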

If we are confident that this is safe, we should then discuss in aiida-core how to support such a new command (as an 'alternative' to remote_copy_list and remote_symlink_list, ideally in a backward-compatible way), e.g. by reopening aiidateam/aiida-core#4417 or opening a new issue (since that one was more about hard links than about recursive soft links for files only, and not directories).

greschd commented 3 years ago

If we are confident that this is safe, we should then discuss in aiida-core how to support such new command (as an 'alternative' to remote_copy_list and remote_symlink_list, ideally in a backward-compatible way).

I don't understand this entirely - can't we just use remote_symlink_list instead of remote_copy_list if the operation is safe?

giovannipizzi commented 3 years ago

No, what I meant is that it might be safe to symlink each single file, but definitely not the full out folder with a single symlink. And in the remote_copy_list we don't have an option to recursively create symlinks (nor do we know, without going to the remote, the actual number and names of the files).
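
To make the limitation concrete, here is a rough sketch of how the two lists get filled. It assumes that each entry is a (computer_uuid, absolute_source_path, relative_destination_path) tuple, which is my understanding of the CalcInfo format; the function name and the default folder are hypothetical and no AiiDA import is used.

```python
import posixpath


def restart_transfer_lists(computer_uuid, parent_remote_path, use_symlink, folder="./out/"):
    """Build the copy and symlink lists for a restart (illustrative only).

    Assumed entry layout: (computer_uuid, absolute_source, relative_destination).
    A single entry refers to the whole folder: with use_symlink=True the
    folder itself becomes one symlink, there is no per-file recursion.
    """
    entry = (computer_uuid, posixpath.join(parent_remote_path, folder), folder)
    remote_copy_list = []
    remote_symlink_list = []
    if use_symlink:
        remote_symlink_list.append(entry)
    else:
        remote_copy_list.append(entry)
    return remote_copy_list, remote_symlink_list
```

Symlinking file by file would need one entry per file, and the file names are only known on the remote.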

giovannipizzi commented 3 years ago

By the way, I just checked, and it does not work :-(

If I do a single symlink to the output folder, of course the content gets changed. But even if I symlink each file and recreate the folder structure, only new files are created from scratch; existing files are overwritten in place rather than recreated. So I get e.g. this diff (SCF of silicon with a 2x2x2 grid, then NSCF on a 4x4x4 grid after recursive symlinking with cp -Rs):

$ diff -r scf scf-copy-2
diff -r scf/out/aiida.save/data-file.xml scf-copy-2/out/aiida.save/data-file.xml
894c894
<          8
---
>          3
897c897
<     <MONKHORST_PACK_GRID nk1="4" nk2="4" nk3="4"/>
---
>     <MONKHORST_PACK_GRID nk1="2" nk2="2" nk3="2"/>
899,906c899,901
<     <K-POINT.1 XYZ="0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000" WEIGHT="3.125000000000000E-002"/>
<     <K-POINT.2 XYZ="-1.767766952966369E-001 1.767766952966369E-001 1.767766952966369E-001" WEIGHT="2.500000000000000E-001"/>
<     <K-POINT.3 XYZ="3.535533905932738E-001 -3.535533905932738E-001 -3.535533905932738E-001" WEIGHT="1.250000000000000E-001"/>
<     <K-POINT.4 XYZ="0.000000000000000E+000 0.000000000000000E+000 3.535533905932738E-001" WEIGHT="1.875000000000000E-001"/>
<     <K-POINT.5 XYZ="5.303300858899107E-001 -5.303300858899107E-001 -1.767766952966369E-001" WEIGHT="7.500000000000000E-001"/>
<     <K-POINT.6 XYZ="3.535533905932738E-001 -3.535533905932738E-001 0.000000000000000E+000" WEIGHT="3.750000000000000E-001"/>
<     <K-POINT.7 XYZ="0.000000000000000E+000 0.000000000000000E+000 -7.071067811865476E-001" WEIGHT="9.375000000000000E-002"/>
<     <K-POINT.8 XYZ="0.000000000000000E+000 3.535533905932738E-001 -7.071067811865476E-001" WEIGHT="1.875000000000000E-001"/>
---
>     <K-POINT.1 XYZ="0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000" WEIGHT="2.500000000000000E-001"/>
>     <K-POINT.2 XYZ="3.535533905932738E-001 -3.535533905932738E-001 -3.535533905932738E-001" WEIGHT="1.000000000000000E+000"/>
>     <K-POINT.3 XYZ="0.000000000000000E+000 0.000000000000000E+000 -7.071067811865476E-001" WEIGHT="7.500000000000000E-001"/>
942c937
<          8
---
>          3
962c957
<  2.388618736821756E-001
---
>  2.389777815185282E-001
971c966
<  3.125000000000000E-002
---
>  2.500000000000000E-001
979c974
< -1.767766952966369E-001  1.767766952966369E-001  1.767766952966369E-001
---
>  3.535533905932738E-001 -3.535533905932738E-001 -3.535533905932738E-001
982c977
<  2.500000000000000E-001
---
>  1.000000000000000E+000
990c985
<  3.535533905932738E-001 -3.535533905932738E-001 -3.535533905932738E-001
---
>  0.000000000000000E+000  0.000000000000000E+000 -7.071067811865476E-001
993c988
<  1.250000000000000E-001
---
>  7.500000000000000E-001
999,1053d993
<     <K-POINT.4>
<       <K-POINT_COORDS type="real" size="3" columns="3">
<  0.000000000000000E+000  0.000000000000000E+000  3.535533905932738E-001
<       </K-POINT_COORDS>
<       <WEIGHT type="real" size="1">
<  1.875000000000000E-001
<       </WEIGHT>
<       <DATAFILE iotk_link="./K00004/eigenval.xml">
<         <!--This is a link to the file indicated in the iotk_link attribute-->
<       </DATAFILE>
<     </K-POINT.4>
<     <K-POINT.5>
<       <K-POINT_COORDS type="real" size="3" columns="3">
<  5.303300858899107E-001 -5.303300858899107E-001 -1.767766952966369E-001
<       </K-POINT_COORDS>
<       <WEIGHT type="real" size="1">
<  7.500000000000000E-001
<       </WEIGHT>
<       <DATAFILE iotk_link="./K00005/eigenval.xml">
<         <!--This is a link to the file indicated in the iotk_link attribute-->
<       </DATAFILE>
<     </K-POINT.5>
<     <K-POINT.6>
<       <K-POINT_COORDS type="real" size="3" columns="3">
<  3.535533905932738E-001 -3.535533905932738E-001  0.000000000000000E+000
<       </K-POINT_COORDS>
<       <WEIGHT type="real" size="1">
<  3.750000000000000E-001
<       </WEIGHT>
<       <DATAFILE iotk_link="./K00006/eigenval.xml">
<         <!--This is a link to the file indicated in the iotk_link attribute-->
<       </DATAFILE>
<     </K-POINT.6>
<     <K-POINT.7>
<       <K-POINT_COORDS type="real" size="3" columns="3">
<  0.000000000000000E+000  0.000000000000000E+000 -7.071067811865476E-001
<       </K-POINT_COORDS>
<       <WEIGHT type="real" size="1">
<  9.375000000000000E-002
<       </WEIGHT>
<       <DATAFILE iotk_link="./K00007/eigenval.xml">
<         <!--This is a link to the file indicated in the iotk_link attribute-->
<       </DATAFILE>
<     </K-POINT.7>
<     <K-POINT.8>
<       <K-POINT_COORDS type="real" size="3" columns="3">
<  0.000000000000000E+000  3.535533905932738E-001 -7.071067811865476E-001
<       </K-POINT_COORDS>
<       <WEIGHT type="real" size="1">
<  1.875000000000000E-001
<       </WEIGHT>
<       <DATAFILE iotk_link="./K00008/eigenval.xml">
<         <!--This is a link to the file indicated in the iotk_link attribute-->
<       </DATAFILE>
<     </K-POINT.8>
1071,1090d1010
<        754
<       </NUMBER_OF_GK-VECTORS>
<     </K-POINT.3>
<     <K-POINT.4>
<       <NUMBER_OF_GK-VECTORS type="integer" size="1">
<        729
<       </NUMBER_OF_GK-VECTORS>
<     </K-POINT.4>
<     <K-POINT.5>
<       <NUMBER_OF_GK-VECTORS type="integer" size="1">
<        748
<       </NUMBER_OF_GK-VECTORS>
<     </K-POINT.5>
<     <K-POINT.6>
<       <NUMBER_OF_GK-VECTORS type="integer" size="1">
<        754
<       </NUMBER_OF_GK-VECTORS>
<     </K-POINT.6>
<     <K-POINT.7>
<       <NUMBER_OF_GK-VECTORS type="integer" size="1">
1093,1098c1013
<     </K-POINT.7>
<     <K-POINT.8>
<       <NUMBER_OF_GK-VECTORS type="integer" size="1">
<        744
<       </NUMBER_OF_GK-VECTORS>
<     </K-POINT.8>
---
>     </K-POINT.3>
diff -r scf/out/aiida.save/K00001/eigenval.xml scf-copy-2/out/aiida.save/K00001/eigenval.xml
10,13c10,13
< -2.027656782660340E-001
<  2.388618718824626E-001
<  2.388618731540466E-001
<  2.388618736821756E-001
---
> -2.026766002242699E-001
>  2.389777791336783E-001
>  2.389777793557880E-001
>  2.389777815185282E-001
diff -r scf/out/aiida.save/K00002/eigenval.xml scf-copy-2/out/aiida.save/K00002/eigenval.xml
10,13c10,13
< -1.732448631202525E-001
<  9.288179851004172E-002
<  2.100024722581097E-001
<  2.100024726309614E-001
---
> -1.162904219815684E-001
> -1.975193424401684E-002
>  1.937984240115659E-001
>  1.937984247964259E-001
diff -r scf/out/aiida.save/K00003/eigenval.xml scf-copy-2/out/aiida.save/K00003/eigenval.xml
10,13c10,13
< -1.163919032870081E-001
< -1.983284207955713E-002
<  1.936927390442563E-001
<  1.936927394715175E-001
---
> -4.961207376678188E-002
> -4.961207347793863E-002
>  1.316940348744310E-001
>  1.316940354699567E-001
Binary files scf/out/aiida.wfc1 and scf-copy-2/out/aiida.wfc1 differ
Binary files scf/out/aiida.wfc2 and scf-copy-2/out/aiida.wfc2 differ
Binary files scf/out/aiida.wfc3 and scf-copy-2/out/aiida.wfc3 differ
Binary files scf/out/aiida.wfc4 and scf-copy-2/out/aiida.wfc4 differ
Binary files scf/out/aiida.wfc5 and scf-copy-2/out/aiida.wfc5 differ
Binary files scf/out/aiida.wfc6 and scf-copy-2/out/aiida.wfc6 differ
Binary files scf/out/aiida.wfc7 and scf-copy-2/out/aiida.wfc7 differ
Binary files scf/out/aiida.wfc8 and scf-copy-2/out/aiida.wfc8 differ

giovannipizzi commented 3 years ago

(Here scf is the folder that the symlinks point to, and scf-copy-2 is a backup copy of it kept for reference.)
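
The difference between the two ways of writing can be reproduced without QE. The following is a minimal sketch with made-up file names and contents: opening a symlink for writing follows the link and rewrites the source, while removing the link first and then creating a new file leaves the source untouched.

```python
import os
import tempfile


def demo_overwrite_vs_recreate():
    """Return the source content after an in-place overwrite and after a recreate."""
    root = tempfile.mkdtemp()
    source = os.path.join(root, "source.dat")
    link = os.path.join(root, "link.dat")

    with open(source, "w") as fh:
        fh.write("scf data")
    os.symlink(source, link)

    # Case 1: writing through the symlink truncates and rewrites the source file.
    with open(link, "w") as fh:
        fh.write("nscf data")
    with open(source) as fh:
        after_overwrite = fh.read()

    # Restore the source (the symlink still points to it).
    with open(source, "w") as fh:
        fh.write("scf data")

    # Case 2: delete the symlink itself, then create a new regular file.
    os.remove(link)
    with open(link, "w") as fh:
        fh.write("nscf data")
    with open(source) as fh:
        after_recreate = fh.read()

    return after_overwrite, after_recreate
```

The diff above corresponds to case 1; symlinks would only be safe if the code behaved as in case 2.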

greschd commented 3 years ago

Unfortunate... but I guess we can close this, then?

giovannipizzi commented 3 years ago

BTW, I also tried with hard linking, and indeed (as @greschd had pointed out) I get the same behaviour as with symlinks, i.e. the source files are modified :-(

Should we open a feature request with Quantum ESPRESSO, or in the end do we not care that we need to copy a lot of files at every resubmit? (This is problematic for huge systems with a lot of kpoints and wavefunctions; I have transfers of ~100 GB that take tens of minutes.)
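
The hard-link case can be checked in the same way. In this small sketch (file names and contents are made up), both names refer to the same inode, so a write through either name changes the data seen through the other.

```python
import os
import tempfile


def demo_hard_link():
    """Write through a hard link and report what happens to the original name."""
    root = tempfile.mkdtemp()
    source = os.path.join(root, "source.dat")
    link = os.path.join(root, "hardlink.dat")

    with open(source, "w") as fh:
        fh.write("scf data")
    os.link(source, link)  # second directory entry for the same inode

    same_inode = os.stat(source).st_ino == os.stat(link).st_ino
    link_count = os.stat(source).st_nlink

    # Opening with "w" truncates the shared inode, it does not create a new file.
    with open(link, "w") as fh:
        fh.write("nscf data")
    with open(source) as fh:
        content = fh.read()

    return content, same_inode, link_count
```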

greschd commented 3 years ago

I mean, it's worth getting the opinion of QE developers at least.

This can of course also be solved at the filesystem level, but I don't know if there is a parallel FS out there that does deduplication.

greschd commented 3 years ago

@giovannipizzi do you know if pw2wannier90.x overwrites the QE files, or if symlinking would be appropriate there?

giovannipizzi commented 3 years ago

Good question. I think it doesn't - it should just read them and write the new files?

greschd commented 3 years ago

Ok, I think we should check and use symlinks if it indeed doesn't. The copy can take quite a while there.