elcorto / pwtools

pwtools is a Python package for pre- and postprocessing of atomistic calculations, mostly targeted to Quantum Espresso, CPMD, CP2K and LAMMPS. It is almost, but not quite, entirely unlike ASE, with some tools extending numpy/scipy. It has a set of powerful parsers and data types for storing calculation data.
https://elcorto.github.io/pwtools
BSD 3-Clause "New" or "Revised" License
63 stars 12 forks source link

Question about reusing dcd code #1

Closed patrickmelix closed 5 years ago

patrickmelix commented 5 years ago

Hey,

since I could not find any other contact I'll try like this:

I've been using your pwtools a lot for reading the dcd files produced by CP2K. Since it seems you're one of the few to have knowledge about the exact file format used by CP2K I am asking you:

Would you allow me to use your file format information from pwtools/dcd.py to implement a direct reader in ASE and PLAMS?

Your tool has been working fine for me now for a long time, but always skipping between modules creates some problems over time (not taking into account the inefficiency every time I read a dcd file). Since ASE is now a pretty well developed piece that a lot of people use, I would like it to be able to read CP2Ks dcd files natively there. If you're not happy with the licencing of PLAMS I can stick with ASE, that would already be a big help.

Thanks for considering! Cheers, Patrick

elcorto commented 5 years ago

Hi Patrick

Sure, I'd be glad to see this functionality be implemented in ASE and/or PLAMS :)

Since both projects use the LGPL and pwtools uses the 3-Clause BSD license, mentioning pwtools in accordance to the BSD license should suffice.

Regarding the DCD file format: Much of the credit actually goes to the VMD folks since I used their DCD plugin as a reference [1,2]. I also talked to the CP2K devs when developing the parser, which was a big help as well [3]. Details of the format, VMD's implementation and how CP2K writes DCD is in src/dcd.f90. These are the Fortran sources for the first parser I wrote (the Python wrapper is pwtools.dcd.read_dcd_data_f()), but make sure to use the faster Python-only version in pwtools.dcd.read_dcd_data().

Also note that you need to use DCD_ALIGNED_CELL [4] when dealing with non-orthorhombic cells (e.g. MD in ensembles that allow cell deformations such as NPT). DCD outputs only Cartesian atom coords and cell parameters (a,b,c,alpha,beta,gamma) and not cell vectors, which can lead to a mismatch between atom coords and the cell if they are not aligned correctly. This thread [5] has a lot more details.

best, Steve

[1] http://www.ks.uiuc.edu/Research/vmd/plugins/molfile/dcdplugin.html [2] http://www.ks.uiuc.edu/Research/namd/wiki/?ReadingDCDinFortran [3] https://groups.google.com/forum/#!searchin/cp2k/reading$20dcd$20files%7Csort:date/cp2k/L6W4Dl4xxP4/G2AEcp1ukfcJ [4] https://manual.cp2k.org/cp2k-6_1-branch/CP2K_INPUT/MOTION/PRINT/TRAJECTORY.html#FORMAT [5] https://groups.google.com/forum/#!searchin/cp2k/Outputting$20cell$20information$20and$20fractional$20coordinates%7Csort:relevance/cp2k/72MhYykrSrQ/5c9Jaw7S9yQJ

patrickmelix commented 5 years ago

Hey Steve,

awesome! I'll then put this on my ToDo list and mention here when its done. Don't expect it right away, but it was annoying me a lot lately so there's hope.

My plan is to only support the DCD_ALIGNED_CELL format of CP2K, exactly because of the discussion you mentioned in [5]. Might put in a check for the CP2K string in the header to prevent users from thinking it can read other dcd formats too. Not sure if there's a nice way of checking if it really is aligned or not, will check the CP2K source for that then.

Thanks for all the support and your great tool! Cheers, Patrick

elcorto commented 5 years ago

This is somewhat off-topic, but IIRC it is not possible to check, just by using the data, whether the file was generated after first rotating ("aligning") the coords. That is precisely the biggest flaw of this and all other formats that don't store cell vectors.

In summary, problems with DCD are:

Personally, I'd like like to see all MD codes abandon crude, undocumented formats such as DCD and instead agree on modern efficient binary containers such as H5MD [1]. I haven't used it myself, but it looks like a step in the right direction. It uses HDF5 as backend (a very good idea) and seems to be flexible enough to support all data needed to represent an MD trajectory.

[1] http://nongnu.org/h5md/

elcorto commented 5 years ago

I took the liberty to close this one as I think the question is answered. In any case, I'd like to hear about your progress with this, so feel free to drop news here or ping me at: git@elcorto.com - also in the LICENSE file now :)

patrickmelix commented 5 years ago

https://gitlab.com/ase/ase/merge_requests/1109

Just opened a MR.

elcorto commented 5 years ago

Nice one!

You mentioned that you don't have tests, which is not ideal. You should be able to generate a short MD trajectory with CP2K and use that for testing. I did exactly that: run 2x 15 steps of NPT MD, write once text (xyz format) and once DCD files and use them to compare. See this part of my CP2K parser tests, and the test data used. This test shows all the subtle gotchas which you encounter when dealing with DCD.

See also the other DCD test where I check the header against different parser implementations.

patrickmelix commented 5 years ago

Mmh, good idea... But ASE doesn't allow non-code files to be added to the repo. So I would have to run CP2K through ASE first. I'll have a look into that.

elcorto commented 5 years ago

Well, since the design of ASE's MD and optimization is to call only the calculator's SCF engine in each step and store the trajectory in ASE's trajectory format, you won't be able to write a native DCD containing a trajectory from within CP2K using ASE. But even if, since you are not allowed to add the generated data, this won't help.

You could, in theory, start a short native CP2K MD at test time and then parse, but I would not go down that route. It would prolong test time and it necessitates a working CP2K install for anybody running the test. Such a feature would be, in fact, a special-purpose calculator that allows you to run a full native MD using the calculator, which works against ASE's design. As such, it is much more work to maintain than a simple data file.

ASE has quite a number of parsers which read MD-like output (e.g. QE, LAMMPS, CASTEP, ...), even though it is not able to generate it -- so your contribution fits in there quite nicely.

I briefly checked ASE's test suite and they seem to test mainly the calculators (doing SCF, MD, optimization and so forth), which makes total sense since this is what ASE does best. I haven't found tests using canned trajectory data generated by external code (but I may have overlooked something, of course), so it looks like less test coverage in those parts of the code. Therefore, you'll probably get away with it :)

patrickmelix commented 5 years ago

Thats what I figured out so far too. Too bad, but lets see what the maintainers think about it. Thx for your help!

patrickmelix commented 5 years ago

Hey, just to let you know: The MR is approved and the reader is now part of ASE.

Thanks for letting me reuse your stuff! Cheers, Patrick

elcorto commented 5 years ago

Awesome! I just updated the README and the cp2k wiki in order to point people there :) Thanks for your work!