azadoks / PseudoPotentialIO.jl

Support for reading and using pseudopotentials in Julia
https://azadoks.github.io/PseudoPotentialIO.jl/
MIT License

Design Considerations #2

azadoks opened this issue 2 years ago

azadoks commented 2 years ago

For the moment, I've essentially translated Simon Pintarelli's upf_to_json Python package, with tests to ensure numerical agreement.

Some rough to-do's are:

mfherbst commented 2 years ago

Some thoughts:

Add support for psp8 (and previous versions?) -- package could need a re-naming

Could be a different package? Or do you want to make a generic pseudo parser? We could add our HGH code from DFTK as well?

On that note, I wonder whether it would make sense to actually make this package a generic pseudopotential datastructure implementation (similar to the generic NormConservingPseudo interface we have in DFTK) that also does the parsing?

mfherbst commented 2 years ago

Unitful for this use case I'm not so sure. I would say this is pretty internal, so better just stick to atomic units everywhere

azadoks commented 2 years ago

Could be a different package? Or do you want to make a generic pseudo parser? We could add our HGH code from DFTK as well? On that note, I wonder whether it would make sense to actually make this package a generic pseudopotential datastructure implementation (similar to the generic NormConservingPseudo interface we have in DFTK) that also does the parsing?

The question was more towards the second point. I agree that if we can iron out the details for a fairly generic datastructure implementation, it could be useful to have a library for supporting various pseudo formats (like libpspio but readable).

Given how things are generally done in the community, having a PseudopotentialsBase.jl and then UPF.jl and PSP.jl could also be reasonable (but then the number of packages would be a bit bigger, with the XML PAW format that PseudoDojo/Abinit uses, PSML, and HGH that's already in DFTK).

Unitful for this use case I'm not so sure. I would say this is pretty internal, so better just stick to atomic units everywhere

Yeah, this seems reasonable. I should definitely document all the units that are used.

azadoks commented 2 years ago

w.r.t. datastructures, I'm a bit stuck on how best to proceed. The various formats support different kinds of information that don't always line up completely; one could massage each format separately to get a more homogeneous representation, at the cost of a less direct mapping to the file contents. On the other hand, having a separate type for each file format would give a better correspondence with how things are laid out in the files, but we'd need to rely much more on a well-defined interface (which wouldn't be all bad).

Does anyone have thoughts on which option might be better / another proposal entirely?
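
For concreteness, here's a minimal sketch of the two options in Julia. The names and fields are hypothetical, not a concrete proposal; it leans on the UPF-stores-r·β(r) vs. PSP8-stores-β(r) difference mentioned later in this thread.

```julia
# Option A: one homogenized struct; every parser massages its format into this,
# losing the direct mapping to the file contents.
struct GenericPseudo
    z_valence::Float64
    r_grid::Vector{Float64}
    β_projectors::Vector{Vector{Float64}}  # always the "true" β(r)
end

# Option B: one struct per file format, mirroring the file layout, with a
# shared accessor interface hiding the format-specific conventions.
struct UpfPseudo
    z_valence::Float64
    r_grid::Vector{Float64}
    rβ_projectors::Vector{Vector{Float64}}  # UPF stores r·β(r)
end

struct Psp8Pseudo
    z_valence::Float64
    r_grid::Vector{Float64}
    β_projectors::Vector{Vector{Float64}}   # PSP8 stores β(r) directly
end

# Interface functions absorb the differences (assuming r_grid[1] > 0 here):
β_projector(psp::UpfPseudo, i)  = psp.rβ_projectors[i] ./ psp.r_grid
β_projector(psp::Psp8Pseudo, i) = psp.β_projectors[i]
```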

mfherbst commented 1 year ago

On the other hand, having a separate type for each file format would give a better correspondence with how things are laid out in the files, but we'd need to rely much more on a well-defined interface (which wouldn't be all bad).

Certainly the way I would go. In DFTK we currently have a mixture, mainly for the pragmatic reason that it takes less time than designing the interface properly. But then it's also hard to generalise from the one example that we currently have in the code. If we aim broader here, I think there's a chance of success.

My suggestion would be to not worry about this top down, but rather implement a few parsers bottom-up and then try to find the common features and turn them into a function-based and types-based interface.

By the way, I'm also happy to discuss via Zoom etc. I'm still travelling this week, but next week should be better.

azadoks commented 1 year ago

Over in #7 I've implemented much more thorough parsers for UPF and PSP8, both of which produce structs for all the myriad fields in the files. I've tried to keep the structs mostly one-to-one with the file contents because it's much easier to find errors, add new sections, etc.

The parsers themselves are fairly thoroughly tested now, but I'm trying to write tests for the functional interface for common quantities, and I'm running into really poor performance and a fair bit of code duplication.

It seems a bit tricky to balance the two design goals I set:

  1. one-to-one correspondence with the file (i.e. no real calculations should be done at load, and the struct should have only data from the file)
  2. a purely functional interface for quantities like projectors, local potential, atomic charge densities, etc. and their Fourier transforms

The problem is that, for the formats I've parsed so far, the quantities in-file are either subtly different (projectors stored as r·β(r) in UPF vs. true projectors in PSP8) or completely different (numeric forms in UPF vs. analytical forms in HGH).

Subtle differences like in PSP8 vs UPF mean that a majority of the code must be duplicated and slightly tweaked for the two formats if we keep the struct-file correspondence from point 1.

Big differences like HGH vs. UPF/PSP8, combined with the restriction from point 1, mean that all pseudos should implement functions like local_potential_real(psp, r), which are cheap to evaluate for HGH but expensive for UPF/PSP8, because an interpolator has to be reconstructed each time the function is called and, per point 1, shouldn't be stashed with the data.
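
To illustrate the asymmetry (with made-up types and a schematic potential, not the package's actual API): for an analytical pseudo the evaluation is a couple of arithmetic operations, while for a strictly file-mirroring numeric pseudo the interpolant has to be rebuilt inside every call, e.g. with Interpolations.jl:

```julia
using Interpolations: linear_interpolation

# Analytical (HGH-like): V_loc(r) is cheap, just arithmetic on a few parameters.
struct AnalyticalPsp
    zion::Float64
    rloc::Float64
end
# Schematic placeholder, not the real HGH expression.
local_potential_real(psp::AnalyticalPsp, r) = -psp.zion / r * exp(-(r / psp.rloc)^2)

# Numeric (UPF/PSP8-like): the struct holds only the tabulated data from the file,
# so the interpolant must be reconstructed on every call.
struct NumericPsp
    r_grid::Vector{Float64}
    v_local::Vector{Float64}
end
function local_potential_real(psp::NumericPsp, r)
    itp = linear_interpolation(psp.r_grid, psp.v_local)  # rebuilt every call: slow
    return itp(r)
end
```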

One possible solution I've been thinking about is to split the data representations into two levels: FormatFile structs (e.g. UpfFile), which are one-to-one representations of the data in-file, and FormalismPsP structs (e.g. NNCPsP -- numeric norm-conserving, ANCPsP -- analytical norm-conserving), which standardize the quantities by formalism (see the sketch after the lists below). This has a few benefits:

  1. Parsers only need to worry about reading data
  2. You have direct access to the raw data
  3. The interface for the quantities can be cleanly implemented with little code duplication because all data are standardized in the *PsP structs

And a few drawbacks:

  1. Code complexity is possibly a bit higher (though possibly not, since all the special cases are condensed into the *File -> *PsP constructors)
  2. In some future cases, standardizing the quantities may be impossible
  3. The number of structs will be quite high, probably including:
    1. AbstractNCPsP: abstract norm-conserving
    2. NNCPsP: numeric norm-conserving
    3. ANCPsP: analytical norm-conserving
    4. USPsP: ultrasoft (only numeric)
    5. PAWPsP: PAW (only numeric)
    6. ?? SLPsP: semilocal (only numeric)
    7. ?? CoulombPsP: coulomb (only numeric)
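
A rough sketch of the two-level split, using the UPF case with heavily simplified, illustrative fields (the real structs in #7 carry many more):

```julia
# Level 1: one-to-one with the file; the parser only reads.
struct UpfFile
    z_valence::Float64
    r::Vector{Float64}
    pp_local::Vector{Float64}          # V_loc(r) exactly as stored
    pp_betas::Vector{Vector{Float64}}  # r·β(r) exactly as stored
end

# Level 2: standardized by formalism.
struct NNCPsP  # numeric norm-conserving
    z_valence::Float64
    r::Vector{Float64}
    v_local::Vector{Float64}
    β::Vector{Vector{Float64}}         # true β(r), whatever the source format
end

# All format-specific massaging is condensed into constructors like this one
# (assuming here that the UPF radial grid starts at r > 0).
NNCPsP(file::UpfFile) = NNCPsP(file.z_valence, file.r, file.pp_local,
                               [rβ ./ file.r for rβ in file.pp_betas])

# Quantities like local_potential_real(psp::NNCPsP, r) can then be implemented
# once against the *PsP types, with no per-format duplication.
```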

mfherbst commented 1 year ago

Subtle differences like in PSP8 vs UPF mean that a majority of the code must be duplicated and slightly tweaked for the two formats if we keep the struct-file correspondence from point 1.

On that point: I think there is little you can do about code duplication in this case without making it super difficult to understand what is happening. Best to document which parts of the duplicated code have similar versions in other files; that ensures that if you later refactor, you don't forget to do it elsewhere.