Wavefunction data - GitHub issues

dgasmith commented 5 years ago

A request that we see is for wavefunction data. Now that the basis set specification is nearing completion, it would be a good time to begin discussing the layout. Below is an example of a new top level field in which the basis data and wavefunction data are provided.

Decisions to make here:

1) Should there be a new top level field for wavefunction-like data and what should it be called? Another option would be to place this data in the properties top level field, but this combines output that requires the full basis set specification and data that are simple variables.
2) Where should the basis information be located? There is some discussion that this could (eventually) be a JSON-LD like approach to a basis library (https://www.basissetexchange.org). However, until the basis information is uniform between programs this is not currently possible.
3) What are the fields that we should list?
4) Can we assume that for RHF wave functions, only alpha quantities are present?
5) Should we exploit the symmetry of matrix quantities (e.g. Fock matrices) to reduce the footprint by a factor of ~2?

Remaining issues to work through:

Orbital ordering (#12)
Large array representations (#15)
How does the input change to request these quantities?

Example:

{
    # Schema headers
    'schema_name': 'qcschema_output',
    'schema_version': 1,

    # Minimal Input
    'molecule': {
        'symbols': ['He'],
        'geometry': [0.0, 0.0, 0.0]
    },
    'model': {
        'method': 'SCF',
        'basis': '6-31g'
    },
    'driver': 'energy',
    'keywords': {},

    # Standard output data
    'raw_output': None,
    'success': True,
    'provenance': {
        'creator': 'Psi4',
        'version': '1.3.2',
        'routine': 'psi4.json.run_json'
    },
    'return_result': -2.8551790335918543,
    'properties': {
        'calcinfo_nbasis': 2,
        'calcinfo_nmo': 2,
        'calcinfo_nalpha': 1,
        'calcinfo_nbeta': 1,
        'calcinfo_natom': 1,
        'return_energy': -2.8551790335918543,
        'nuclear_repulsion_energy': 0.0,
        'scf_one_electron_energy': -3.882050835689409,
        'scf_two_electron_energy': 1.0268718020975545,
        'scf_dipole_moment': [0.0, 0.0, 0.0],
        'scf_iterations': 4,
        'scf_total_energy': -2.8551790335918543
    },
    'wavefunction': {

        # Begin with a basis
        'basis': {
            'revision_description': 'Data from Gaussian 09/GAMESS',
            'elements': {
                '2': {
                    'electron_shells': [{
                        'function_type': 'gto',
                        'region': 'valence',
                        'angular_momentum': [0],
                        'exponents': ['0.3842163400E+02', '0.5778030000E+01', '0.1241774000E+01'],
                        'coefficients': [['0.4013973935E-01', '0.2612460970E+00', '0.7931846246E+00']]
                    }, {
                        'function_type': 'gto',
                        'region': 'valence',
                        'angular_momentum': [0],
                        'exponents': ['0.2979640000E+00'],
                        'coefficients': [['1.0000000']]
                    }],
                    'references': [{
                        'reference_description': '31G Split-valence basis set for H,He',
                        'reference_keys': ['gaussian09e01']
                    }]
                }
            },
            'version': '1',
            'function_types': ['gto', 'gto_cartesian'],
            'names': ['6-31G'],
            'flags': [],
            'family': 'pople',
            'description': '6-31G valence double-zeta',
            'role': 'orbital',
            'auxiliaries': {},
            'name': '6-31G'
        },

        # The individual matrices
        'orbitals_a':
        [0.5920657102503524, 1.1498260600756343, 0.5136020636861139, -1.1869518498493001],
        'orbitals_b':
        [0.5920657102503524, 1.1498260600756343, 0.5136020636861139, -1.1869518498493001],
        'density_a':
        [0.3505418052542542, 0.30408617062236576, 0.30408617062236576, 0.26378707982263505],
        'density_b':
        [0.3505418052542542, 0.30408617062236576, 0.30408617062236576, 0.26378707982263505],
        'fock_a':
        [-0.549237187834466, -1.000373967977106, -1.000373967977106, -0.4292228146057008],
        'fock_b':
        [-0.549237187834466, -1.000373967977106, -1.000373967977106, -0.4292228146057008]
    }
}
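
As a sanity check on the example above, the flat arrays can be reshaped into matrices and the alpha density rebuilt from the occupied orbitals. This is only a minimal sketch: the row-major (nbasis, nmo) layout is an assumption that the thread does not state explicitly, although the numbers in the example are consistent with it, and the helper name `reshape` is hypothetical.

```python
# Values copied from the example above.
nbasis, nmo, nalpha = 2, 2, 1
orbitals_a = [0.5920657102503524, 1.1498260600756343,
              0.5136020636861139, -1.1869518498493001]
density_a = [0.3505418052542542, 0.30408617062236576,
             0.30408617062236576, 0.26378707982263505]


def reshape(flat, nrows, ncols):
    """Split a flat row-major list into a list of rows (hypothetical helper)."""
    return [flat[r * ncols:(r + 1) * ncols] for r in range(nrows)]


# Assumed layout: rows are basis functions, columns are molecular orbitals.
C = reshape(orbitals_a, nbasis, nmo)

# Alpha density from the occupied columns: D[mu][nu] = sum_i C[mu][i] * C[nu][i]
D = [[sum(C[mu][i] * C[nu][i] for i in range(nalpha)) for nu in range(nbasis)]
     for mu in range(nbasis)]

# Largest deviation from the density stored in the example.
max_diff = max(abs(D[mu][nu] - density_a[mu * nbasis + nu])
               for mu in range(nbasis) for nu in range(nbasis))
print(max_diff)
```

If the assumed layout were wrong (for example column-major), the rebuilt density would not match the stored one, so a check like this is a cheap way for a reader to confirm the convention.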

sjrl commented 5 years ago

Should there be a new top level field for wavefunction-like data and what should it be called? Another option would be to place this data in the properties top level field, but this combines output that requires the full basis set specification and data that are simple variables.

I think it makes sense to make a separate top level field for the wavefunction-like data. I would hazard that wavefunction would be a fine top level field. In this first pass it looks like the types of quantities we are focusing on are mean-field quantities. So we could consider something like mean_field_wavefunction or just make sure that quantities listed within are similarly specific as they are in properties.

Where should the basis information be located? There is some discussion that this could (eventually) be a JSON-LD like approach to a basis library (https://www.basissetexchange.org). However, until the basis information is uniform between programs this is not currently possible.

Sorry if this has already been discussed, but are we planning to transform the basis-set-dependent quantities (e.g. density, Fock, etc.) to follow a standardized format (like in the Basis Set Exchange) when brought into QCArchive, or are we going to keep the native basis set conventions of the QC code that generated them?

What are the fields that we should list?

Can we assume that for RHF wave functions, only alpha quantities are present?

I think orbitals and density would be good for visualization. One of my dreams would also be to load orbitals from one QC code into another. For example, do an HF calculation in one code but then run a post-HF method in a different code. To do this we would definitely need the Fock matrix, and then maybe the orbital eigenvalues (although these could be determined from the Fock matrix), and maybe the one-electron Hamiltonian (which could potentially be recalculated easily in the new code). As for assuming only alpha quantities are present for RHF wavefunctions that would definitely save on space, but would be inconsistent with how properties is currently set up. So, to summarize the quantities: orbitals_alpha, orbitals_beta, density_alpha, density_beta, fock_alpha, fock_beta, orbitals_alpha_eig, orbitals_beta_eig, one_electron_hamiltonian.
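
To illustrate the remark that the orbital eigenvalues could be determined from the Fock matrix, here is a minimal sketch using the He/6-31G numbers from the example at the top of the thread. It assumes the orbitals are stored row-major as (nbasis, nmo) and are orthonormal with respect to the overlap matrix, so that C^T F C is diagonal with the orbital energies on the diagonal. The helper names are hypothetical.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# orbitals_a and fock_a from the example, reshaped to 2 x 2 (assumed row-major).
C = [[0.5920657102503524, 1.1498260600756343],
     [0.5136020636861139, -1.1869518498493001]]
F = [[-0.549237187834466, -1.000373967977106],
     [-1.000373967977106, -0.4292228146057008]]

# For converged SCF orbitals, C^T F C should be diagonal.
eps_matrix = matmul(transpose(C), matmul(F, C))
orbital_energies = [eps_matrix[i][i] for i in range(len(eps_matrix))]
off_diagonal = abs(eps_matrix[0][1])

print(orbital_energies, off_diagonal)
```

Storing the eigenvalues explicitly would still save the reader this step, and it avoids any dependence on how tightly the producing code converged its orbitals.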

dgasmith commented 5 years ago

The schema itself would define orbital ordering and normalization. So something like QCEngine would need to translate the orbitals to the standard order and codes outputting the schema would need to conform to this ordering.

My original plan was simply to have the top level wavefunction field, which would expand depending on the method (for example, amplitudes for CC* if requested).

As for assuming only alpha quantities are present for RHF wavefunctions that would definitely save on space, but would be inconsistent with how properties is currently set up.

How so? The current properties field does not have any explicit alpha or beta quantities.

dgasmith commented 5 years ago

Based on feedback (mostly offline), I am planning to implement the wavefunction data as-is. Likely the most divisive decision is to skip symmetry in matrix quantities in order to: 1) keep the schema simple, 2) prevent strange issues where not all densities are symmetric, and 3) lower the overhead on the readers and writers of the schema.
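
For context, the symmetric packing being skipped would look roughly like the sketch below. It is not part of the schema; the function names are hypothetical. It shows both the saving (n(n+1)/2 values instead of n*n) and the failure mode from point 2: a matrix that is not symmetric does not survive the round trip.

```python
def pack_upper(flat, n):
    """Keep only the upper triangle of an n x n matrix stored flat, row-major."""
    return [flat[i * n + j] for i in range(n) for j in range(i, n)]


def unpack_upper(packed, n):
    """Rebuild the full flat matrix by mirroring the upper triangle."""
    full = [0.0] * (n * n)
    k = 0
    for i in range(n):
        for j in range(i, n):
            full[i * n + j] = packed[k]
            full[j * n + i] = packed[k]
            k += 1
    return full


# fock_a from the example: symmetric, so packing is lossless.
fock_a = [-0.549237187834466, -1.000373967977106,
          -1.000373967977106, -0.4292228146057008]
packed = pack_upper(fock_a, 2)          # 3 values instead of 4
roundtrip_ok = unpack_upper(packed, 2) == fock_a

# A non-symmetric matrix silently loses its lower triangle.
nonsym = [1.0, 2.0, 3.0, 4.0]
lossy = unpack_upper(pack_upper(nonsym, 2), 2)
print(len(packed), roundtrip_ok, lossy)
```

A packed format would therefore need every reader and writer to know which quantities are guaranteed symmetric, which is the overhead that storing full matrices avoids.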

Of the "remaining issues to work through":

"Orbital ordering (#12)" - The "CCA" ordering is implicitly defined by Libint's default/CCA ordering. We will match our definitions to the Libint package.
"Large array representations (#15)" - This will be skipped. As noted, the schema is not necessarily JSON; it can be any key/value/array syntax. Large arrays can be efficiently packed through other formats such as msgpack/HDF.
"How does the input change to request these quantities?" - Leaving this up to downstream codes for now.

I will make a PR in a few days unless there is additional feedback.
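
To make the orbital-ordering point concrete, here is a sketch of the kind of translation a writer would need. It assumes that the CCA convention for spherical functions orders the components of a shell as m = -l, ..., +l, and it uses a made-up source convention (0, +1, -1, +2, -2, ...) purely for illustration. Cartesian ordering and normalization are not covered, and all function names are hypothetical.

```python
def cca_spherical_order(l):
    """Assumed CCA order for a spherical shell: m = -l, ..., +l."""
    return list(range(-l, l + 1))


def alternating_order(l):
    """Illustrative source convention: m = 0, +1, -1, +2, -2, ..."""
    order = [0]
    for m in range(1, l + 1):
        order += [m, -m]
    return order


def permutation_to_cca(source_order, l):
    """For each CCA position, the index of that m value in the source order."""
    return [source_order.index(m) for m in cca_spherical_order(l)]


def reorder_rows(rows, perm):
    """Apply the permutation to the basis-function rows of one shell."""
    return [rows[p] for p in perm]


perm = permutation_to_cca(alternating_order(2), 2)
labels = ["d0", "d+1", "d-1", "d+2", "d-2"]  # one d shell in the source order
print(perm, reorder_rows(labels, perm))
```

The same permutation would be applied to the basis-function rows of the orbitals and to both the rows and columns of the density and Fock matrices.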

mattwelborn commented 5 years ago

There's an odd input/output asymmetry with basis. In the input schema, basis lives in model.basis and could be a name (e.g. "6-31g") or a full basis set specification. This leaves open the possibility of the program interpreting the name to mean something slightly different from what's on the Basis Set Exchange or used in another program. However, in the output schema, there is a new wavefunction.basis which must fully specify the basis that was used in the calculation. Otherwise, the orbitals, density, etc. are not usable independently of the program.

This is fine if wavefunction is an output-only thing. But, one (@sjrl) might want to load in orbitals from a previous calculation, so wavefunction should be in both the input and output schemas. But what happens if in an input schema, the basis is specified in both the model and wavefunction sections?

Perhaps the solution is to remove basis from the wavefunction section, and require that inputs/outputs which include a wavefunction section fully specify their basis set in model.basis. An oddity with this method is that an input with e.g. model.basis = 6-31g would have to replace model.basis with the full description of the basis set if wavefunction quantities are reported.
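
The rule proposed here could be checked along these lines. This is a sketch of the proposal only, not of anything the schema actually enforces; the function name and the error message are hypothetical, and a "full specification" is approximated as "model.basis is a dict rather than a name".

```python
def check_basis_for_wavefunction(record):
    """Raise ValueError if a record has wavefunction data but only names its basis."""
    if "wavefunction" not in record:
        return  # no wavefunction data, a basis name is acceptable
    basis = record.get("model", {}).get("basis")
    if not isinstance(basis, dict):
        raise ValueError(
            "records with a wavefunction section must fully specify model.basis"
        )


# A plain input with a named basis passes.
ok_input = {"model": {"method": "SCF", "basis": "6-31g"}}
check_basis_for_wavefunction(ok_input)

# An output that reports orbitals but keeps only the basis name is rejected.
bad_output = {
    "model": {"method": "SCF", "basis": "6-31g"},
    "wavefunction": {"orbitals_a": [1.0]},
}
try:
    check_basis_for_wavefunction(bad_output)
except ValueError as err:
    print("rejected:", err)
```

This also shows the oddity mentioned above: the same calculation would carry a basis name in its input but a full specification in its output.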

mattwelborn commented 5 years ago

After thinking about it, it's obvious that the schema for input and output of wavefunction quantities must be different.

mattwelborn commented 4 years ago

Are we ready to close this?

dgasmith commented 4 years ago

Yes, the initial implementation is in. Further comments can spawn new issues.

MolSSI / QCSchema

Wavefunction data #63