How to convert extxyz file to .db file?

I have an extxyz file and I want to feed it to schnetpack to train a model. I followed this tutorial: https://schnetpack.readthedocs.io/en/latest/tutorials/tutorial_01_preparing_data.html, but I still have many questions. Here is my conversion script:

    import os

    import ase.io
    from schnetpack.data import ASEAtomsData

    atoms = ase.io.read("total.extxyz", index=":")
    property_list = []
    atom_list = []
    atomref = {}
    for atom in atoms:
        properties = {'energy': [atom.get_potential_energy()], 'forces': atom.get_forces().tolist(), 'energy_U0': [atom.get_potential_energy()]}

        for key in properties:
            if key not in atomref:
                atomref[key] = properties[key]
            else:
                atomref[key] = atomref[key] + properties[key]

        property_list.append(properties)
        atom_list.append(atom)

    if os.path.exists('./total.db'):
        os.remove('./total.db')
    new_dataset = ASEAtomsData.create('./total.db',
                                      distance_unit='Ang',
                                      property_unit_dict={'energy':'kcal/mol', 'forces':'kcal/mol/Ang', 'energy_U0': 'kcal/mol'},
                                      atomrefs=atomref)
    new_dataset.add_systems(property_list, atom_list)

Questions:

1. What is atomref? Aren't the energies and forces themselves the reference data?

2. Why must I specify an energy_U0 field? Without it, an exception is thrown:

        File "/home/dym/.conda/envs/painn-schnetpack/lib/python3.10/site-packages/schnetpack/data/atoms.py", line 401, in <dictcomp>
          arefs = {k: self.conversions[k] * torch.tensor(v) for k, v in arefs.items()}
        KeyError: 'energy_U0'

3. It seems that the .db file needs many more fields, such as _offsets; see this exception:

        File "/home/dym/.conda/envs/painn-schnetpack/lib/python3.10/site-packages/schnetpack/atomistic/distances.py", line 16, in forward
          offsets = inputs[properties.offsets]
        KeyError: '_offsets'

4. The wiki shows a list of properties for QM9. Must I parse all of this information from the extxyz file?
```
Number of reference calculations: 133770
Available properties:
- energy
- forces

Properties of molecule with id 0:
_idx : torch.Size([1])
energy : torch.Size([1])
forces : torch.Size([12, 3])
_n_atoms : torch.Size([1])
_atomic_numbers : torch.Size([12])
_positions : torch.Size([12, 3])
_cell : torch.Size([1, 3, 3])
_pbc : torch.Size([3])
```

Here is my dataset: total.zip

I couldn't find any detailed documentation on how to convert my data file to a .db file, so everything I have done is based on reverse engineering. The code and questions may therefore look a little weird.

Thank you.

Hi @nahso,

To create a dataset, you need to parse your xyz file and provide the following data: atomic numbers, positions, the cell (if you use periodic boundary conditions), and your properties (e.g. energy or forces, depending on your data). Furthermore, you need to provide a property_unit_dict that maps every property to a physical unit. You can then create the database file as in the tutorial example. Below, I will try to answer your questions one by one.
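Before the individual answers, the steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive script: it assumes that every frame in the extxyz file carries an energy and forces that ASE can read, it reuses the file names and units from the question, and the helper names missing_units and convert_extxyz_to_db are made up for illustration. No atomrefs are passed, and only properties that actually exist in the data are listed.

```python
import os

# Units copied from the script in the question; adjust them to your data.
PROPERTY_UNITS = {"energy": "kcal/mol", "forces": "kcal/mol/Ang"}


def missing_units(property_names, property_unit_dict):
    """Return the property names that have no unit assigned (hypothetical helper)."""
    return sorted(set(property_names) - set(property_unit_dict))


def convert_extxyz_to_db(xyz_path, db_path, property_unit_dict=None):
    """Write all frames of an extxyz file to a schnetpack ASE database (sketch)."""
    # Third-party imports live here so missing_units stays usable without them.
    import numpy as np
    import ase.io
    from schnetpack.data import ASEAtomsData

    units = PROPERTY_UNITS if property_unit_dict is None else property_unit_dict

    # index=":" makes ASE return a list with one Atoms object per frame.
    frames = ase.io.read(xyz_path, index=":")
    if not frames:
        raise ValueError(f"no structures found in {xyz_path}")

    # One dict per structure: energy with shape (1,), forces with shape (n_atoms, 3).
    property_list = [
        {
            "energy": np.array([frame.get_potential_energy()]),
            "forces": np.array(frame.get_forces()),
        }
        for frame in frames
    ]

    # Every stored property needs a unit; fail early with a readable message.
    unknown = missing_units(property_list[0].keys(), units)
    if unknown:
        raise ValueError(f"no unit given for: {unknown}")

    # Start from a clean file, as in the original script.
    if os.path.exists(db_path):
        os.remove(db_path)

    dataset = ASEAtomsData.create(
        db_path,
        distance_unit="Ang",
        property_unit_dict=units,
    )
    dataset.add_systems(property_list, frames)
    return dataset
```

For the files from the question, this would be called as convert_extxyz_to_db("total.extxyz", "./total.db"). Atomic numbers, positions, cell and PBC are taken from each Atoms object, so only the extra properties need to be listed explicitly.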

1. You can find more details about the atomref values in tutorial 2. They estimate the average energy contribution per atom type. If you do not have this data available, or if, as in your example, only one atom type is present, simply leave out the atomrefs argument of ASEAtomsData.create.
2. As in 1., you should not add atomrefs for your dataset. Your property_unit_dict also still contains the key energy_U0, but this property is not present in your data, so just remove it.
3. A missing _offsets key usually means that you are training a model on a datamodule that has no neighbor list.
4. See the introduction of this post.
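For point 3, one way to make sure a neighbor list is present is to add a neighbor list transform to the datamodule. The following is a sketch under assumptions: the cutoff of 5.0 and the batch size are placeholder values, the 80/10 split and the helper names split_sizes and build_datamodule are hypothetical, and ASENeighborList is only one of the neighbor list transforms that schnetpack provides.

```python
def split_sizes(n_structures, train_percent=80, val_percent=10):
    """Return (num_train, num_val); the remaining structures form the test set."""
    num_train = n_structures * train_percent // 100
    num_val = n_structures * val_percent // 100
    return num_train, num_val


def build_datamodule(db_path, n_structures, cutoff=5.0, batch_size=10):
    """Load the .db file with a neighbor list transform attached (sketch)."""
    # Imported here so split_sizes can be used without schnetpack installed.
    import schnetpack as spk
    import schnetpack.transforms as trn

    num_train, num_val = split_sizes(n_structures)
    datamodule = spk.data.AtomsDataModule(
        db_path,
        batch_size=batch_size,
        num_train=num_train,
        num_val=num_val,
        transforms=[
            # Computes neighbor indices and the periodic offsets for each structure.
            trn.ASENeighborList(cutoff=cutoff),
            # Casts floating point inputs to 32 bit.
            trn.CastTo32(),
        ],
    )
    datamodule.prepare_data()
    datamodule.setup()
    return datamodule
```

With such a transform in place, each batch should contain the neighbor indices and the _offsets entry that the distance module looks up.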

I hope this helps to solve your issue.
