AtomsDataModule issue with own dataset

Neon7799 commented 1 year ago

Hi, my dataset (nearly 20,000 structures) contains five elements (H, C, N, O, Zn), and the geometries differ in size. I can convert it from .npz to .db correctly, but I get an error message when splitting the dataset. Here is my code:

import os
from schnetpack.data import ASEAtomsData
from ase import Atoms
import torch
from torch.optim import Adam
import schnetpack.transform as trn
import numpy as np
from schnetpack.data import *

%rm metal.db
data = np.load('./metal.npz', allow_pickle=True)
atoms_list = []
property_list = []
for numbers, positions, energies in zip(data["Z"], data["R"], data["E"]):
    ats = Atoms(positions=positions, numbers=numbers)
    properties = {'energy': energies,}
    property_list.append(properties)
    atoms_list.append(ats)
atomrefs = {
    'energy': [
        -0.598680709282, -38.770836588232, -55.473973919248, -73.967208936440,
        -1805.489171255718
    ]
}

newdataset = ASEAtomsData.create(
    './metal.db',
    distance_unit='Ang',
    property_unit_dict={'energy': 'Ha'},
    atomrefs=atomrefs
)
newdataset.add_systems(property_list, atoms_list)

example = newdataset[0]
for k, v in example.items():
    print('-', k, ':', v.shape)

Printed information for one geometry:
- _idx : torch.Size([1])
- energy : torch.Size([1])
- _n_atoms : torch.Size([1])
- _atomic_numbers : torch.Size([45])
- _positions : torch.Size([45, 3])
- _cell : torch.Size([1, 3, 3])
- _pbc : torch.Size([3])

data_module = AtomsDataModule(
    datapath='metal.db',
    format=AtomsDataFormat.ASE,
    batch_size=100,
    num_train=10000,
    num_val=5000,
    transforms=[
        trn.ASENeighborList(cutoff=5.),
        trn.RemoveOffsets("energy", remove_mean=True, remove_atomrefs=True),
        trn.CastTo32()
    ],
    num_workers=1,
    pin_memory=False,
    property_units={'energy': 'Ha'},
    distance_unit="Ang",
    load_properties=["energy"],
)
data_module.prepare_data()

data_module.setup()  # the error is raised by this line

In the error message, Structure.Z is the total number of atoms in each molecule, and 'size 5' refers to the five atomic reference energies, but I don't know how to fix the issue. I tested the QM9 dataset with the QM9 module and it worked well, but when I tested uracil.npz from the tutorial, I got the same error.

owen-rett commented 1 year ago

I only started using SchNetPack (Spk) within the past week, but I ran into a similar problem. I think Spk wants to be able to call atom_ref[Z], where Z is the atomic number of the species. I solved this by making the atom_ref list a list of zeros as long as the largest atomic number I would need, then filling in the entries for my elements. I set the maximum index to uranium (Z = 92), simply because it's the largest element I've seen in an MLIAP.

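# E_H, E_O and E_Zr are the reference energies for my own system, defined elsewhere
lst_ats = [0.0] * 93  # indices 0..92, so the list can be indexed by atomic number up to uranium
lst_ats[1] = E_H      # H,  Z = 1
lst_ats[8] = E_O      # O,  Z = 8
lst_ats[40] = E_Zr    # Zr, Z = 40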
atom_refs = {
    'energy': lst_ats
}

Then I just pass that dict to the dataset creation call.
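Applied to the dataset from the original post, the same idea might look like the sketch below. It assumes the five reference energies were listed in the order H, C, N, O, Zn (the post does not state the order explicitly), and `build_atomrefs` is a hypothetical helper, not part of SchNetPack.

```python
# Hypothetical helper (not a SchNetPack function): expands per-element
# reference energies into a list that can be indexed by atomic number.
def build_atomrefs(energies_by_z, max_z=92):
    refs = [0.0] * (max_z + 1)  # index 0 is unused padding
    for z, energy in energies_by_z.items():
        if not 0 < z <= max_z:
            raise ValueError(f"atomic number {z} is outside 1..{max_z}")
        refs[z] = energy
    return refs


# Values copied from the original post; the element assignment is an
# ASSUMPTION based on the order "H, C, N, O, Zn" given there.
reference_energies = {
    1: -0.598680709282,      # H
    6: -38.770836588232,     # C
    7: -55.473973919248,     # N
    8: -73.967208936440,     # O
    30: -1805.489171255718,  # Zn
}

atomrefs = {"energy": build_atomrefs(reference_energies)}
```

The resulting dict would then be passed as `atomrefs=atomrefs` to `ASEAtomsData.create`, in place of the five-entry list used in the original post.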

Edit: I found this old issue while looking up something else; it provides a more direct answer: https://github.com/atomistic-machine-learning/schnetpack/issues/218#issuecomment-597250608
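If the cause really is that the reference list gets indexed by atomic number, as suggested above, a quick check before creating the database could catch it early. This is only a sketch: `missing_atomref_indices` is a hypothetical helper, and the toy `structures` list stands in for the per-structure atomic numbers (`data["Z"]` in the original post).

```python
def missing_atomref_indices(atomic_numbers_per_structure, atomref_list):
    """Return the sorted atomic numbers that are not valid indices into atomref_list."""
    present = {int(z) for numbers in atomic_numbers_per_structure for z in numbers}
    return sorted(z for z in present if z >= len(atomref_list))


# Toy stand-in for data["Z"]: one water-like and one Zn-containing structure.
structures = [[8, 1, 1], [30, 7, 6, 1]]

short_refs = [0.0] * 5    # one entry per element, as in the original post
padded_refs = [0.0] * 93  # one entry per atomic number up to uranium

print(missing_atomref_indices(structures, short_refs))   # [6, 7, 8, 30]
print(missing_atomref_indices(structures, padded_refs))  # []
```

An empty result means every atomic number in the data has a slot in the reference list; it does not check that the energies stored in those slots are correct.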

Neon7799 commented 1 year ago

Thanks, it's really helpful

AtomsDataModule issue with own dataset #509