aiqm / torchani

Accurate Neural Network Potential on PyTorch
https://aiqm.github.io/torchani/
MIT License

generate dataset for torchani #622

Open MichailDanikas opened 2 years ago

MichailDanikas commented 2 years ago

Hi, I have a problem creating my own dataset to use later for training. I'm a beginner with h5py, and I don't understand how the datasets should be formatted. I am trying to use the last part of #611, where my species look like this for one molecule: array([['O', 'C', 'O'], ['O', 'C', 'O'],... ['O', 'C', 'O']]). The coordinates are in the form [array([[[ 0. , 0. , 1.237479], [ 0. , 0. , -0.3 ], [ 0. , 0. , -1.237479]]]),...] and the energies are [array(226.56324331), array(208.34163576), array(191.23083335),...].

I've also tried other formats, which I saved using torchani.data._pyanitools.datapacker('./path_to_file', mode = 'w'). After loading them with torchani.data.load('./path_to_file'), they were transformed into dictionaries, as in the examples in ani_gdb_s01.h5. However, during training the following error is raised: [image]

If you have any suggestions, please let me know. Thank you in advance.

jvita commented 1 year ago

Probably a bit late for the original poster, but here's what I do to convert from a list of ASE.Atoms objects. I'm not sure if it's 100% correct, but it seems to work fine.


import h5py
import numpy as np

# `train` is a list of ASE.Atoms objects
with h5py.File('train.hdf5', 'w') as hdf5:
    for i, atoms in enumerate(train):
        natoms = len(atoms)

        # One group per structure, each holding a single conformer
        g = hdf5.create_group(str(i))

        g.create_dataset('energies', data=np.atleast_1d(atoms.info['energy']))
        g.create_dataset('cell', data=np.array(atoms.cell).reshape((1, 3, 3)))
        g.create_dataset('coordinates', data=atoms.positions.reshape((1, natoms, 3)))
        g.create_dataset('force', data=atoms.arrays['forces'].reshape((1, natoms, 3)))
        # Species are hardcoded to carbon here; other systems need their own symbols as byte strings
        g.create_dataset('species', data=[b'C']*natoms)
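
For data that is already held as NumPy arrays, as in the original question, a similar file could probably be written without going through ASE. The following is a minimal sketch, not a confirmed TorchANI recipe. It assumes the loader expects, per molecule group, `species` of shape (n_atoms,), `coordinates` of shape (n_conformers, n_atoms, 3), and `energies` of shape (n_conformers,), which is how the ani_gdb_s01.h5 examples appear to be laid out. The helper names `to_ani_arrays` and `write_molecule`, the group name, and the second geometry in the sample data are hypothetical.

```python
import numpy as np


def to_ani_arrays(species_rows, coords_list, energies_list):
    """Collapse per-conformer inputs into one set of per-molecule arrays.

    Assumed target layout (modeled on ani_gdb_s01.h5, not verified here):
      species     -> (n_atoms,) byte strings, stored once per molecule
      coordinates -> (n_conformers, n_atoms, 3)
      energies    -> (n_conformers,)
    """
    species_rows = np.asarray(species_rows)
    # Every conformer must list the same atoms in the same order.
    if not (species_rows == species_rows[0]).all():
        raise ValueError("all conformers must share one species ordering")
    # h5py cannot store NumPy unicode arrays, so encode symbols to bytes.
    species = np.array([s.encode("ascii") for s in species_rows[0]])
    # Each coordinates entry has shape (1, n_atoms, 3); join along axis 0.
    coordinates = np.concatenate(
        [np.asarray(c, dtype=np.float64) for c in coords_list], axis=0
    )
    # Each energy is a 0-d array; collect them into one 1-D array.
    energies = np.array([float(e) for e in energies_list])
    if coordinates.shape != (len(energies), len(species), 3):
        raise ValueError(f"unexpected coordinates shape {coordinates.shape}")
    return species, coordinates, energies


def write_molecule(path, name, species, coordinates, energies):
    """Write one molecule group; `name` is an arbitrary (hypothetical) label."""
    import h5py  # third-party; imported here so the helper above needs only NumPy

    with h5py.File(path, "w") as f:
        g = f.create_group(name)
        g.create_dataset("species", data=species)
        g.create_dataset("coordinates", data=coordinates)
        g.create_dataset("energies", data=energies)


# Sample input shaped like the question's data. The first geometry and both
# energies come from the post; the second geometry is a made-up placeholder.
species_rows = [["O", "C", "O"], ["O", "C", "O"]]
coords_list = [
    np.array([[[0.0, 0.0, 1.237479], [0.0, 0.0, -0.3], [0.0, 0.0, -1.237479]]]),
    np.array([[[0.0, 0.0, 1.2], [0.0, 0.0, -0.2], [0.0, 0.0, -1.2]]]),
]
energies_list = [np.array(226.56324331), np.array(208.34163576)]

species, coordinates, energies = to_ani_arrays(species_rows, coords_list, energies_list)
```

Calling `write_molecule('co2.h5', 'co2', species, coordinates, energies)` would then produce a file with a single group. The main difference from the arrays in the question is that species is stored once per molecule rather than once per conformer, and the per-conformer lists are merged into single arrays.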