atomistic-machine-learning / schnetpack

SchNetPack - Deep Neural Networks for Atomistic Systems
Other
774 stars 214 forks source link

SchNetPack on SMILES only input data #631

Closed CalmScout closed 4 months ago

CalmScout commented 4 months ago

Dear maintainers of schnetpack,

I am exploring the official tutorial notebook tutorial_01_preparing_data.ipynb

Provided example with md17_uracil.npz data contains positions, energies, forces and numbers entries that are required for the construction of ase.atoms.Atoms objects stored in atoms_list:

data = np.load('./md17_uracil.npz')
numbers = data["z"]
atoms_list = []
property_list = []
for positions, energies, forces in zip(data["R"], data["E"], data["F"]):
    ats = Atoms(positions=positions, numbers=numbers)
    properties = {'energy': energies, 'forces': forces}
    property_list.append(properties)
    atoms_list.append(ats)

May I ask if you have any functions for initialization of the SchNetPack with only SMILES molecular representations?

Best regards, Anton.

jnsLs commented 4 months ago

Dear @CalmScout ,

SchNetPack does not support Smiles inputs, because it is designed for 3D structures. You could utilize for example rdkit to get 3D embeddings from smiles and use that as an input for SchNetPack models.

It could look like this:

from ase import Atoms
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

smiles = "CC(=O)NC1=CC=C(C=C1)O"    # paracetamol
# load data
rdkmol = Chem.MolFromSmiles(smiles)
rdkmol = Chem.AddHs(rdkmol)
AllChem.EmbedMolecule(rdkmol, randomSeed=42)
AllChem.MMFFOptimizeMolecule(rdkmol)
# get positions
n_nodes = rdkmol.GetNumAtoms()
pos = []
for i in range(n_nodes):
    conformer_pos = rdkmol.GetConformer().GetAtomPosition(i)
    pos.append([conformer_pos.x, conformer_pos.y, conformer_pos.z])
# get atomic numbers
atomic_numbers = []
for atom in rdkmol.GetAtoms():
    atomic_numbers.append(atom.GetAtomicNum())

at = Atoms(positions=np.array(pos), numbers=np.array(atomic_numbers))

Best, Jonas

CalmScout commented 4 months ago

Dear @jnsLs ,

Thank you for a such prompt reply with a code snippet. Indeed, I used RDKit to generate 3D structure from SMILES, but my solution was not optimal because it was saving .xyz files:

from rdkit import Chem
from rdkit.Chem import AllChem
from pathlib import Path

def generate_xyz_from_smiles(smi: str, dir_to: Path) -> None:
    # check if file already exists
    if (dir_to / f"{smi}.xyz").exists():
        return
    mol = Chem.MolFromSmiles(smi)
    mol_add_h = Chem.AddHs(mol)
    # embed molecule in 3D
    AllChem.EmbedMolecule(mol_add_h)
    # convert molecule to xyz and save to file
    Chem.MolToXYZFile(mol_add_h, str(dir_to / f"{smi}.xyz"))

def read_xyz(path: Path) -> tuple[int, list[int], np.ndarray]:
    """
    Read xyz file and return atomic numbers and corresponding positions.
    """
    with path_xyz.open() as f:
        lines = f.readlines()
        N = int(lines[0])
        lines = lines[2:]
        positions = np.array([[float(_) for _ in line.split()[1:4]] for line in lines])
        symbols = [line.split()[0] for line in lines]
        atomic_numbers = [Chem.GetPeriodicTable().GetAtomicNumber(symbol) for symbol in symbols]
    return N, atomic_numbers, positions

Please let me know if I understand correctly that for the creation of ase.Atoms object we need only 3D coordinates of atoms in the molecule and corresponding atomic numbers as in the example below:

ats = Atoms(positions=positions, numbers=numbers)

If so, the properties like in the example from tutorial notebook:

properties = {'energy': energies, 'forces': forces}

can be viewed as labels to predict? Or, for example, energies can be used during the training of SchNet as well?

Best regards, Anton.

jnsLs commented 4 months ago

Hi Anton,

thats right, you can define an ase.Atoms object by solely providing atom positions and atomic numbers.

Properties such as energy and forces are labels (targets) in your dataset. Each structure (defined by the ase.Atoms object) has a corresponding target value for each property. When you train a model in SchNetPack you train on those targets. I.e., the objective of the training is to minimize the deviation between properties predicted by the model (e.g. SchNet) and the respective traget values in your dataset (energies, forces, ...).

Best, Jonas

CalmScout commented 4 months ago

Hi Jonas,

Thank you for such a great support of SchNetPack package and great explanations!

Best regards, Anton.