atomistic-machine-learning / schnetpack-gschnet

G-SchNet extension for SchNetPack
MIT License

generated molecules have broken chain structures? #8

Closed qinhaigen closed 1 year ago

qinhaigen commented 1 year ago

Thanks for sharing your great work!

I'm confused about the generated compounds, which contain many broken chain structures, whether I train for 1 epoch or 120 epochs with a custom dataset. I use RDKit to extract xyz coordinates from the db file, but the converted molecules contain dots, e.g. [CH].[CH][C][CH].[C].[C][CH][CH]N[O].[C][C].[C][C]OC1([C])OCO1.[H].[H].[O]. What can I do to solve this problem?
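(For context: a dot in a SMILES string separates disconnected fragments, so counting dot-separated parts is a quick way to flag broken outputs. `count_fragments` below is just an illustrative helper, not part of schnetpack-gschnet or RDKit.)

```python
def count_fragments(smiles: str) -> int:
    """Count disconnected fragments in a SMILES string.

    A "." only ever appears at the top level of a SMILES string as a
    fragment separator, so a plain split is sufficient for this check
    (no RDKit required). A result > 1 means the molecule is broken.
    """
    return len(smiles.split("."))

print(count_fragments("CCO"))                   # 1 -> connected
print(count_fragments("[C][C]OC1([C])OCO1"))    # 1 -> connected
print(count_fragments("[CH].[CH][C][CH].[C]"))  # 3 -> broken chain
```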
NiklasGebauer commented 1 year ago

Hi @qinhaigen ,

This can have many causes, as there are several points where things could go wrong.

Hope this helps! Best regards, Niklas

qinhaigen commented 1 year ago

Hi @NiklasGebauer,

Thank you very much for your valuable feedback and suggestions!

  1. The generated structures are more or less random clusters of atoms.
  2. Out of the 100 molecules generated after one epoch, 52 are unconnected, and more than half of the molecules generated after 120 epochs also show chain breakage.
  3. As you said, I need to customize a filter to keep only valid molecules.
  4. I attempted to build a dataset from the custom training data you provided; my conversion algorithm produced SMILES without any problem, and viewing the 3D structures with the code you posted also worked. In training and generation I use custom conditions that are not among the ones you provide, such as qed and logp. For conditions without units (not present in the ase module), I use the following code to prepare the required training data (containing H, C, N, O, F):
    
    import numpy as np
    from ase import Atoms
    from schnetpack.data import ASEAtomsData

    mol_list = []
    property_list = []
    for xyz in xyz_files:
        # get molecule information with your custom read_molecule function
        atom_positions, atom_types, property_values = read_molecule(xyz)
        # create ase.Atoms object and append it to the list
        mol = Atoms(positions=atom_positions, numbers=atom_types)
        mol_list.append(mol)
        # create a dictionary that maps property names to property values
        # note that the values need to be numpy float arrays (even for scalars)
        properties = {
            "qed": np.array([float(property_values)]),
        }
        property_list.append(properties)

    # create an empty data base with the correct format;
    # make sure to provide the correct units of the positions and properties
    custom_dataset = ASEAtomsData.create(
        "./data/chembl_1000.db",         # where to store the data base
        distance_unit="Angstrom",        # unit of positions
        property_unit_dict={"qed": ""},  # units of properties (qed is dimensionless)
    )

    # write gathered molecules and their properties to the data base
    custom_dataset.add_systems(property_list, mol_list)


5. For generation, I use a condition (qed=1.0) within the value range of the training data (qed: 0.5~1.0).
6. For training, the data set consists only of valid, connected molecules. Most of the parameters are unmodified, except for changing 1.7 to 1.9. I'd be happy to provide the cli.log (in the attachment): [cli.log](https://github.com/atomistic-machine-learning/schnetpack-gschnet/files/11402721/cli.log). If you discover anything, I hope you can share it with me. The number of heavy atoms per molecule ranges from 10 to 40, and the total number of atoms from 10 to 68. I don't know whether molecules of such different sizes cause this problem.
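As a sketch of the filter mentioned in point 3, connectivity of a generated 3D structure can be checked with a distance-based breadth-first search. The 1.9 Å cutoff below mirrors the value mentioned above but is an illustrative assumption; a real filter should use per-element covalent radii:

```python
import numpy as np
from collections import deque

def is_connected(positions, cutoff=1.9):
    """Return True if every atom is reachable from atom 0 when two
    atoms closer than `cutoff` (in Angstrom) are treated as bonded."""
    pos = np.asarray(positions, dtype=float)
    n = len(pos)
    # pairwise distance matrix, with self-distances excluded from adjacency
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    adj = (dists < cutoff) & ~np.eye(n, dtype=bool)
    # breadth-first search from atom 0
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in np.nonzero(adj[i])[0]:
            if j not in seen:
                seen.add(int(j))
                queue.append(int(j))
    return len(seen) == n

# two atoms 1.0 A apart -> connected; a third atom 10 A away -> disconnected
print(is_connected([[0, 0, 0], [1, 0, 0]]))              # True
print(is_connected([[0, 0, 0], [1, 0, 0], [10, 0, 0]]))  # False
```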

Looking forward to your reply,
qinhaigen
NiklasGebauer commented 1 year ago

Hello @qinhaigen ,

if the generated structures are random clusters of atoms, the model has not learned anything useful. In this case, I think it is caused by the small training data set. It only contains 1k molecules, whereas the default hyper-parameters were chosen from training runs on QM9 with 50k training structures. Furthermore, the model was only trained for one epoch in the cli.log you sent; this is definitely not enough to obtain meaningful results. I am not sure if you can train a model on large molecules with so few examples, but if you want to try, I'd suggest the following:

If possible, increasing the number of molecules in the training data set should definitely help.

Hope this helps! Best regards, Niklas

qinhaigen commented 1 year ago

Hi @NiklasGebauer ,

Thank you very much for your valuable feedback. As you suggested, I increased the training data from 1000 to 55k molecules and trained an unconditional model for more epochs. After 28 epochs, 5032 out of 10000 attempted molecules were successfully generated, and 4884 were successfully converted to SMILES, including only 52 disconnected ones. The updated model works very well. Next, I will try training and generating with conditions, and I hope it works as well. Looking forward to discussing other issues with you next time!
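In numbers, just the rates implied by the counts above:

```python
# counts reported after 28 epochs of training on the 55k data set
attempted, generated, converted, disconnected = 10000, 5032, 4884, 52

print(f"generation yield:   {generated / attempted:.1%}")                     # 50.3%
print(f"SMILES conversion:  {converted / generated:.1%}")                     # 97.1%
print(f"connected fraction: {(converted - disconnected) / converted:.1%}")    # 98.9%
```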

Best wishes, qinhaigen

NiklasGebauer commented 1 year ago

Hi @qinhaigen ,

perfect, glad you are getting good results now! The model can probably get even better by training longer. On QM9 training usually runs for 150-300 epochs (depending on the target properties). Feel free to open an issue if problems occur.

Best regards, Niklas