atomistic-machine-learning / schnetpack-gschnet

G-SchNet extension for SchNetPack
MIT License

generated molecules have broken chain structures? #8

Closed qinhaigen closed 1 year ago

qinhaigen commented 1 year ago

Thanks for sharing your great work!

I'm confused about the generated compounds, which contain many broken chain structures, whether I train for 1 epoch or 120 epochs with a custom dataset. I use RDKit to extract xyz coordinates from the db file, but the converted molecules contain dots, e.g. [CH].[CH][C][CH].[C].[C][CH][CH]N[O].[C][C].[C][C]OC1([C])OCO1.[H].[H].[O]. What can I do to solve this problem?
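(For context: a dot in a SMILES string separates disconnected fragments, so counting dot-separated parts is a quick way to flag broken outputs. `count_fragments` below is just an illustrative helper, not part of schnetpack-gschnet or RDKit.)

```python
def count_fragments(smiles: str) -> int:
    """Count disconnected fragments in a SMILES string.

    A "." only ever appears at the top level of a SMILES string as a
    fragment separator, so a plain split is sufficient for this check
    (no RDKit required). A result > 1 means the molecule is broken.
    """
    return len(smiles.split("."))

print(count_fragments("CCO"))                   # 1 -> connected
print(count_fragments("[C][C]OC1([C])OCO1"))    # 1 -> connected
print(count_fragments("[CH].[CH][C][CH].[C]"))  # 3 -> broken chain
```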
NiklasGebauer commented 1 year ago

Hi @qinhaigen ,

This can have many causes, as there are several points where things could go wrong.

Hope this helps! Best regards, Niklas

qinhaigen commented 1 year ago

Hi @NiklasGebauer,

Thank you very much for your valuable feedback and suggestions!

  1. The generated structures are more or less random clusters of atoms.
  2. Out of the 100 molecules generated after one epoch, 52 are unconnected, and more than half of the molecules generated after 120 epochs also show chain breakage.
  3. As you said, I need to customize a filter to keep only valid molecules.
  4. I attempted to build a dataset from the custom training data you provided; my conversion algorithm produced SMILES without any problem, and viewing the 3D structures with the code you posted also worked. In training and generation I use custom conditions that are not among the ones you provide, such as qed and logp. For conditions without units (not present in the ase module), I use the following code to prepare the required training data (containing H, C, N, O, F):
    
    import numpy as np
    from ase import Atoms
    from schnetpack.data import ASEAtomsData

    mol_list = []
    property_list = []
    for xyz in xyz_files:
        # get molecule information with your custom read_molecule function
        atom_positions, atom_types, property_values = read_molecule(xyz)
        # create ase.Atoms object and append it to the list
        mol = Atoms(positions=atom_positions, numbers=atom_types)
        mol_list.append(mol)
        # create a dictionary that maps property names to property values
        # note that the values need to be numpy float arrays (even for scalars)
        properties = {
            "qed": np.array([float(property_values)]),
        }
        property_list.append(properties)

    # create an empty data base with the correct format;
    # make sure to provide the correct units of the positions and properties
    custom_dataset = ASEAtomsData.create(
        "./data/chembl_1000.db",         # where to store the data base
        distance_unit="Angstrom",        # unit of positions
        property_unit_dict={"qed": ""},  # units of properties (qed is dimensionless)
    )

    # write gathered molecules and their properties to the data base
    custom_dataset.add_systems(property_list, mol_list)


5. For generation, I use a condition (qed=1.0) within the value range of the training data (qed: 0.5~1.0).
6. For training, the data set consists only of valid, connected molecules. Most of the parameters are unmodified, except for changing 1.7 to 1.9. I'd be happy to provide the cli.log (in the attachment): [cli.log](https://github.com/atomistic-machine-learning/schnetpack-gschnet/files/11402721/cli.log). If you discover anything, I hope you can share it with me. The number of heavy atoms per molecule ranges from 10 to 40, and the total number of atoms from 10 to 68. I don't know whether molecules of such different sizes cause this problem.
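As a sketch of the filter mentioned in point 3, connectivity of a generated 3D structure can be checked with a distance-based breadth-first search. The 1.9 Å cutoff below mirrors the value mentioned above but is an illustrative assumption; a real filter should use per-element covalent radii:

```python
import numpy as np
from collections import deque

def is_connected(positions, cutoff=1.9):
    """Return True if every atom is reachable from atom 0 when two
    atoms closer than `cutoff` (in Angstrom) are treated as bonded."""
    pos = np.asarray(positions, dtype=float)
    n = len(pos)
    # pairwise distance matrix, with self-distances excluded from adjacency
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    adj = (dists < cutoff) & ~np.eye(n, dtype=bool)
    # breadth-first search from atom 0
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in np.nonzero(adj[i])[0]:
            if j not in seen:
                seen.add(int(j))
                queue.append(int(j))
    return len(seen) == n

# two atoms 1.0 A apart -> connected; a third atom 10 A away -> disconnected
print(is_connected([[0, 0, 0], [1, 0, 0]]))              # True
print(is_connected([[0, 0, 0], [1, 0, 0], [10, 0, 0]]))  # False
```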

Looking forward to your reply,
qinhaigen
NiklasGebauer commented 1 year ago

Hello @qinhaigen ,

if the generated structures are random clusters of atoms, the model has not learned anything useful. In this case, I think it is caused by the small training data set. It only contains 1k molecules, whereas the default hyper-parameters were chosen from training runs on QM9 with 50k training structures. Furthermore, the model was only trained for one epoch in the cli.log you sent; this is definitely not enough to obtain meaningful results. I am not sure if you can train a model on large molecules with so few examples, but if you want to try, I'd suggest the following:

If possible, increasing the number of molecules in the training data set should definitely help.

Hope this helps! Best regards, Niklas

qinhaigen commented 1 year ago

Hi @NiklasGebauer ,

Thank you very much for your valuable feedback. As you suggested, I increased the training data from 1000 to 55k molecules and trained an unconditional model for more epochs. After 28 epochs, 5032 out of 10000 attempted molecules were successfully generated, and 4884 were successfully converted to SMILES, including only 52 disconnected ones. The updated model works very well. Next, I will try training and generating with conditions, and I hope it works as well. Looking forward to discussing other issues with you next time!
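In numbers, just the rates implied by the counts above:

```python
# counts reported after 28 epochs of training on the 55k data set
attempted, generated, converted, disconnected = 10000, 5032, 4884, 52

print(f"generation yield:   {generated / attempted:.1%}")                     # 50.3%
print(f"SMILES conversion:  {converted / generated:.1%}")                     # 97.1%
print(f"connected fraction: {(converted - disconnected) / converted:.1%}")    # 98.9%
```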

Best wishes, qinhaigen

NiklasGebauer commented 1 year ago

Hi @qinhaigen ,

perfect, glad you are getting good results now! The model can probably get even better by training longer. On QM9 training usually runs for 150-300 epochs (depending on the target properties). Feel free to open an issue if problems occur.

Best regards, Niklas