atomistic-machine-learning / schnetpack-gschnet

G-SchNet extension for SchNetPack

Some confusion about training other properties on custom data #13

Closed: ldhkc closed this issue 2 months ago

ldhkc commented 3 months ago

Dear Dr. Niklas Gebauer,

I've recently been using your project to train on a custom dataset and am trying to apply it to new properties. In the process, I've run into some questions about property training, in particular how to handle custom properties that are not included in schnetpack's properties.py file.

My understanding is that training mainly relies on the atomic coordinates and atomic numbers of the molecules, and that schnetpack is usually trained on its built-in properties (such as forces, energy, energy gap, etc.). Therefore, I would like to ask a few questions:

  1. Handling of custom properties: If the data I want to train on contains custom properties that are not defined by schnetpack, do I need to modify some .py files in the schnetpack package (for example, define a custom dataset class like the one in qm9.py)? Or is it enough to convert the custom dataset into the ASE database format and add all the necessary information (atomic coordinates, atomic numbers, and custom properties)?
  2. Adjustment of configuration files: Following your tutorial, I understand that I do not need to modify any files inside schnetpack. My plan is to convert the custom dataset into the ASE database format, adding all the atomic coordinates, atomic numbers, and custom properties, and then to use the template config file you provide and fill in the sections marked ??? based on the comments. Is this understanding correct?
  3. Property prediction process: Before a model can predict custom properties, it needs to correctly identify them. Could you elaborate on how the model produces its predictions? Specifically, how are the predicted property values output?
  4. Verifying the accuracy of predictions: Can I generate molecules with a model trained on a custom dataset and then select a subset of them to verify the accuracy of the predictions with quantum chemical calculations? I'm a bit confused by this property prediction process and would appreciate your guidance. Thank you very much for taking the time to read my questions; I look forward to your reply.

Best wishes,

ldhkc

NiklasGebauer commented 3 months ago

Hi @ldhkc ,

sorry for the delayed reply. I assume that by the prediction process you mean sampling molecules with specific target properties, because cG-SchNet does not predict property values. Instead, it takes property values as additional inputs to the model, which influence the predicted probabilities of atom types and atom positions during the sampling of molecules. In this way, we can sample molecules that are likely to exhibit the properties we provided to the model as input.

For example, if we give a HOMO-LUMO gap of 5 eV to the model, it should sample many structures with a HOMO-LUMO gap close to 5 eV. However, the model will not give you a predicted HOMO-LUMO gap value. If you want to predict the HOMO-LUMO gap of a given structure instead, you can train a SchNet or PaiNN model with the main package schnetpack (https://github.com/atomistic-machine-learning/schnetpack).

Regarding the specific questions:

  1. You do not have to change any .py files. You only need to convert your database into the ASE format. However, you need to make sure that the metadata of the database file contains entries for both a _property_unit_dict and a _distance_unit. For example, if your dataset has structures with coordinates in Angstrom and a custom attribute called my_attribute with unit eV, the metadata would look like: {'_property_unit_dict': {'my_attribute': 'eV'}, '_distance_unit': 'Ang'}. A minimal sketch of how such a database can be written is given after this list, and a more detailed example is given in the readme: https://github.com/atomistic-machine-learning/schnetpack-gschnet?tab=readme-ov-file#using-custom-data
  2. Yes, exactly. As mentioned in 1., it is mandatory that you list all attributes of the molecules in the database in the _property_unit_dict in the metadata. Otherwise, our code will not be able to access your custom properties.
  3. As mentioned above, cG-SchNet does not predict property values. However, the input properties need to be embedded into a vector space in cG-SchNet. So for every set of custom attributes that you want to use for training cG-SchNet, you need to create a conditioning config file that specifies how these attributes are embedded in the model. In the default template config file, there is no conditioning (i.e. you train without target attributes). So after writing your custom conditioning config my_conditioning.yaml, you have to add it to the template config: just replace the line - override /model/conditioning: null with - override /model/conditioning: my_conditioning (see the second sketch after this list). If your attribute is a scalar, we have implemented an embedding using a Gaussian expansion of the scalar value, and we also have a module for vector-valued attributes (e.g. fingerprints). Only if your attribute has a very specific shape or meaning might you need to invent another way to embed it; in that case you would have to adjust the python code by writing your own ConditionEmbedding. You can find more details on the conditioning configs in the readme: https://github.com/atomistic-machine-learning/schnetpack-gschnet/blob/main/README.md#specifying-target-properties
  4. Yes, this is the proper way to assess whether the trained model works. However, the evaluation with quantum chemical calculations is not part of this package. You will have to take the .db file of generated molecules and apply your own code to run the calculations for verification. Alternatively, you can also train a SchNet or PaiNN model on your dataset as a surrogate, either for a quick evaluation or for filtering out the most interesting generated molecules before more costly evaluations. I am currently working on a script that filters invalid molecules out of the database of generated structures using rdkit, but it will take a bit longer until I release it here. Note that this will only filter out generated nonsense; it will not provide attribute values.
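
To make 1. more concrete, here is a minimal sketch of how such a database can be written with the ASEAtomsData class from the main schnetpack package, following the schnetpack data-preparation tutorial. The file name, property name, and values are placeholders for illustration:

```python
from ase import Atoms
import numpy as np
from schnetpack.data import ASEAtomsData

# Create a new database; the distance unit and the property units are stored
# in the metadata as _distance_unit and _property_unit_dict.
dataset = ASEAtomsData.create(
    "./my_dataset.db",                          # placeholder path
    distance_unit="Ang",
    property_unit_dict={"my_attribute": "eV"},  # use "" if the property has no unit
)

# Each molecule is an ase.Atoms object plus a dict with its property values.
atoms = Atoms(
    numbers=[8, 1, 1],
    positions=[[0.00, 0.00, 0.00], [0.96, 0.00, 0.00], [-0.24, 0.93, 0.00]],
)
properties = {"my_attribute": np.array([5.0])}  # dummy value for illustration

dataset.add_systems([properties], [atoms])
```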
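
And for 3., this is roughly how the change in the defaults list of your copy of the template experiment config looks (my_conditioning is a placeholder name, and the other entries of the defaults list are omitted here):

```yaml
defaults:
  # ... keep the other entries of the defaults list unchanged ...
  - override /model/conditioning: my_conditioning  # was: null
```

The my_conditioning.yaml file itself is best modeled on the conditioning examples described in the README section linked above.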

Hope this is helpful! Best, Niklas

NiklasGebauer commented 3 months ago

PS: If your attribute does not have a unit or the unit does not exist in ASE, you can also write "my_attribute": "" in the _property_unit_dict. But the attribute has to be listed there in any case.

ldhkc commented 3 months ago

@NiklasGebauer Thank you very much for taking the time to give such a detailed response; it has resolved many of my doubts, and I appreciate your patience. After reading your reply, I still have some questions that need further clarification:

  1. Regarding the metadata you mentioned: is this the metadata discussed in the ASE database documentation, specifically as shown in the attached metadata screenshot?
  2. Regarding the paper "Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules": I ran the code following the help documentation you provide on GitHub, and it automatically generated a molecular data file named qm9.db. When I open this file with DB Browser for SQLite, the atomic coordinates and atomic numbers are stored in the table named systems, while the molecules' property data is stored in the cells of the data column of that table (marked in red in the attached property_1 screenshot). When you mention the metadata, do you mean that the molecule properties and their units should be stored in this location, or should they be stored in the table called information (as shown in the attached property_2 screenshot)?
  3. You mentioned {'_property_unit_dict': {'my_attribute': 'eV'}, '_distance_unit': 'Ang'}, and I am currently using only one attribute. Would it be sufficient to store it in the format {'my_attribute': 'my_attribute_unit'} in the location discussed in the second question? If I have multiple attributes, should I store them in the following format: {'my_attribute_1': 'my_attribute_unit_1', 'my_attribute_2': 'my_attribute_unit_2'}?
  4. Regarding your fourth reply: does the script you are developing determine the validity of a molecule based on connectivity and valence checks of the molecular structure, or do you add further criteria on top of that?

I look forward to your answers at your convenience, and thank you very much for your help and support.

NiklasGebauer commented 3 months ago

Dear @ldhkc ,

  1. Yes, this is exactly the metadata I was talking about. This is where you need to add the _property_unit_dict and the _distance_unit.
  2. The qm9.db file is a good starting point; it has exactly the required format, and you should store your data in the same way. The atomic coordinates and atomic numbers of the molecules are stored as systems, and the data field contains the actual attribute values of each molecule, so yes, you need to store the property values there. The information part contains only the metadata, i.e., the _property_unit_dict and the _distance_unit, not the property values of each molecule. You can also use schnetpack to write your dataset in the correct format following our tutorial: https://schnetpack.readthedocs.io/en/latest/tutorials/tutorial_01_preparing_data.html#Preparing-your-own-data
  3. Yes, this is correct. If you have multiple attributes, the metadata would look like this: {'_property_unit_dict': {'my_attribute_1': 'my_attribute_unit_1', 'my_attribute_2': 'my_attribute_unit_2'}, '_distance_unit': 'my_distance_unit'}.
  4. The script checks the valence of the molecules based on the connectivity obtained with rdkit's rdDetermineBonds, which is an implementation of xyz2mol (a rough sketch of this kind of check is given below). It can also check the uniqueness and novelty of molecules by comparing the obtained SMILES strings. Aside from that, I did not add any other metrics or ways of evaluation.
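
To illustrate the kind of check described in 4. (this is only a rough sketch, not the actual script, and it assumes the generated molecule is available as a neutral XYZ block):

```python
from rdkit import Chem
from rdkit.Chem import rdDetermineBonds


def check_validity(xyz_block: str, charge: int = 0):
    """Return the canonical SMILES if bonds and valences can be assigned, else None."""
    mol = Chem.MolFromXYZBlock(xyz_block)  # atoms and positions only, no bonds yet
    if mol is None:
        return None
    try:
        # xyz2mol-style bond perception; raises a ValueError if no valid
        # bond assignment is found for the given total charge
        rdDetermineBonds.DetermineBonds(mol, charge=charge)
    except ValueError:
        return None
    return Chem.MolToSmiles(mol)
```

Uniqueness and novelty can then be assessed by comparing the returned canonical SMILES strings, as described above.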

Best, Niklas

ldhkc commented 3 months ago

@NiklasGebauer Thank you very much for your detailed response; it has helped me resolve many of my doubts. I may need to ask again if further questions come up. Thanks again for your excellent help and support.